The goal of our project is to identify and create a prioritized list of common but unsequenced members of the microbiome for whole genome sequencing. We assert that a modest sequencing effort (on the order of one hundred “most wanted” taxa) combined with existing databases will result in genome sequences being available for a large majority of the most common microbial taxa present in the human microbiome. Generation of such a resource will assist ongoing efforts to understand how pathways encoded in microbial genomes contribute to human health and disease phenotypes, and will make phylogenetic assignment more tractable for short-read whole-genome metagenomic experiments.
The number of OTUs determined to be “high priority”, “medium priority” or “low priority” for full genome characterization.
We were able to achieve a simplified view of the human microbiome in which a modest number of taxa (on the order of 1,000) captured ~95% of all the V1–V3 and V3–V5 HMP sequences. The majority of the sequences not contained in an OTU were chimeric, suggesting a high rate of error in unincorporated sequences. For this reason, we ignored all sequences not incorporated into an OTU. Removal of OTUs that had chimeric consensus sequences (see Document S1) further simplified our view of the taxa present in the HMP OTUs, leaving on the order of 800 non-chimeric OTUs for each of the V1–V3 and V3–V5 sequence sets. These non-chimeric OTUs generally had a very close match in the Silva database. Because the Silva database largely reflects uncultured taxa, this is unsurprising. The most prevalent OTUs found in the HMP V1–V3 dataset were clearly also present in stool, saliva and vaginal samples from other cohorts. It was, therefore, clear that many of the same taxa occur across different subjects in multiple cohorts. Although the 16S sequences representing these taxa were repeatedly observed across different experiment sets, many of these taxa have not yet been captured in culture collections or characterized with whole genome sequencing.
The initial observations based on 454 pyrosequencing reported what appears to be near-infinite diversity in environmental habitats. The degree to which such diversity reflects rare sequencing errors and chimerism currently remains unclear. Because our study utilized the program AbundantOTU, which requires construction of a consensus sequence from multiple reads in order to form an OTU, our analysis path deliberately avoided rare taxa. Our study is, therefore, neutral to the question of whether the rare biosphere represents true novel taxa or sequencing or PCR error. Moreover, our assignment of any taxon that was not seen in at least 20% of all samples from any body habitat to the “low priority” group further weights our priority lists against taxa that are not highly prevalent. We assert that emphasizing the sequencing of the most prevalent taxa first represents a rational deployment of sequencing resources. Of course, a limitation of this or any study that relies on the 16S rRNA view of a microbial community is that this one gene may not perfectly reflect the content of the rest of the genome or the evolutionary distance between organisms. Genome sequences will assist in this regard.
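The prevalence rule described above can be illustrated with a minimal sketch. It is not the pipeline used in this study: the input layout (`otu_counts`, `sample_habitat`) and the function names are hypothetical, and only the 20% cutoff comes from the text. An OTU is flagged “low priority” when it fails to reach the cutoff in every body habitat.

```python
from collections import defaultdict

PREVALENCE_CUTOFF = 0.20  # the 20% threshold stated in the text


def max_habitat_prevalence(otu_counts, sample_habitat):
    """Return {otu: highest fraction of samples containing it in any one habitat}.

    otu_counts:     {otu_id: {sample_id: read_count}}  (hypothetical layout)
    sample_habitat: {sample_id: habitat_name}          (hypothetical layout)
    """
    # Number of samples collected from each body habitat.
    samples_per_habitat = defaultdict(int)
    for habitat in sample_habitat.values():
        samples_per_habitat[habitat] += 1

    result = {}
    for otu, counts in otu_counts.items():
        # Number of samples per habitat in which this OTU has at least one read.
        present = defaultdict(int)
        for sample, n_reads in counts.items():
            if n_reads > 0:
                present[sample_habitat[sample]] += 1
        result[otu] = max(
            (present[h] / total for h, total in samples_per_habitat.items()),
            default=0.0,
        )
    return result


def is_low_priority(prevalence, cutoff=PREVALENCE_CUTOFF):
    """True when an OTU misses the prevalence cutoff in every habitat."""
    return prevalence < cutoff


# Toy data (invented for illustration): six stool and two saliva samples.
sample_habitat = {f"s{i}": "stool" for i in range(1, 7)}
sample_habitat.update({"s7": "saliva", "s8": "saliva"})
otu_counts = {
    "otu_a": {"s1": 12, "s7": 3},  # seen in 1/6 stool and 1/2 saliva samples
    "otu_b": {"s2": 1},            # seen in 1/6 stool samples only
}
prevalence = max_habitat_prevalence(otu_counts, sample_habitat)
low_priority = {otu for otu, p in prevalence.items() if is_low_priority(p)}
```

Taking the maximum over habitats means a taxon that is common in a single body site is retained even if it is absent elsewhere, which matches the “any body habitat” wording of the rule.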
In this paper, we used percent identity from a global alignment as our metric to compare a query sequence to a reference database. Percent identity has some obvious advantages over other metrics. First, it is easy to calculate and makes intuitive sense, even to those without a background in phylogeny. Second, a recent paper showed that percent identity based on global alignments yielded more accurate matches to reference databases than a local alignment strategy based on best BLAST hit. There are, however, obvious disadvantages to the use of percent identity as a distance metric as well. Percent identity does not correct for different rates of evolution in different regions of the 16S sequence. This likely explains why we observed more “most wanted” V1–V3 OTUs than V3–V5 OTUs, because the rate of evolution of V1–V3 is known to be more rapid. An approach based on phylogenetic trees might have corrected for these sorts of differences by normalizing the background rate of evolution. We assert, however, that our collection of “most wanted” OTUs would be similar even if we had taken such an approach. When we used the phylogenetic tree-based method, pplacer, to place the HMP OTUs into a reference tree of sequenced taxa, we observed that the set of HMP OTUs most distant from sequenced taxa on this tree overlapped highly with the “most wanted” taxa based on the global alignment criteria (data not shown). We are confident, therefore, that our results are not fundamentally a product of our choice of distance metric.
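To make the metric concrete, the following minimal sketch computes percent identity from a Needleman-Wunsch global alignment. It is an illustration only and not the alignment software used in this study. The scoring values (match +1, mismatch -1, linear gap -1) are assumptions, as is the convention of dividing the number of identical positions by the total number of alignment columns, gaps included; other conventions for the denominator exist and give different values.

```python
def global_percent_identity(a, b, match=1, mismatch=-1, gap=-1):
    """Percent identity of one optimal global alignment of sequences a and b.

    Scoring parameters are illustrative assumptions, not values from the paper.
    Identity = identical columns / all alignment columns (gaps included) * 100.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the dynamic-programming score matrix.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align the two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back one optimal path, counting identical columns and all columns.
    i, j = n, m
    identical = columns = 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            identical += a[i - 1] == b[j - 1]
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
        columns += 1

    return 100.0 * identical / columns if columns else 0.0
```

For example, `global_percent_identity("ACGT", "ACGA")` aligns four columns with three identical positions and returns 75.0. Because every column is weighted equally, the value does not account for region-specific substitution rates, which is the limitation discussed above.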
The HMP cohort was designed to measure the variation within healthy individuals. We would expect, therefore, that some disease-associated pathogenic taxa are not prevalent within the HMP cohort. We anticipate that future sequencing efforts will capture the genomes of these disease-associated microbes. As the cost of sequencing continues to decrease, and Illumina sequencing of 16S sequences becomes more common, the number of sequences per sample will increase well beyond the ~6,000 sequences seen on average in HMP samples. In these future sequencing experiments, some low-abundance taxa that were not regularly detected at the sequencing depths of the 454-based HMP OTUs may appear as more highly prevalent. Nonetheless, given the current view of the human microbiome that is generated with the HMP OTUs through 454 sequencing technology, we assert that sequencing our list of high priority taxa is a reasonable use of resources to fill in the gaps of the phylogenetic tree representing the human microbiome.
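The effect of sequencing depth on the detection of low-abundance taxa can be sketched with a simple sampling model. The model is an assumption made for illustration and is not an analysis performed in this study: reads are treated as independent draws, so a taxon with relative abundance p is observed at least once among n reads with probability 1 - (1 - p)^n. The depth of ~6,000 reads comes from the text; the deeper depth and the abundance value below are hypothetical.

```python
def detection_probability(relative_abundance, reads):
    """Probability of seeing a taxon at least once among `reads` sequences.

    Assumes independent sampling of reads (an idealization that ignores
    PCR bias, primer mismatch and sequencing error).
    """
    if not 0.0 <= relative_abundance <= 1.0:
        raise ValueError("relative_abundance must be between 0 and 1")
    if reads < 0:
        raise ValueError("reads must be non-negative")
    return 1.0 - (1.0 - relative_abundance) ** reads


# A taxon at 0.01% relative abundance (hypothetical value).
p_hmp_depth = detection_probability(1e-4, 6_000)    # HMP-like depth from the text
p_deep = detection_probability(1e-4, 100_000)       # hypothetical deeper run
```

Under these assumptions, the taxon is missed in roughly half of samples at the HMP-like depth but is detected in nearly all samples at the deeper depth, so its apparent prevalence across samples rises even though its true abundance is unchanged.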
The National Institutes of Health (NIH) is actively supporting the development of new culture- and single cell-based methods for bringing “most wanted” organisms to the sequencer. So far, the results have been very promising; the HMP is currently sequencing new isolates and single cells representing priority organisms. Though these efforts will continue, we appeal to the broader community to use the “most wanted” list, available at http://hmpdacc.org/most_wanted, to expand culture and genome collections to include these elusive members of the healthy microbiome. Finally, we believe that the simplified analysis path used to create the “most wanted” list can also be used to measure and direct progress of whole genome sequencing and culturing efforts for ongoing and future microbiome-related studies, human-related or otherwise.