We discovered a number of new bacterial taxa, all but three of which were represented in multiple subjects from a larger set of stool samples from over 80 healthy individuals. These findings establish that bacterial taxa remain to be discovered, even in well-studied microbial communities such as the healthy human gut. We also showed that the number of false-positive novel sequences identified in the 11 samples would have been two orders of magnitude higher than the true number of novel taxa without validation by multiple datasets. For example, the discovery of novel bacterial taxa using metagenomic 16S rDNA sequence data was complicated by artifacts, such as sequencing errors and chimerism, which generated false signals of novelty. This study highlights the importance of establishing rigorous standards for the identification of novel taxa in metagenomic data. The pipeline we developed is an important first step towards establishing experimental and bioinformatics criteria for these standards.
Specifically, we found that leveraging Illumina WGS data to validate potentially novel Roche variable region and WGS 16S rDNA sequences enabled, with a high level of confidence, the identification of novel taxa in stool samples from healthy subjects. The use of independently derived datasets to confirm sequences allowed us to avoid over-estimating the fraction of novel sequences within the datasets; instead, we identified a small subset of sequences that warrant further investigation. Overall, we estimated that less than 0.07% of high-quality, non-redundant 16S rDNA sequences are truly novel, based on our strict definition of novel as sequences that are at most 97% identical to any sequence in public databases.
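The two-part criterion described above (divergence from public databases plus confirmation in an independently derived dataset) can be illustrated with a minimal sketch. This is not the pipeline used in the study. The `Candidate` record, its field names, and the boolean confirmation flag are hypothetical stand-ins for the results of a database search and a cross-dataset comparison; only the 97% threshold comes from the text.

```python
from dataclasses import dataclass
from typing import Sequence

# Threshold taken from the text: novel means at most 97% identical
# to any sequence in public databases.
NOVELTY_MAX_IDENTITY = 97.0


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one non-redundant 16S rDNA sequence."""
    seq_id: str
    best_reference_identity: float       # percent identity to nearest database hit
    confirmed_by_independent_data: bool  # assumed flag, e.g. supported by Illumina WGS reads


def is_potentially_novel(c: Candidate) -> bool:
    # Divergence alone only makes a sequence a candidate; it may still be
    # a chimera or carry sequencing errors.
    return c.best_reference_identity <= NOVELTY_MAX_IDENTITY


def is_validated_novel(c: Candidate) -> bool:
    # A sequence counts as novel only if an independent dataset confirms it.
    return is_potentially_novel(c) and c.confirmed_by_independent_data


def novel_fraction(candidates: Sequence[Candidate]) -> float:
    """Fraction of all sequences that are both divergent and validated."""
    if not candidates:
        return 0.0
    return sum(is_validated_novel(c) for c in candidates) / len(candidates)
```

Keeping the two tests separate makes it possible to report both the potentially novel and the validated counts, which is the comparison that exposes the false-positive rate.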
The false positives that we eliminated based on our rigorous validation requirements included sequences that appeared novel because of chimerism and sequencing errors. Of the 16S rDNA sequences with less than 97% identity to 16S rDNA reference sequences, less than 1% of the Roche variable region sequences were validated, whereas 25% of the Roche WGS sequences were validated. This discrepancy in the percentage of reads validated in each dataset is due to the presence of chimeric sequences in the Roche variable region data. The Roche variable region sequences had been screened for chimeras, but chimeric sequences were still abundant in the subset of sequences that appeared to be novel. Of the potentially novel sequences, the chimeric sequences were the most divergent from a nearest neighbor and were frequently classified poorly by RDP. Thus, an analysis of novel sequences without our validation steps would have grossly overestimated the number of novel sequences and the taxonomic level at which diversity was found. Sequencing errors (insertions, deletions, miscalled or uncalled bases) were also a source of false-positive novel sequences. We did not expect or detect chimeric sequences in the Roche WGS dataset, and the false-positive novel sequences in this case were due predominantly to sequencing errors. While sequencing errors can cause sequences to appear more divergent from a nearest neighbor, small errors do not affect RDP classification. It is likely that including sequences with sequencing errors would have slightly inflated our estimation of novel sequences, but the diversity still would have appeared at the genus or subgenus level. Eliminating these artifacts by using Illumina WGS sequences for validation allows us to more accurately define the degree of novelty and the characteristics of the novel taxa, including their distribution among individuals, abundance, and taxonomic classification.
Using the Illumina WGS data to evaluate the Roche variable region datasets establishes a database-independent method to detect chimeric sequences. Although the Roche V1–3 and V3–5 sequences had been screened with Chimera Slayer, some chimeric sequences were not identified because Chimera Slayer (and other database-dependent methods of chimera detection) only detects chimeras with components that are similar to the sequences in the reference database. Chimeric sequences can be very abundant in 16S rDNA amplicon datasets, accounting for more than half of the diversity in OTUs in some samples, and the same chimera can be formed in independent amplifications of the same sample. Chimeric 16S rDNA sequences are likely to be deposited in public databases at a relatively high rate. Chimeric sequences confound 16S rDNA analyses that rely on accurate sequences for alignment, the generation of distance matrices and phylogenetic tree construction, and they also inflate the degree of novelty estimated in bacterial communities. Thus, novel methods of chimera detection, such as the one presented here, can improve the analysis of metagenomic data.
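The idea behind database-independent chimera detection can be sketched as follows. The sketch rests on assumptions that go beyond the text: that shotgun (WGS) reads from the same sample are free of PCR chimeras, and that an amplicon is therefore suspect if some stretch of it, such as the junction between two parent sequences, is not supported by any read. The function name, the window size, and the use of exact substring matching are illustrative simplifications; a realistic implementation would use alignment with a mismatch tolerance, and this is not the procedure used in the study.

```python
from typing import Iterable, List


def unsupported_windows(candidate: str, reads: Iterable[str], window: int = 8) -> List[int]:
    """Return start positions of windows in `candidate` found in no read.

    Each window of length `window` along the candidate must occur as an
    exact substring of at least one read from the independent dataset.
    Windows spanning a chimeric junction join two parents that never occur
    together in a real genome, so no shotgun read should contain them.
    """
    reads = list(reads)
    positions = []
    for start in range(len(candidate) - window + 1):
        kmer = candidate[start:start + window]
        if not any(kmer in read for read in reads):
            positions.append(start)
    return positions


# Toy usage: two parent sequences are present in the independent reads,
# and the chimera joins the first half of one to the second half of the other.
parent_a = "AAAACCCCGGGGTTTT"
parent_b = "TGCATGCATGCATGCA"
independent_reads = [parent_a, parent_b]
chimera = parent_a[:8] + parent_b[8:]

print(unsupported_windows(parent_a, independent_reads))  # no unsupported windows
print(unsupported_windows(chimera, independent_reads))   # windows spanning the junction
```

Note that a sequencing error would also leave windows unsupported under this exact-match rule, so the check flags sequences that cannot be validated and does not by itself distinguish chimeras from error-containing reads.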
While several distinct clusters of reads were found across the bacterial tree of life, these sequences represent the diminishing portion of the human gut microbiome that remains unobserved. Earlier studies estimated that proportion was much higher. The novel taxa are of relatively low abundance (<1% of the total reads in a sample), but some are found in multiple individuals (0.5–20% of individuals at this depth of sequencing), and this recurrence suggests they are endemic to the human microbiome. Low-abundance organisms may have important roles in microbial communities. Dominant organisms in a community can be used to broadly define microbial habitats in the human body (The Human Microbiome Consortium (2012) ‘Structure, Function and Diversity of Human Microbiome in an Adult Reference Population’. Nature: doi:10.1038/nature11234) and even microbial community classes within individuals (Zhou et al, manuscript in preparation). However, even minor components of bacterial communities may influence human health or can overgrow under certain conditions to cause disease. The great diversity in these low-abundance organisms may allow us to further distinguish individuals by their bacterial communities or ascribe functional characteristics related to human health and disease.
We detected novel taxa in the three most prevalent phyla in stool samples: Bacteroidetes, Firmicutes, and Proteobacteria. Most of the diversity we detected was at the genus or sub-genus level, but a few novel 16S rDNA reads were divergent from all other observed sequences at a higher taxonomic level, potentially as high as the family level. The novel sequences identified here indicate that there are undiscovered bacterial taxa. However, the functional capabilities and other characteristics of these bacteria remain to be determined by whole-genome sequencing, by isolation and culturing of the organisms, and by studying the role of these bacteria within their communities. The novel 16S rDNA sequences can be used to target specific bacteria for whole-genome sequencing, which will give us insight into the gene functions these uncharacterized organisms contribute to the gut microbial community. Indeed, many of the novel OTUs are related to organisms that have already been identified by the HMP reference genome sequencing effort and are sequenced or in the process of being sequenced (http://www.hmpdacc-resources.org/hmp_catalog/main.cgi). For example, isolates from Barnesiella, Desulfovibrio, Dorea and Turicibacter are all included in the collection of HMP microbial reference genomes.
The two most prevalent novel taxa we detected are in the genus Barnesiella. OTU 12 was found in a single subject and OTU 13 was found in eight subjects. These two OTUs were found only in samples that were collected in Houston, suggesting they may be suited to bacterial communities related to specific environmental or dietary factors. Barnesiella was recently discovered in samples from the chicken gut and in human stool samples. Additional studies are needed to determine the roles of Barnesiella species in healthy communities and communities affected by diet or disease.
Some of the other novel taxa we detected in these stool samples are related to genera associated with shifts in the microbiome related to diet (Oscillibacter), colorectal adenoma (Dorea), and opportunistic infections (Desulfovibrio). More thorough characterization and whole-genome sequencing of the novel taxa we identified will be important to determine whether each of the novel taxa constitutes a new family, genus or species and whether they affect or fluctuate with human health conditions.
A better understanding of the effects of bacteria that are less abundant, less broadly distributed and less well characterized is important as we seek to understand the complex gut microbiome in healthy individuals and link changes in the microbiome to disease. Furthermore, the pipelines and analysis criteria we developed can be applied to the analysis of novel 16S rDNA sequences in other microbial communities or habitats, which may be less well characterized than stool. The HMP has generated sequence data from samples taken from oral, skin, nasal, and vaginal body sites in addition to stool, and we anticipate that the examination of the bacterial communities in those body sites may reveal diversity that has yet to be appreciated.