The human body is host to microbial communities (microbiome) whose abundances are estimated to exceed the number of human cells by at least an order of magnitude
[1]. These communities are thought to influence human physiology through processes related to development, nutrition, immunity, and resistance to pathogens
[2],
[3],
[4],
[5],
[6]. The HMP
[7] was initiated to probe the nature and extent of the microbial communities living in and on the human body of so called “normal” adult donors in an effort to better understand their role in human health and disease, thus providing a critical baseline for future metagenomic studies of the human microbiome.
Since the vast majority of microbes are as yet uncultured by current techniques
[8], molecular-based, culture-independent techniques, such as the use of 16 S profiling have provided important new insights into the diversity of the microbial world across a variety of environments
[9],
[10],
[11],
[12],
[13] including the human microbiome
[14],
[15],
[16],
[17]. The HMP has generated an unprecedented scale of 16 S profiles to investigate the microbial diversity of the human microbiome consisting of over 230 donors and up to 18 body habitats across the oral cavity, skin, distal gut (stool), and when applicable, vaginal body regions ().
| Table 1Diversity indices computed on all genera-based taxonomic units. |
Understanding microbial community diversity is a critical first component for studying the human microbiome in order to elucidate the distribution and assembly patterns of microbial communities across body habitats and individuals and to facilitate microbial ecological and biological studies
[15],
[18],
[19]. Characterization of microbial community diversity from 16 S sequence reads requires three basic steps. First, reads must be classified into distinct taxonomic units thus creating a taxonomic profile. Next, the taxonomic profile is assessed by quantifying richness, or number of taxonomic units present, and finally the evenness, based on the abundance of taxonomic units
[20]. Commonly, such profiles are generated using two general approaches based on the sequences acquired through each 16 S rDNA sample.
The first method is based on the assignment of sequences to a taxonomic hierarchy
[21]. For example, the Ribosomal Database Project (RDP) Classifier, uses a naïve Bayesian classifier to rapidly classify bacterial 16 S rRNA sequences into a higher-order taxonomy (Bergey’s Taxonomic Outline of the Prokaryotes
[22]). Since a reference sequence from the database is utilized, sequences with minor sequencing errors may still be properly associated. However, novel organisms that have not had their 16 S sequences included into the database may be misclassified or considered unknown.
The second approach is independent of taxonomic classification and uses sequence similarity to form clusters within a predefined percent similarity, for example, 97% similarity
[21]. All reads within each cluster are considered to be in the same operational taxonomic unit, or “OTU.” OTU-based taxonomic profiles are prone to under-clustering. This occurs when reads from the same organism are divided into multiple taxonomic units due to sequencing error, or when there is significant diversity of the 16 S copies within a single organism
[23],
[24]. Since OTU-based taxonomic profiling is database independent, two benefits of its utilization include potentially differentiating between strains of the same species, and generating OTUs for as yet uncharacterized organisms. Due to the shortcomings and advantages of each approach, both techniques offer alternative views of a sample’s taxonomic profile, thus being both complementary and confirmatory. When the combined approaches are compared, resulting inconsistent inferences may elucidate interesting aspects of the samples being analyzed.
After generation of the 16 S profile, the inventory of the microbial community within a sample is assessed by quantifying the richness (defined as the number of unique taxa present), the diversity (which combines the concepts of both richness and evenness), or the abundance of taxonomic units that have been accounted for by the taxonomic profile
[21]. Diversity indices are often calculated in community diversity studies as they represent the distillation of this information into a single positive real

number
[25] whose magnitude can be more easily compared.
Estimating the maximum richness S
max, in microbial community environments, is an area of ecological and biological interest. This information can also be used in a practical sense to generate sequencing coverage estimates and to determine the proportion of the microbial community that has been discovered as the sequencing effort is progressing
[26]. A commonly used visualization tool to represent the discovery rate of new taxonomic units as samples are taken is the rarefaction curve. The x-axis of the graph is labelled with the number of samples (or donors) that have been observed, and the y-axis is the number of unique taxonomic units that have been observed within the samples collected thus far. The instantaneous slope and height of the curve inform the analyst about both the expected discovery rate of new taxonomic units and how many taxonomic units have been discovered at a particular point. S
max is found at the height of the curve when the slope is zero, i.e., the sampling saturation point. In previous studies, due at least in part to the prohibitive cost of sampling and sequencing, rarefaction curves based on empirical data have not typically acquired a large number of samples. Therefore, for communities with great diversity, the instantaneous slope of the last point of sampling does not become zero, as not all taxonomic units have been observed. Thus, to predict the number of taxonomic units in the community based on a rarefaction curve that is still climbing, the use of extrapolation is a reasonable next step. This is a different approach than the estimation of S
max based on the combined distribution of taxa across all samples for a specific body site. In this latter scenario, parametric finite mixture models or non-parametric coverage-based estimates, as implemented in Catchall
[27], would be applicable. Efforts have been made to extrapolate the rarefaction curve with various parametric models, such as the poisson log-normal cumulative distribution function (CDF)
[28]. In general, parametric statistical techniques can produce more accurate, precise and robust estimates, with greater statistical power than nonparametric methods, provided that the model fits the data correctly. When these assumptions are valid, estimating constants for the underlying distribution functions can be performed with optimization algorithms. However, when the model is only approximate, sparse sampling compounds the unreliability of parametric curve fitting, resulting in overfitting and outcomes that are poorly predictive. As a result, these techniques have not been generally recommended. However, the scope of the HMP has provided the community with a large number of samples from a large set of donors and a broad range of body habitats, enabling the realistic exploration of parametric modelling techniques. In particular, for each body habitat rarefaction curve, values for S
max were calculated and compared by applying these curve fitting approaches. Four well-understood CDFs, log-normal, gamma, Pareto
[29], and Fréchet
[30] were chosen for consideration in part as they are suitable for addressing ecological distributions that may be highly skewed towards the most dominant taxa.
The difficulty of measuring the microbial diversity in a sample is compressing the complexity of a profile, with a multi-dimensional representation of taxonomic abundance, into a single scalar statistic. At the same time, the number of reads per sample for a metagenomic sample is unlikely to have been sufficient to sample every organism, especially when a collected sample contains a large number of low abundant organisms, i.e., the “rare biosphere”
[9]. Commonly used ecological diversity indices for quantifying intrasample diversity, include the Shannon index, a measurement of entropy and the uncertainty of the sampling outcome, and Simpson’s diversity index, a description of the probability that randomly drawing two reads from a sample will produce the same taxon. In terms of application to ecological studies, each of these indices was originally derived, or adapted from macroecology. As such, individually they can perform well when approximating the microbial diversity of common taxa, however each may fall short as a single complete measure when examining the numerous low abundant organisms that dominate the composition of many microbial communities.
Both the Shannon and Simpson diversity indices have been shown by Hill
[31] through Rényi’s definition of generalized entropy
[32] to have similar characteristics, but differing only in the contribution of low abundant taxonomic units to the magnitude of the calculated statistic. Rényi unified the Shannon and Simpson diversity indices as entropies with a parameter α, the power to which the contribution of taxonomic abundances are raised. α’s of 2, 1, and 0, are associated with Simpson’s index, Shannon’s index, and the total number of species detected, respectively. It is possible to utilize fractional α’s, eg. 0.25, to increase the contribution of the low abundant taxonomic units to a desired impact (), but if their specification is arbitrary, resultant index values may not have meaningful probabilistic or information theoretic interpretations.
The estimated S
max is dependent on the number of taxonomic units and their relative abundance across all samples from the site of interest. Thus it may be expected that a strong correlation exists between calculated diversity indices such as the Shannon index and S
max between all samples. However, our investigations suggest that this index was unable to capture enough of the low abundant taxa to correlate well with the estimated S
max for many body habitats. This finding motivated us to formulate the Tail statistic, τ, a rank-based diversity measure with a much stronger correlation to S
max, and with intuitive characteristics matching the well-understood standard deviation statistic. As described more rigorously under
Materials and Methods, τ is the standard deviation of the rank abundance curve had it been made symmetric by reflection around the most abundant, or first, taxon. The more concentrated the taxa are to a few members, the smaller τ becomes. The sensitivity of τ to low abundant taxonomic units is comparable to a Rényi entropy with fractional α, but with a probabilistic interpretation. τ consists of the number of taxonomic units as a unit of measure and due to its similarity to the standard deviation statistic, σ, further takes advantage of existing known properties of σ such as the Chebychev’s inequality
[33]. The τ statistic provides an important complement to the study of microbial diversity as it has been derived to suit the nature of 16 S profiles which tend to exhibit a long-tailed distribution. These distributions reflect the nature of species abundance in which many sequences reside in a few taxonomic units while the majority of taxa are represented by only a few sequences. As such, τ more accurately captures the contribution of low abundant taxa facilitating an improved ability to elucidate the nature and extent of variation within and between individuals and body habitats over time, a central topic of study in human microbiome research
[34].
In this study, we present results of an investigation of human microbiome diversity across habitats and individuals from 16 S profiles generated by the HMP. First, we present the application of parametric curve fitting techniques to estimate the maximum number of taxa and the instantaneous discovery rates in human body habitats. Next, we present comparisons of the Shannon, τ and Smax diversity estimates between OTU and taxonomy (genera)-based taxonomic units, and between common and all taxonomic units. Further, by the introduction of new statistical analyses tuned for next generation sequencing and improved methods to visualize and interpret their output, we not only provide a better understanding of the diversity across body sites and individuals, particularly through the contribution of low abundant taxa, but we also show how various approaches to diversity analyses complement and confirm these observations.