PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of plosonePLoS OneView this ArticleSubmit to PLoSGet E-mail AlertsContact UsPublic Library of Science (PLoS)
 
PLoS One. 2011; 6(6): e21105.
Published online 2011 June 28. doi:  10.1371/journal.pone.0021105
PMCID: PMC3125174

Extrapolation of Urn Models via Poissonization: Accurate Measurements of the Microbial Unknown

Dongxiao Zhu, Editor

Abstract

The availability of high-throughput parallel methods for sequencing microbial communities is increasing our knowledge of the microbial world at an unprecedented rate. Though most attention has focused on determining lower-bounds on the An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e001.jpg-diversity i.e. the total number of different species present in the environment, tight bounds on this quantity may be highly uncertain because a small fraction of the environment could be composed of a vast number of different species. To better assess what remains unknown, we propose instead to predict the fraction of the environment that belongs to unsampled classes. Modeling samples as draws with replacement of colored balls from an urn with an unknown composition, and under the sole assumption that there are still undiscovered species, we show that conditionally unbiased predictors and exact prediction intervals (of constant length in logarithmic scale) are possible for the fraction of the environment that belongs to unsampled classes. Our predictions are based on a Poissonization argument, which we have implemented in what we call the Embedding algorithm. In fixed i.e. non-randomized sample sizes, the algorithm leads to very accurate predictions on a sub-sample of the original sample. We quantify the effect of fixed sample sizes on our prediction intervals and test our methods and others found in the literature against simulated environments, which we devise taking into account datasets from a human-gut and -hand microbiota. Our methodology applies to any dataset that can be conceptualized as a sample with replacement from an urn. In particular, it could be applied, for example, to quantify the proportion of all the unseen solutions to a binding site problem in a random RNA pool, or to reassess the surveillance of a certain terrorist group, predicting the conditional probability that it deploys a new tactic in a next attack.

Introduction

A fundamental problem in microbial ecology is the “rare biosphere” [1] i.e. the vast number of low-abundance species in any sample. However, because most species in a given sample are rare, estimating their total number i.e. An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e002.jpg-diversity is a difficult task [2], [3], and of dubious utility [4], [5]. Although parametric and non-parametric methods for species estimation show some promise [6], [7], microbial communities may not yet have been sufficiently deeply sampled [8] to test the suitability of the models or fit their parameters. For instance, human-skin communities demonstrate an unprecedented diversity within and across skin locations of same individuals, with marked differences between specimens [9].

In an environment composed of various but an unknown number of species, let An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e003.jpg be the proportion in which a certain species An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e004.jpg occurs. Samples from microbial communities may be conceptualized as sampling–with replacement–different colored balls from an urn. The urn represents the environment where samples are taken: soil, gut, skin, etc. The balls represent the different members of the microbial community, and each color is a uniquely defined operational taxonomic unit.

In the non-parametric setting, the urn is composed by an unknown number of colors occurring in unknown relative proportions. In this setting, the An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e005.jpg-diversity of the urn [10] corresponds to the cardinality of the set An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e006.jpg. Although various lower-confidence bounds for this parameter have been proposed in the literature [11][14], tight lower-bounds on An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e007.jpg-diversity are difficult in the non-parametric setting because a small fraction of the urn could be composed by a vast number of different colors [15]. Motivated by this, we shift our interest to predicting instead the fraction of balls with a color unrepresented in the first An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e008.jpg observations from the urn. This is the unobservable random variable:

equation image

where An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e010.jpg denote the sequence of colors observed when sampling An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e011.jpg balls from the urn. Notice how An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e012.jpg depends both on the specific colors observed in the sample, and the unknown proportions of these colors in the urn. This quantity is very useful to assess what remains unknown in the urn. For instance, the probability of discovering a new color with one additional observation is precisely An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e013.jpg, and the mean number of additional observations to discover a new color is An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e014.jpg. We note that An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e015.jpg corresponds to what is called the conditional coverage of a sample of size An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e016.jpg in the literature. For this reason, we refer to An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e017.jpg as the conditional uncovered probability of the sample.

The expected value of An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e018.jpg is given by:

equation image

Unlike the conditional uncovered probability of the sample, An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e020.jpg is a parameter that depends on the unknown urn composition but not on the specific colors observed in the sample. Interest in the above quantities or related ones has ranged from estimating the probability distribution of the keys used in the Kenngruppenbuch (the Enigma cipher book) in World War II [16], to assessing the confidence that an iterative procedure with a random start has found the global maximum of a given function [17], to predicting the probability of discovering a new gene by sequencing additional clones from a cDNA library [18]. We note that An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e021.jpg is called the expected coverage of the sample in the literature.

Various predictors of An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e022.jpg and estimators of An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e023.jpg have been proposed in the literature. These are mostly based on a user-defined parameter An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e024.jpg and the statistics An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e025.jpg, An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e026.jpg; defined as the number of colors observed An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e027.jpg-times, when An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e028.jpg additional balls are sampled from the urn.

Turing and Good [19] proposed to estimate An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e029.jpg using the biased statistic An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e030.jpg. Posteriorly, Robbins [20] proposed to predict An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e031.jpg using

equation image
(1)

which he showed to be unbiased for An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e033.jpg and to satisfy the inequality An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e034.jpg. Despite the possibly small quadratic variation distance between An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e035.jpg and Robbins' estimator, and as illustrated by the plots on the left side of Fig. 1, when using Robbins' estimator to predict An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e036.jpg sequentially with An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e037.jpg (to assess the quality of the predictions at various depths in the sample), we observe that unusually small or large values of An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e038.jpg may offset subsequent predictions of An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e039.jpg. In fact, as seen on the right-hand plots of the same figure, an offset prediction is usually followed by another offset prediction of the same order of magnitude, even An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e040.jpg observations later (correlation coefficient of green clouds, An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e041.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e042.jpg on top- and bottom-right plots).

Figure 1
Point predictions in a human-gut and exponential urn.

Subsequently, for each An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e049.jpg, Starr [21] proposed to predict An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e050.jpg using

equation image
(2)

Even though An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e052.jpg is the minimum variance unbiased estimator of An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e053.jpg based on An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e054.jpg additional observations from the urn [22], Starr showed that An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e055.jpg may be strongly negatively correlated with An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e056.jpg when An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e057.jpg (note that Starr's and Robbins' estimators are identical when An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e058.jpg). Furthermore, the sequential prediction of An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e059.jpg via Starr's estimator is affected by issues similar to Robbins' estimator, which is also illustrated in Fig. 1, even when the parameter An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e060.jpg is set as large as possible, namely An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e061.jpg is equal to the sample size (correlation coefficient of orange clouds, An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e062.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e063.jpg on top- and bottom-right, respectively). We observe that An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e064.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e065.jpg are indistinguishable in a linear scale when An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e066.jpg because, for each An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e067.jpg, it applies that (see Materials and Methods):

equation image
(3)

In terms of prediction intervals, if An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e069.jpg denotes the An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e070.jpg upper quantile of a standard Normal distribution, it follows from Esty's analysis [23] that if An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e071.jpg is not very near An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e072.jpg or An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e073.jpg then

equation image
(4)

is approximately a An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e075.jpg prediction interval for An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e076.jpg. In practice, and as seen in Fig. 2, when the center of the interval is of a similar or lesser order of magnitude than its radius, the ratio between the upper- and lower-bound of these intervals may oscillate erratically, sometimes over several orders of magnitude. This can be an issue in assessing the depth of sampling in rich environments. For instance, to be highly confident that An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e077.jpg is not of practical use because one may need from An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e078.jpg to An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e079.jpg additional observations to discover a new species.

Figure 2
Prediction intervals in the human-gut and exponential urn.

The issues of the aforementioned methods are somewhat expected. On one hand, the problem of predicting An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e085.jpg is very different from estimating An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e086.jpg: the former requires predicting the exact proportion of balls in the urn with colors outside the random set An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e087.jpg, rather than in average over all possible such sets. On the other hand, the point estimators of An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e088.jpg are unlikely to predict An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e089.jpg accurately in a logarithmic scale, unless the standard deviation of An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e090.jpg is small relative to An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e091.jpg. Finally, the methods we have described from the literature were designed for static situations i.e. to predict An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e092.jpg or estimate An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e093.jpg when An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e094.jpg is fixed.

Results

Embedding Algorithm

Here we propose a new methodology to address the issues of the methods presented in the Introduction to predict An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e095.jpg. Our methodology lends itself better for a sequential analysis and accurate predictions in a logarithmic scale; in particular, also in a linear scale–though it relies on randomized sample sizes. Due to this, in static situations i.e. for fixed sample sizes, our method only yields predictions for a random sub-sample of the original sample.

Randomized sample sizes are more than just an artifact of our procedure: due to Theorem 1 below, for any predetermined sample size, there is no deterministic algorithm to predict An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e096.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e097.jpg unbiasedly, unless the urn is composed by a known and flat distribution of colors. See the Materials and Methods section for the proofs of our theorems.

Theorem 1 If An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e098.jpg is a continuous and one-to-one function then the following two statements are equivalent: (i) there is a non-randomized algorithm based on An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e099.jpg to predict An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e100.jpg conditionally unbiased; (ii) the urn is composed by a known and equidistributed number of colors.

Our methodology is based on a so called Poissonization argument [24]. This technique is often used in allocation problems to remove correlations [25]. It was applied in [26] to show that the cardinality of the random set An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e101.jpg is asymptotically Gaussian after the appropriate renormalization. Mao and Lindsay [27] used implicitly a Poissonization argument to argue that intervals such as in equation (4) have a An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e102.jpg asymptotic confidence, under the hypothesis that the times at which each color in the urn is observed obey a homogeneous Poisson point process (HPP) with a random intensity. Here, asymptotic means that the An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e103.jpg-diversity tends to infinity, which entails adding colors into the urn. Our approach, however, is not based on any assumption on the times the data was collected, nor on an asymptotic rescaling of the problem, but rather on the embedding of a sample from an urn into a HPP with intensity An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e104.jpg in the semi-infinite interval An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e105.jpg. We emphasize that the HPP is a mathematical artifice simulated independently from the urn.

In what follows, An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e106.jpg is a user-defined integer parameter. We have implemented the Poissonization argument in what we call the Embedding algorithm in Table 1. For a schematic description of the algorithm see Fig. 3 and, for its heuristic, consult the Materials and Methods section.

Figure 3
Schematic description of the Embedding algorithm.
Table 1
Embedding algorithm.

Suppose that a set An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e143.jpg of colors is already known to belong to the urn and let An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e144.jpg be the coverage probability of the colors in this set. We note that, in the context of the previous discussion, An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e145.jpg with An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e146.jpg.

To predict An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e147.jpg, draw balls from the urn until An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e148.jpg colors outside An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e149.jpg are observed. Visualize each observation as a colored point in the interval An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e150.jpg. The Poissonization consists in spacing these points out using independent exponential random variables with mean one. Due to the thinning property of Poisson point processes [28], the position An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e151.jpg of the point farthest apart from An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e152.jpg has a Gamma distribution with mean An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e153.jpg. We may exploit this to obtain conditionally unbiased predictors and exact prediction intervals for An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e154.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e155.jpg as follows. Regarding direct predictions of An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e156.jpg, note that measuring An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e157.jpg in a logarithmic rather than linear scale makes more sense when deep sampling is possible.

Theorem 2 Conditioned on An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e158.jpg and the event An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e159.jpg, the following applies:

  1. If An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e160.jpg then An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e161.jpg is unbiased for An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e162.jpg, with variance An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e163.jpg.
  2. If An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e164.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e165.jpg denotes Euler's constant then An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e166.jpg is unbiased for An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e167.jpg, with variance An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e168.jpg, which is bounded between An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e169.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e170.jpg.
  3. If An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e171.jpg, An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e172.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e173.jpg are such that
    equation image
    (5)
    then the interval An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e175.jpg contains An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e176.jpg with exact probability An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e177.jpg; in particular, An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e178.jpg contains An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e179.jpg also with probability An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e180.jpg.

We note that An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e181.jpg is the uniformly minimum variance unbiased estimator of An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e182.jpg based on An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e183.jpg exponential random variables with unknown mean An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e184.jpg. Furthermore, An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e185.jpg converges almost surely to An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e186.jpg, as An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e187.jpg tends to infinity; in particular, the point predictors in part (i) and (ii) are strongly consistent.

We also note that the logarithm of the statistic in part (i) under-estimates An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e188.jpg in average. In fact, the difference between the natural logarithm of the statistic in (i) and the statistic in (ii) is An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e189.jpg, which is negative for An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e190.jpg, and increases to zero as An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e191.jpg tends to infinity. From a computational stand point, however, the statistics An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e192.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e193.jpg differ by at most An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e194.jpg-units when An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e195.jpg. The same precision may be reached for smaller values of An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e196.jpg if larger bases are utilized. For instance, in base-10, the discrepancy will be at most An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e197.jpg for An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e198.jpg.

In regards to part (iii) of the theorem, we note that our prediction intervals for An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e199.jpg cannot contain zero unless An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e200.jpg. On the other hand, since the density function used in equation (5) is unimodal, the shortest prediction interval for An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e201.jpg corresponds to a pair of non-negative constants An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e202.jpg such that:

equation image
(6)

Similarly, optimal prediction intervals for An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e204.jpg follow when

equation image
(7)

with An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e206.jpg (see Materials and Methods for a numerical procedure to approximate these constants). In either case, because An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e207.jpg converges in distribution to a standard Normal as An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e208.jpg tends to infinity, one may select in (5) the approximate constants An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e209.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e210.jpg. With these approximate values, if An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e211.jpg then the true confidence An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e212.jpg of the associated prediction intervals satisfies (see Materials and Methods):

equation image
(8)

(The term on the exponential on the left-hand side above is big-O of An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e214.jpg; in particular, the lower-bound is of the same asymptotic order than the upper-bound.) We note that the constants produced by the Normal approximation may be crude for relatively large values of An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e215.jpg, as seen in Table 2.

Table 2
Optimal versus asymptotic An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e216.jpg prediction intervals.

As high-throughput technologies allow deeper sampling of microbial communities, it will be increasingly important to have upper- and lower-bounds for An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e247.jpg of a comparable order of magnitude. Since the prediction intervals for this quantity in Theorem 2 are of the form An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e248.jpg, and the ratio between the upper- and lower-bound of this interval is An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e249.jpg, one may wish to determine constants An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e250.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e251.jpg such that, not only (5) is satisfied, but also

equation image
(9)

where An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e253.jpg is a user-defined parameter. Not all values of An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e254.jpg are attainable for a given An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e255.jpg and confidence level. In fact, the smallest attainable value is given by the constants associated with the optimal prediction interval for An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e256.jpg. Equivalently, An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e257.jpg is attainable if and only if

equation image

Conversely, and as stated in the following result, any value of An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e259.jpg is attainable at a given confidence level, provided that the parameter An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e260.jpg is selected sufficiently large.

Theorem 3 Let An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e261.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e262.jpg be fixed constants. For each An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e263.jpg sufficiently large, there are constants An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e264.jpg such that (5) and (9) are satisfied.

For a given parameter An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e265.jpg, there are at most two constants An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e266.jpg such that An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e267.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e268.jpg are prediction intervals for An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e269.jpg with exact confidence An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e270.jpg. We refer to these as conservative-lower and conservative-upper prediction intervals, respectively. We refer to intervals of the form An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e271.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e272.jpg as upper- and lower-bound prediction intervals, respectively. See Table 3 for the determination of these constants for various values of An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e273.jpg when An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e274.jpg.

Table 3
Constants associated with 95% prediction intervals.

Effect of non-randomized sample sizes

The Embedding algorithm provides conditionally unbiased predictors and intervals for An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e295.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e296.jpg, provided that an arbitrary number of additional observations is possible until observing An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e297.jpg balls with colors outside An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e298.jpg. When dealing with fixed sample sizes, there is a positive probability of not meeting this condition, in which case the Embedding Algorithm is inconclusive. In large samples however, such as those collected in microbial datasets, the algorithm may be applied sequentially until it yields an inconclusive prediction. In such case, the true confidence of the prediction intervals produced by the algorithm satisfy the following.

Theorem 4 Suppose that condition (5) is satisfied. Conditioned on An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e299.jpg, if An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e300.jpg balls with colors outside An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e301.jpg are observed in the next An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e302.jpg draws from the urn, then the true confidence An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e303.jpg of the prediction interval for An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e304.jpg produced by the Embedding algorithm satisfies:

  1. if An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e305.jpg then An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e306.jpg;
  2. if An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e307.jpg then An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e308.jpg, where An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e309.jpg is a Gamma random variable with parameters An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e310.jpg, and An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e311.jpg is a Negative Binomial random variable with parameters An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e312.jpg.

Thus, if the Embedding algorithm produces an output in what remains of a finite sample size, the upper-bound prediction interval for An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e313.jpg has at least the user-defined confidence. This is perhaps the case of most interest in applications: it allows the user to estimate the least number of additional samples to observe a color not seen in any sample. For the other three interval types, the true confidence is approximately at least the targeted one if the probability that the algorithm produces an output in what remains of the sample is large.

Discussion

Comparisons with Robbins-Starr estimators

Note that, like Robbins' and Starr's estimators, our method requires extracting additional balls from the urn to make a prediction. However, unlike the methods of the Introduction, our method uses only the additionally collected data–instead of all the data ever collected from the urn–to make a prediction. In terms of sequential analysis, this is advantageous to recover from earlier erroneous predictions (we expand on this point in the next section, see Fig. 1).

In what remains of this section, An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e314.jpg hence An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e315.jpg, the conditional uncovered probability of a sample of size An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e316.jpg. Furthermore, to rule out trivial cases, we assume that An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e317.jpg with positive probability i.e. the urn is composed by more than just balls of a single color.

Part (i) of Theorem 1 provides a conditionally unbiased predictor for An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e318.jpg. We can show, however, that Robbins' and Starr's estimators are not conditionally unbiased for An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e319.jpg in the non-parametric case when An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e320.jpg. To see this argument, first notice that An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e321.jpg due to the inequality (3). On the other hand, if An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e322.jpg is a color in the urn such that An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e323.jpg then

equation image

As a result:

equation image

Hence, if there exists a color An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e326.jpg in the urn that makes the above quantity strictly positive (there are infinitely many such urns, including all urns composed by infinitely many colors, because An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e327.jpg) then An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e328.jpg cannot be conditionally unbiased for An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e329.jpg.

On the other hand, due to parts (i) and (ii) in Theorem 1, we obtain (see Materials and Methods):

equation image
(10)
equation image
(11)

where An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e332.jpg denotes correlation and An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e333.jpg variance. Consequently, the point predictors in Theorem 1 are positively correlated with the quantities they were designed to predict. This contrasts with Robbins' estimator, which may be strongly negatively correlated with An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e334.jpg. For instance, if An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e335.jpg for An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e336.jpg different colors in the urn, it is shown in [21] that the asymptotic correlation between An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e337.jpg and Robbins' estimator An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e338.jpg is asymptotically negative when An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e339.jpg converges to a strictly positive but finite constant An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e340.jpg. In this same regime but provided that An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e341.jpg, we can show that (see Materials and Methods):

equation image
(12)

Since the right-hand side above is negative for all An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e343.jpg sufficiently small, Starr's estimator An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e344.jpg may also have a strong negative correlation with An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e345.jpg when An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e346.jpg is much smaller than An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e347.jpg.

A further calculation based on parts (i) and (ii) in Theorem 1 shows that

equation image
equation image

In particular, for fixed An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e350.jpg, the correlations in equations (10) and (11) approach to one as An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e351.jpg tends to infinity.

Finally, for non-trivial urns with finite An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e352.jpg-diversity, i.e. urns composed by balls with at least two but a finite number of different colors, one can show for fixed An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e353.jpg that the correlation in equation (10) approaches An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e354.jpg as An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e355.jpg tends to infinity. Furthermore, if we again assume that An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e356.jpg for An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e357.jpg different colors in the urn and An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e358.jpg converges to a strictly positive but finite constant, then the correlation in equation (10) approaches zero from above. As we pointed out before, in this regime, Robbins' estimator is asymptotically negatively correlated with An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e359.jpg.

Selection of parameters

There are two main criteria to select the parameter An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e360.jpg of the Embedding algorithm in a concrete application.

One criteria applies for point predictors. In this case, conditioned on An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e361.jpg, the standard deviation of the relative error of our prediction of An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e362.jpg is An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e363.jpg (Theorem 2, part (i)). To predict An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e364.jpg, An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e365.jpg should be therefore selected as small as possible so as to meet the user's tolerance on the average relative error of our predictions. The same criteria applies for point predictors of An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e366.jpg, for which the standard deviation of the absolute error is of order An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e367.jpg, uniformly for all An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e368.jpg (Theorem 2, part (ii)).

A different criteria applies for prediction intervals. In this case, conditioned on An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e369.jpg, the user should first specify the confidence level, and how much larger he wants the upper-prediction-bound to be in relation to the lower-prediction bound of An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e370.jpg. Since the ratio between these last two quantities is given by the parameter An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e371.jpg in (9), An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e372.jpg should be selected as small as possible to meet the user's pre-specified factor An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e373.jpg for the given confidence level of the prediction interval (Theorem 3). See Table 4 for the optimal choice of An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e374.jpg for various values of An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e375.jpg when An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e376.jpg. Note that for the selected parameter An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e377.jpg, the constants associated with the optimal prediction intervals are given in equations (6) and (7), see Materials and Methods.

Table 4
Optimal selection of parameter An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e378.jpg in terms of parameter An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e379.jpg.

Simulations on analytic and non-analytic urns

We tested our methods against an urn with an exponential relative abundance rank curve over An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e395.jpg species, and an urn matching the observed distribution of microbes in a human-gut sample from [29]. We also analyzed a sample from a human-hand microbiota found in [30]. The gut and hand data are part of the largest microbial datasets collected thus far (see Fig. 4 for the relative abundance rank curve associated with each urn). The relative abundance rank curve, or for simplicity “rank curve”, associated with an urn is a graphical representation of its composition: the height of the graph above a non-negative integer An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e396.jpg is the fraction of balls in the urn with the An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e397.jpg-th most dominant color.

Figure 4
Rank curves associated with the human-gut, human-hand and exponential urn.

The blue dots and red curves on the plots on the left side in Fig. 1 show very accurate point predictions in log-scale of the conditional uncovered probability (as a function of the number of observations), when we apply the Embedding algorithm to a sample of size An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e398.jpg from the human-gut and exponential urn, respectively. In both instances, the parameter An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e399.jpg of the Embedding algorithm was set to An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e400.jpg. The accuracy of our method is confirmed by the red clouds on the plots on the right side of Fig. 1, which are centered around An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e401.jpg. The red clouds also indicate that our predictions recover more easily from offset predictions as compared to Robbins' and Starr's (correlation coefficient of red clouds, An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e402.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e403.jpg on top- and bottom-right, respectively). This is to be expected because the Embedding algorithm relies only on the additionally collected data to make a new prediction, whereas Robbins' and Starr's estimators use all the data ever collected from the urn. On the other hand, the red and blue curves in Fig. 2 show that the conservative-upper and -lower prediction intervals of the conditional uncovered probability (also as a function of the number of observations) contain this quantity with high probability and, unlike Esty's intervals, have a constant length in logarithmic scale. The intervals on the plots on the right side are tighter than those on the left because of the decrease of the parameter An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e404.jpg from An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e405.jpg to An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e406.jpg. In each case, the parameter An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e407.jpg was selected according to the guidelines in Table 4. We note that sequential predictions based on the Embedding Algorithm in figures 1 and and22 were produced until the algorithm yielded inconclusive predictions. For this reason, our predictions ended before exhausting each sample.

In the human-hand dataset, An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e408.jpg species were observed in a sample of size An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e409.jpg. To simulate draws with replacement from this environment, we produced a random permutation of the data (see Materials and Methods section). Using the Embedding algorithm with parameters An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e410.jpg, and according to our point predictor, An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e411.jpg of the species observed in the sample represent An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e412.jpg of that hand environment; in particular, the remaining An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e413.jpg is composed by at least An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e414.jpg species. Furthermore, according to our upper-bound prediction interval, and with at least a An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e415.jpg confidence, the species not represented in the sample account for less than An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e416.jpg of that environment.

To test the above predictions, we simulated the rare biosphere as follows. We hypothesized that our point prediction of the conditional uncovered probability could be offset by up to one order of magnitude. We also hypothesized that the number of unseen species in the sample had an exponential relative abundance rank curve, composed either by An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e417.jpg, An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e418.jpg or An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e419.jpg species. This leads to nine different urns in which to test our methods. These urns are devised such that they gradually change from the almost unchanged urn in the bottom left corner to the urn in the upper right, which is dominated by rare species (see Fig. 5 for the associated rank curves). As seen on the plots in Fig. 6, the Embedding algorithm yields very accurate predictions in each of these nine scenarios, for all the sample sizes considered.

Figure 5
Rank curves associated with the rare biosphere simulation in the human-gut and -hand urn.
Figure 6
Predictions in the human-hand urn when simulating the rare biosphere.

As seen in Fig. 7, our predictions are also in excellent agreement with the human-gut dataset when we simulate the rare biosphere. As expected, the conditional uncovered probability almost always lies between the predicted bounds. We also note that the predictions based on the Embedding algorithm are accurate even for a small number of observations. This suggests that our algorithm can be applied to deeply as well as shallowly sampled environments.

Figure 7
Predictions in the human-gut urn when simulating the rare biosphere.

Materials and Methods

Heuristic behind the Embedding algorithm

The number of times a rare color occurs in a sample from an urn is approximately Poisson distributed. In the non-parametric setting, a direct use of this approximation is tricky because “rare” is relative to the sample size and the unknown urn composition. The embedding into a HPP is a way to accommodate for the Poisson approximation heuristic, without making additional assumptions on the urn's composition. To fix ideas, imagine that no ball in the urn is colored black. Make up a second urn with a single ball colored black. We refer to this as the “black-urn”. Now sample (with replacement) balls according to the following scheme: draw a ball from the original- versus black-urn with probability An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e442.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e443.jpg, respectively, where An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e444.jpg is a fixed but small parameter. Under this sampling scheme, even the most abundant colors in the original-urn are rare. In particular, the smaller An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e445.jpg is, the closer is the distribution of the number of times a particular set of colors (excluding black) is observed to a Poisson distribution. This approach is not very practical, however, because the number of samples to observe a given number of balls from the original urn can be astronomically large when An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e446.jpg is very small. To overpass this issue imagine drawing a ball every An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e447.jpg-seconds. Draws from the original urn will then be apart An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e448.jpg seconds, where An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e449.jpg has a Geometric distribution with mean An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e450.jpg. As a result: An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e451.jpg, for An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e452.jpg. Thus, as An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e453.jpg gets smaller, the time-separations between consecutive samples from the original urn resemble independent Exponential random variables with mean one. The black-urn can therefore be removed from the heuristic altogether by embedding samples from the original urn into a HPP with intensity one over the interval An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e454.jpg.

Simulating draws with replacement

To simulate draws with replacement using data already collected from an environment, produce a random permutation of the data. This can be accomplished with low-memory complexity using the discrete inverse transform method to simulate draws–without replacement–from a finite population [31].

Constants associated with optimal prediction intervals

To numerically approximate a pair of constants An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e455.jpg such that An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e456.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e457.jpg, where the integer An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e458.jpg and the number An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e459.jpg are given constants, introduce the auxiliary variable An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e460.jpg, and note that the later condition is fulfilled only when An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e461.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e462.jpg. Due to Newton's method, the sequence An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e463.jpg defined recursively as follows converges to the unique An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e464.jpg that satisfies the integrability condition, provided that An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e465.jpg is chosen sufficiently close to An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e466.jpg:

equation image
equation image
equation image

Proof of Inequality (3)

First notice that

equation image
(13)

To bound the first term on the right-hand side above, notice that An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e471.jpg. As a result, since An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e472.jpg, we obtain that:

equation image
(14)

On the other hand, to bound the second term on the right-hand side of equation (13), define the quantity An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e474.jpg and notice that An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e475.jpg. Using that a weighted average is at most the largest of the terms averaged, we obtain that:

equation image
(15)

where, for the last inequality, we have used that for each An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e477.jpg, the associated product is less or equal to the factor associated with the index An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e478.jpg. Equation (3) is now a direct consequence of equations (13), (14) and (15).

Proof of Theorem 1

In what follows, An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e479.jpg denotes the inverse function of An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e480.jpg.

Define An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e481.jpg to be the set of decreasing partitions of An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e482.jpg i.e. vectors of the form An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e483.jpg, with An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e484.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e485.jpg integers, such that An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e486.jpg. To each possible sample An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e487.jpg, let An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e488.jpg be the decreasing partition of An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e489.jpg associated with the observed ranks in the sample.

Define An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e490.jpg, for each set An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e491.jpg of colors. Part (i) in the theorem is equivalent to the existence of a function An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e492.jpg such that

equation image
(16)

with probability one. This is because, in the non-parametric setting, the different colors in the urn carry no intrinsic meaning apart from being different. If there is a certain function An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e494.jpg which satisfies condition (16) then An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e495.jpg, for each color An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e496.jpg such that An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e497.jpg. In particular, the set An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e498.jpg must be finite. Furthermore, if this set has cardinality An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e499.jpg then An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e500.jpg, for each color An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e501.jpg in the set; in particular, An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e502.jpg. Condition (ii) is therefore necessary for condition (i). Conversely, if condition (ii) is satisfied and the urn is composed by An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e503.jpg colors occurring in equal proportions then the function An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e504.jpg defined as An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e505.jpg satisfies condition (16).

Proof of Theorem 2

Conditioned on the set An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e506.jpg, and the random index An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e507.jpg used in Step 3 of the Embedding algorithm, An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e508.jpg has a Gamma distribution with shape parameter An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e509.jpg and scale parameter An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e510.jpg. However, because An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e511.jpg has a Negative Binomial distribution, conditioned on An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e512.jpg alone, An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e513.jpg has Gamma distribution with shape parameter An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e514.jpg and scale parameter An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e515.jpg. In particular, An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e516.jpg has probability density function An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e517.jpg, for An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e518.jpg. From this, parts (i) and (iii) in the theorem are immediate. To show part (ii), notice first that An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e519.jpg is conditionally unbiased for An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e520.jpg, where

equation image

The second identity above is due to an integration by parts argument and only holds for An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e522.jpg. However, since An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e523.jpg, we obtain that An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e524.jpg, for An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e525.jpg. This shows that An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e526.jpg is conditionally unbiased for An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e527.jpg. To complete the proof of the theorem, notice that An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e528.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e529.jpg have the same variance. In particular, An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e530.jpg, where

equation image

The last identity above holds only for An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e532.jpg. Using that An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e533.jpg, we conclude that An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e534.jpg, for An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e535.jpg. As a result: An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e536.jpg; in particular, since An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e537.jpg, An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e538.jpg. The theorem is now a consequence of the following inequalities:

equation image

Proof of Equation (8)

Let An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e540.jpg and assume that An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e541.jpg. Observe that:

equation image

The factor multiplying the previous integral is an increasing function of An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e543.jpg; in particular, due to Stirling's formula, it is bounded by An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e544.jpg from above. Furthermore, from section 6.1.42 in [32], it follows that

equation image

On the other hand, if one rewrites the integrand of the previous integral in an exponential-logarithmic form and uses that An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e546.jpg, for all An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e547.jpg, where

equation image

one sees that

equation image

All together, these inequalities imply that

equation image

from which the result follows.

Proof of Theorem 3

Due to the Central Limit Theorem, if An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e551.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e552.jpg then

equation image

As a result, for all An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e554.jpg sufficiently large, An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e555.jpg, and the integral on the left-hand side above is greater than or equal to An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e556.jpg. Fix any such An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e557.jpg. Since the value of the associated integral may be decreased continuously by increasing the parameter An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e558.jpg, there is An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e559.jpg such that An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e560.jpg and

equation image

Define An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e562.jpg, for An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e563.jpg. Since An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e564.jpg and, because An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e565.jpg, An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e566.jpg, the continuity of An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e567.jpg implies that there is An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e568.jpg such that An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e569.jpg. Selecting An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e570.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e571.jpg shows the theorem.

Proof of Theorem 4

The proof is based on a coupling argument. First observe that one can define on the same probability space random variables An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e572.jpg such that (1) An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e573.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e574.jpg have Negative Binomial distributions with parameters An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e575.jpg, but with An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e576.jpg conditioned to be less than or equal to An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e577.jpg; (2) An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e578.jpg but An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e579.jpg when An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e580.jpg; and (3) An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e581.jpg are independent Exponentials with mean An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e582.jpg and independent of An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e583.jpg.

Let An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e584.jpg be the event “An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e585.jpg balls with colors outside An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e586.jpg are observed in the next An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e587.jpg draws from the urn”. Conditioned on An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e588.jpg, we have that An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e589.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e590.jpg. As a result:

equation image
equation image

Since An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e593.jpg, and because An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e594.jpg when An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e595.jpg, we obtain that

equation image

From this, the upper-bound in part (i) and both inequalities in part (ii) follow after noticing that An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e597.jpg has a Gamma distribution with shape parameter An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e598.jpg and scale parameter An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e599.jpg. To show the lower-bound in (i), we again notice that An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e600.jpg. In particular, if An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e601.jpg then

equation image

Proof of Equations (10) and (11)

Consider random variables An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e603.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e604.jpg and a random vector An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e605.jpg, defined on a same probability space. Assume that An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e606.jpg is square-integrable and conditionally unbiassed for An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e607.jpg given An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e608.jpg i.e. An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e609.jpg. Furthermore, assume that An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e610.jpg hence An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e611.jpg. Because An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e612.jpg is also square-integrable and An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e613.jpg, we obtain that

equation image
equation image
equation image

Hence An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e617.jpg.

Equation (10) follows by considering An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e618.jpg, An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e619.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e620.jpg. Similarly, equation (11) follows by considering An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e621.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e622.jpg.

Proof of Inequality (12)

First note that

equation image
(17)

Now observe that An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e624.jpg because of inequality (13), which implies that An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e625.jpg because An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e626.jpg. On the other hand, because Robbins' and Starr's estimators are both unbiased for An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e627.jpg, we have An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e628.jpg. Furthermore, according to the proof of Theorem 2 in [21], An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e629.jpg, therefore

equation image

As a result, An external file that holds a picture, illustration, etc.
Object name is pone.0021105.e631.jpg. Inequality (12) is now a direct consequence of inequality (17), and the next identities [21]:

equation image
equation image
equation image

Acknowledgments

We would like to thank three anonymous referees for their careful reading of our manuscript and their numerous suggestions, which were incorporated in this final version. The authors are also thankful to R. Knight for contributing to an early version of the code, implementing some of the analyses, and commenting on the manuscript.

Footnotes

Competing Interests: The authors have declared that no competing interests exist.

Funding: M. Lladser was partially supported by NASA ROSES NNX08AP60G and NIH R01 HG004872. R. Gouet was supported by FONDAP, BASAL-CMM and Fondecyt 1090216. J. Reeder was supported by a post-doctoral scholarship from the German Academic Exchange Service (DAAD). M. Lladser and R. Gouet are thankful to the project Núcleo Milenio Información y Aleatoriedad, which facilitated a research visit to collaborate in person. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Sogin ML, Morrison HG, Huber JA, Welch DM, Huse SM, et al. Microbial diversity in the deep sea and the underexplored \rare biosphere”. Proc Natl Acad Sci USA. 2006;103:12115–12120. [PubMed]
2. Hughes JB, Hellmann JJ, Ricketts TH, Bohannan BJ. Counting the uncountable: statistical approaches to estimating microbial diversity. Appl Environ Microbiol. 2001;67:4399–4406. [PMC free article] [PubMed]
3. Schloss PD, Handelsman J. Status of the microbial census. Microbiol Mol Biol Rev. 2004;68:686–691. [PMC free article] [PubMed]
4. Curtis TP, Head IM, Lunn M, Woodcock S, Schloss PD, et al. What is the extent of prokaryotic diversity? Phil Trans R Soc Lond. 2006;361:2023–2037. [PMC free article] [PubMed]
5. Roesch LF, Fulthorpe RR, Riva A, Casella G, Hadwin AK, et al. Pyrosequencing enumerates and contrasts soil microbial diversity. Isme J. 2007;1:283–290. [PMC free article] [PubMed]
6. Hong SH, Bunge J, Jeon SO, Epstein SS. Predicting microbial species richness. Proc Natl Acad Sci USA. 2006;103:117–122. [PubMed]
7. Quince C, Curtis TP, Sloan WT. The rational exploration of microbial diversity. Isme J. 2008;2:997–1006. [PubMed]
8. Turnbaugh PJ, Hamady M, Yatsunenko T, Cantarel BL, Duncan A, et al. A core gut microbiome in obese and lean twins. Nature. 2007;457:480–484. [PMC free article] [PubMed]
9. Fierer N, Lauber CL, Zhou N, McDonald D, Costello EK, et al. Forensic identification using skin bacterial communities and/or references within. Proc Natl Acad Sci USA. 2010;107:6477–6481. [PubMed]
10. Magurran AE. Measuring Biological Diversity. Oxford - Blackwell; 2004.
11. Burnham KP, Overton WS. Estimation of the size of a closed population when capture probabilities vary among animals. Biometrika. 1978;65:625–633.
12. Chao A. Nonparametric estimation of the number of classes in a population. Scand J Stat. 1984;11:265–270.
13. Chao A. Estimating the population size for capture-recapture data with unequal catchability. Biometrics. 1897;43:783–791. [PubMed]
14. Mao CX, Lindsay BG. Estimating the number of classes. Ann Stat. 2007;35:917–930.
15. Bunge J, Fitzpatrick M. Estimating the number of species: A review. J Am Stat Assoc. 1993;88:364–373.
16. Hinsley F, Stripp A. Codebreakers: The Inside Story of Bletchley Park. Oxford Univ. Press; 1993.
17. Finch SJ, Mendell NR, Thode HC., Jr Probabilistic measures of adequacy of a numerical search for a global maximum. J Am Stat Assoc. 1989;84:1020–1023.
18. Mao CX. Predicting the conditional probability of discovering a new class. J Am Stat Assoc. 2004;99:1108–1118.
19. Good IJ. The population frequencies of species and the estimation of population parameters. Biometrika. 1953;40:237–264.
20. Robbins HE. On estimating the total probability of the unobserved outcomes of an experiment. Ann Math Stat. 1968;39:256–257.
21. Starr N. Linear estimation of the probability of discovering a new species. Ann Stat. 1979;7:644–652.
22. Clayton MK, Frees EW. Nonparametric estimation of the probability of discovering a new species. J Am Stat Assoc. 1987;82:305–311.
23. Esty WW. A Normal limit law for a nonparametric estimator of the coverage of a random sample. Ann Statist. 1983;11:905–912.
24. Aldous D. Probability Approximations via the Poisson Clumping Heuristic. Springer-Verlag; 1988.
25. Mahmoud HM. Sorting: A Distribution Theory. Wiley-Interscience; 2000.
26. Hwang HK, Janson S. Local limit theorems for finite and infinite urn models. Ann Probab. 2008;36:992–1022.
27. Mao CX, Lindsay BG. A poisson model for the coverage problem with a genomic application. Biometrika. 2002;89:669–681.
28. Durrett R. Essentials of stochastic processes. Springer Texts in Statistics; 1999.
29. Turnbaugh PJ, Ridaura VK, Faith JJ, Rey FE, Knight R, et al. The e_ect of diet on the human gut microbiome: A metagenomic analysis in humanized gnotobiotic mice. Sci Transl Med. 2009;1:6ra14. [PMC free article] [PubMed]
30. Fierer N, Hamady M, Lauber CL, Knight R. The influence of sex, handedness, and washing on the diversity of hand surface bacteria. Proc Natl Acad Sci USA. 2008;105:17994–17999. [PubMed]
31. Ross SM. Simulation. Academic Press, third edition; 2002.
32. Abramowitz M, Stegun IA. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. New York: Dover, ninth Dover printing, tenth GPO printing edition; 1964.

Articles from PLoS ONE are provided here courtesy of Public Library of Science