
J Agric Biol Environ Stat. Author manuscript; available in PMC 2010 July 8.

Published in final edited form as:

PMCID: PMC2899312

EMSID: UKMS31267

H.-Y. Pan is Assistant Professor, Department of Applied Mathematics, National Chia-Yi University, Chia-Yi, Taiwan 60004 (Email: hypan@mail.ncyu.edu.tw). Anne Chao is Tsing Hua Distinguished Chair Professor, Institute of Statistics, National Tsing Hua University, Hsin-Chu, Taiwan 30043 (Email: chao@stat.nthu.edu.tw). Wilhelm Foissner is University Professor, Universität Salzburg, FB Organismische Biologie, Hellbrunnerstrasse 34, A-5020 Salzburg, Austria (Email: wilhelm.FOISSNER@sbg.ac.at).

The publisher's final edited version of this article is available at J Agric Biol Environ Stat


In biological and ecological statistical inference, it is practically useful to provide a lower bound for species richness in a community. Chao (1984, 1989) derived a nonparametric lower bound for species richness in a single community. However, there have been no lower bounds proposed in the literature for the number of species shared by multiple communities. Based on sample species abundance or replicated incidence records from each of the *N* communities, we derive in this article a nonparametric approach to constructing a lower bound for the number of species shared by *N* (*N* ≥ 2) communities. The approach is valid for all types of species abundance distributions (for abundance data) or species detection probabilities (for replicated incidence data). Variance estimators for the proposed lower bounds are obtained by using typical asymptotic theory. Simulation results are reported to examine the performance of the lower bounds. Replicated incidence data of ciliate species collected in three areas from Namibia, southwest Africa, are used for illustration. We also briefly discuss the application of the proposed method to estimate the size of a shared population (i.e., the number of individuals in the intersection of multiple populations) based on capture-recapture data from each population.

Species richness in a single community (alpha diversity) is a classic concept for characterizing community diversity. The estimation of species richness has been extensively discussed in the literature; see Seber (1982), Bunge and Fitzpatrick (1993), Colwell and Coddington (1994), and Chao (2005) for reviews. For multiple communities, the number of shared species plays an important role for describing community overlap and forms a basis to construct various types of similarity indices or beta diversity. When compared with species richness in one community, the estimation of shared species richness in multiple communities has received relatively little attention. Although estimators for shared species richness in two communities were proposed (e.g., Chao et al. 2000; Chao, Shen, and Hwang 2006), these methods have not been extended to more than two communities.

It is intuitively understood that, if there are many undetectable or “invisible” species in a hyper-diverse community, then it is impossible to obtain a good estimate of species richness. Therefore, it is practically useful to provide a lower bound for species richness. A nonparametric lower bound in a single community was derived by Chao (1984, 1989) for abundance-based data and for replicated incidence (i.e., presence or absence) data. The Chao (1984, 1989) lower bound has been applied in various disciplines. For example, microbiologists used it to infer species richness in microbial hyper-diverse communities (Hughes et al. 2001; Bohannan and Hughes 2003; Stach et al. 2003; Schloss and Handelsman 2005). However, there have been no lower bounds proposed for the number of species shared by multiple communities.

Part of this research was initiated by analyzing soil ciliate species data collected in three areas of Namibia, southwest Africa by Foissner and colleagues (Foissner, Agatha, and Berger 2002). See Table 1 and Section 5 for data and detailed analysis. Questions concerning the alpha, beta, and gamma diversity of microorganisms and their biogeographical distribution (ubiquity or endemicity) have generated extensive discussion in the literature. However, previous diversity analysis for soil ciliate species (Chao et al. 2006) was limited to alpha diversity in a single community (or area). In order to investigate the community overlap or beta diversity based on multiple-community data in Namibia, we were motivated to estimate the number of species shared by at least two communities (or areas).

Table 1. Data summary for three areas of Namibia (original data are given in Foissner, Agatha, and Berger 2002).

When sample abundance or replicated incidence records are available from each of the *N* (*N* ≥ 2) communities, we propose in this article a unified approach to constructing a nonparametric lower bound for the number of species shared by *N* communities. The proposed method is nonparametric in the sense that it does not depend on assumptions about the species abundance distribution (for abundance data) or the species detection probabilities (for replicated incidence data). Variance estimators for the proposed lower bounds are also obtained.

Since a review of the details on deriving the lower bound of species richness in a single community (Chao 1984, 1989) would greatly help to extend the framework to multiple communities, we provide such a review in Section 2 separately for abundance data (in Section 2.1) and replicated incidence data (in Section 2.2). In Section 3, we develop a lower bound for the number of species shared by two communities. In Section 4, a unified approach is described for the case of more than two communities. In Section 5, the replicated incidence data for ciliate species that motivated this research are analyzed as an illustrative example. Section 6 reports a simulation study in order to examine the performance of the proposed method. Some concluding remarks and relevant discussion are provided in Section 7.

We first review the lower bound of species richness (Chao 1984, 1989) in a single community. Assume that there are *S* species indexed from 1 to *S* and that a fixed number of *n* individuals are independently observed in the community. Denote the species probabilities by (*θ*_{1}, *θ*_{2}, … , *θ*_{S}), where ${\sum}_{i=1}^{S}{\theta}_{i}=1$. That is, *θ*_{i} is the probability that any observed individual belongs to the *i*th species.

Let *X*_{i} (species frequency) be the number of times, or individuals, that the *i*th species is observed in the sample, *i* = 1, 2, … , *S*.

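Let *f*_{k}, *k* = 0, 1, … , *n*, be the number of species that are observed exactly *k* times in the sample, i.e., ${f}_{k}={\sum}_{i=1}^{S}I({X}_{i}=k)$, where *I*(·) is the indicator function. In particular, *f*_{0} is the number of undetected species, *f*_{1} is the number of singletons, and *f*_{2} is the number of doubletons. The number of observed species is $D={\sum}_{k\ge 1}{f}_{k}$.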

A parametric approach to estimating species richness is to assume that (*θ*_{1}, *θ*_{2}, … , *θ*_{S}) follows some type of distribution characterized by a few parameters. For example, Fisher, Corbet, and Williams (1943) assumed that ${\theta}_{i}={\lambda}_{i}/{\sum}_{k=1}^{S}{\lambda}_{k}$, where (*λ*_{1}, *λ*_{2}, … , *λ*_{S}) are a random sample from a gamma distribution.

Since *S* = *D* + *f*_{0}, our estimating target becomes E(*f*_{0}). Under the assumption that (*θ*_{1}, *θ*_{2}, … , *θ*_{S}) are fixed unknown parameters, the expected numbers of undetected species, singletons, and doubletons are, respectively:

$$\mathrm{E}\left({f}_{0}\right)=\sum _{i=1}^{S}{(1-{\theta}_{i})}^{n},$$

(2.1a)

$$\mathrm{E}\left({f}_{1}\right)=\sum _{i=1}^{S}n{\theta}_{i}{(1-{\theta}_{i})}^{n-1},$$

(2.1b)

$$\mathrm{E}\left({f}_{2}\right)=\sum _{i=1}^{S}\left(\begin{array}{c}\hfill n\hfill \\ \hfill 2\hfill \end{array}\right){\theta}_{i}^{2}{(1-{\theta}_{i})}^{n-2}.$$

(2.1c)

Based on Equations (2.1a) to (2.1c), Chao (1989) used the following Cauchy–Schwarz inequality

$$\left[\sum _{i=1}^{S}{(1-{\theta}_{i})}^{n}\right]\left[\sum _{i=1}^{S}{\theta}_{i}^{2}{(1-{\theta}_{i})}^{n-2}\right]\ge {\left[\sum _{i=1}^{S}{\theta}_{i}{(1-{\theta}_{i})}^{n-1}\right]}^{2},$$

(2.2)

to obtain a theoretical bound for E(*f*_{0}):

$$\mathrm{E}\left({f}_{0}\right)\ge \frac{(n-1)}{n}\frac{{\left[\mathrm{E}\left({f}_{1}\right)\right]}^{2}}{2\mathrm{E}\left({f}_{2}\right)}.$$

(2.3)
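The step from Inequality (2.2) to the bound (2.3) can be spelled out as follows. This is only a rearrangement of Equations (2.1a) to (2.1c); the one added requirement is that E(*f*_{2}) > 0 so that the division is valid. Inequality (2.2) is the Cauchy–Schwarz inequality applied to ${a}_{i}={(1-{\theta}_{i})}^{n/2}$ and ${b}_{i}={\theta}_{i}{(1-{\theta}_{i})}^{n/2-1}$. From (2.1b) and (2.1c),

$$\sum _{i=1}^{S}{\theta}_{i}{(1-{\theta}_{i})}^{n-1}=\frac{\mathrm{E}\left({f}_{1}\right)}{n},\qquad \sum _{i=1}^{S}{\theta}_{i}^{2}{(1-{\theta}_{i})}^{n-2}=\frac{2\mathrm{E}\left({f}_{2}\right)}{n(n-1)}.$$

Substituting these two identities and (2.1a) into (2.2) gives

$$\mathrm{E}\left({f}_{0}\right)\cdot \frac{2\mathrm{E}\left({f}_{2}\right)}{n(n-1)}\ge \frac{{\left[\mathrm{E}\left({f}_{1}\right)\right]}^{2}}{{n}^{2}}\quad \Longrightarrow \quad \mathrm{E}\left({f}_{0}\right)\ge \frac{n(n-1)}{{n}^{2}}\frac{{\left[\mathrm{E}\left({f}_{1}\right)\right]}^{2}}{2\mathrm{E}\left({f}_{2}\right)}=\frac{(n-1)}{n}\frac{{\left[\mathrm{E}\left({f}_{1}\right)\right]}^{2}}{2\mathrm{E}\left({f}_{2}\right)}.$$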

The inequality becomes an equality if and only if all probabilities are equal (a homogeneous case). If *f*_{2} > 0, we can replace the expected values by the observed data in Equation (2.3) and a lower bound for species richness becomes:

$$\widehat{S}=D+\frac{(n-1)}{n}\frac{{f}_{1}^{2}}{2{f}_{2}},$$

(2.4)

with the bound being achieved under a homogeneous community. We remark that instead of treating (*θ*_{1}, *θ*_{2}, … , *θ _{S}*) as fixed parameters, they can be modeled as random effects selected from an unknown distribution. Under a random-effect model, parallel derivation results in the same estimator. A bias-corrected estimator in a homogeneous case turns out to be

$$\stackrel{~}{S}=D+{f}_{1}({f}_{1}-1)/\left[2({f}_{2}+1)\right].$$

(2.4a)
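As an illustration, Equations (2.4) and (2.4a) can be evaluated directly from a vector of per-species abundance counts. The following is a minimal sketch, not the authors' software: the function name `chao1_lower_bound`, its arguments, and the sample data are hypothetical, and falling back to the bias-corrected form when *f*_{2} = 0 is one reasonable choice for avoiding a zero denominator.

```python
def chao1_lower_bound(counts, bias_corrected=False):
    """Lower bound for species richness from per-species abundance counts.

    Uses Equation (2.4) by default and the bias-corrected form (2.4a)
    when requested or when there are no doubletons (f2 = 0).
    """
    observed = [c for c in counts if c > 0]      # drop undetected species
    n = sum(observed)                            # sample size n
    d = len(observed)                            # observed richness D
    f1 = sum(1 for c in observed if c == 1)      # singletons
    f2 = sum(1 for c in observed if c == 2)      # doubletons
    if bias_corrected or f2 == 0:
        return d + f1 * (f1 - 1) / (2 * (f2 + 1))
    return d + (n - 1) / n * f1 ** 2 / (2 * f2)


# Hypothetical sample: 8 observed species, n = 16, f1 = 4, f2 = 2.
sample = [1, 1, 1, 1, 2, 2, 3, 5]
print(chao1_lower_bound(sample))                       # 11.75
print(chao1_lower_bound(sample, bias_corrected=True))  # 10.0
```

Only the numbers of singletons and doubletons enter the correction term, so species with three or more individuals contribute only through *D* and *n*.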

The lower bound in Equation (2.4) was proposed by Chao (1984) using an alternative derivation. For abundance data, the sample size *n* is often large so that the term (*n* − 1)/*n* in the bound can be dropped and the estimator in Equation (2.4) is reduced to $D+{f}_{1}^{2}/\left(2{f}_{2}\right)$. This simplified estimator has been referred to as the Chao1 estimator in the biological and ecological literature (e.g., Colwell and Coddington 1994; Walther and Morand 1998; Hughes et al. 2001). It is also featured in several computer software packages including EstimateS (Colwell 2004), DOTUR (Schloss and Handelsman 2005), SPADE (Chao and Shen 2003), and WS2m (Turner, Leitner, and Rosenzweig 1999).

An estimated variance formula derived in Chao and Shen (2003) for the estimator in Equation (2.4) is

$$\widehat{\mathrm{var}}\left(\widehat{S}\right)={f}_{2}[0.25{K}^{2}{({f}_{1}/{f}_{2})}^{4}+{K}^{2}{({f}_{1}/{f}_{2})}^{3}+0.5K{({f}_{1}/{f}_{2})}^{2}],$$

(2.5)

where *K* = (*n* − 1)/*n*. When *f*_{2} = 0, we suggest using the bias-corrected form, and the lower bound becomes $\stackrel{~}{S}=D+{f}_{1}({f}_{1}-1)/2$. In this instance, the variance formula is modified to

$$\widehat{\mathrm{var}}\left(\stackrel{~}{S}\right)=0.25{f}_{1}{(2{f}_{1}-1)}^{2}+0.5{f}_{1}({f}_{1}-1)-0.25{f}_{1}^{4}/\stackrel{~}{S}.$$

(2.6)
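The two variance formulas can be written as a single function of the summary statistics. This is a sketch under stated assumptions: the name `chao1_variance` and its argument order are hypothetical, and the function simply transcribes Equations (2.5) and (2.6) without any further correction. It assumes at least one species was observed (*D* > 0).

```python
def chao1_variance(n, d, f1, f2):
    """Estimated variance of the single-community lower bound.

    n: sample size, d: observed richness D,
    f1: number of singletons, f2: number of doubletons.
    """
    if f2 > 0:
        # Equation (2.5), with K = (n - 1) / n and r = f1 / f2.
        k = (n - 1) / n
        r = f1 / f2
        return f2 * (0.25 * k ** 2 * r ** 4 + k ** 2 * r ** 3 + 0.5 * k * r ** 2)
    # Equation (2.6): f2 = 0, so the bound is S~ = D + f1 (f1 - 1) / 2.
    s_tilde = d + f1 * (f1 - 1) / 2
    return (0.25 * f1 * (2 * f1 - 1) ** 2
            + 0.5 * f1 * (f1 - 1)
            - 0.25 * f1 ** 4 / s_tilde)


# Hypothetical inputs: n = 16, D = 8, f1 = 4, f2 = 2.
print(chao1_variance(16, 8, 4, 2))  # 24.84375
```

The square root of the returned value is the estimated standard error of the bound.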

One advantage of using the Chao1 estimator is that the estimated number of undetected species depends only on the first two frequency counts, i.e., the numbers of singletons and doubletons. This implies that ecologists do not need to obtain the exact frequency of any species that has at least three individuals in the sample. The estimator is especially useful if counting the exact number of individuals for each species appearing in the sample requires substantial effort.

In many microorganism surveys, only species presence/absence data can be collected because there are too many individuals to be counted. For example, in the ciliate species data (Section 5) and other microbial data, it is not possible to count exactly the number of individuals and thus only the presence/absence of each observed species was recorded. Accordingly, only replicated incidence data were available.

Assume that there are *t* samples and they are indexed 1, 2, … , *t*. We use the general term “sample” which could also refer to a team, occasion, transect line, a fixed period of time, or an investigator. The presence or absence of any species for these *t* samples is recorded to form a species-by-sample incidence matrix. In most applications, sufficient statistics from the species-by-sample incidence matrix are the incidence-based frequency counts (*Q*_{1}, *Q*_{2}, … , *Q*_{t}), where *Q*_{k} denotes the number of species that are detected in exactly *k* samples, *k* = 1, 2, … , *t*.

Assume that the species detection probabilities, defined as the chance of encountering at least one individual of a given species in any sample, are (*θ*_{1}, *θ*_{2}, … , *θ _{S}*) and these probabilities are kept constant across the samples. We remark that, unlike the constraint ${\sum}_{i=1}^{S}{\theta}_{i}=1$ in abundance data, ${\sum}_{i=1}^{S}{\theta}_{i}$ may be greater than 1 for incidence-based data.

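Parallel derivations to those in Section 2.1 can be made with *n* being replaced by *t*, and the counts (*f*_{1}, *f*_{2}, … , *f*_{n}) replaced by (*Q*_{1}, *Q*_{2}, … , *Q*_{t}). The lower bound in Equation (2.4) then becomes $\widehat{S}=D+\frac{(t-1)}{t}\frac{{Q}_{1}^{2}}{2{Q}_{2}}$.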
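To make the incidence case concrete, the sketch below reduces a species-by-sample incidence matrix to the counts it needs and evaluates Equation (2.4) with *n* replaced by *t* and (*f*_{1}, *f*_{2}) replaced by the numbers of species detected in exactly one and exactly two samples. The function name, the list-of-rows matrix layout, and the choice to raise an error when no species is detected in exactly two samples are assumptions made for illustration.

```python
def incidence_lower_bound(matrix):
    """Lower bound for species richness from replicated incidence data.

    matrix: one row per species, one 0/1 entry per sample.
    """
    t = len(matrix[0])                                # number of samples
    detections = [sum(row) for row in matrix]         # samples per species
    detections = [x for x in detections if x > 0]     # observed species only
    d = len(detections)                               # observed richness D
    q1 = detections.count(1)                          # detected in one sample
    q2 = detections.count(2)                          # detected in two samples
    if q2 == 0:
        raise ValueError("Q2 = 0: the original bound is undefined")
    return d + (t - 1) / t * q1 ** 2 / (2 * q2)


# Hypothetical data: 5 species by 4 samples; the last species is never detected.
incidence = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
]
print(incidence_lower_bound(incidence))  # 5.5
```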

This section extends our approach to the estimation of the number of species shared by two communities. Assume that there are *S*_{1} species in community I and *S*_{2} species in community II. The species probabilities in communities I and II are denoted by (*θ*_{11}, *θ*_{21}, … , *θ*_{S1,1}) and (*θ*_{12}, *θ*_{22}, … , *θ*_{S2,2}), respectively, where ${\sum}_{i=1}^{{S}_{1}}{\theta}_{i1}={\sum}_{i=1}^{{S}_{2}}{\theta}_{i2}=1$. Let the number of shared species be *S*_{12}. Without loss of generality, we assume that the first *S*_{12} species are the shared species.

Two random samples (sample I with size *n*_{1} and sample II with size *n*_{2}) are taken from communities I and II, respectively. Assume that *D*_{12} shared species are observed. Denote the observed frequencies in the two communities, respectively, by (*X*_{11}, *X*_{21}, … , *X*_{S1,1}) and (*X*_{12}, *X*_{22}, … , *X*_{S2,2}). Define for any two nonnegative integers *j* and *k*,

$${f}_{jk}=\sum _{i=1}^{{S}_{12}}I({X}_{i1}=j,{X}_{i2}=k),$$

(3.1a)

$${f}_{j+}=\sum _{i=1}^{{S}_{12}}I({X}_{i1}=j,{X}_{i2}\ge 1),$$

(3.1b)

$${f}_{+k}=\sum _{i=1}^{{S}_{12}}I({X}_{i1}\ge 1,{X}_{i2}=k).$$

(3.1c)

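That is, *f*_{jk} denotes the number of *shared* species that are observed *j* times in sample I and *k* times in sample II; *f*_{j+} denotes the number of shared species that are observed *j* times in sample I and at least once in sample II; and *f*_{+k} is defined similarly. In particular, *f*_{+0} is the number of shared species that are observed in sample I but not in sample II.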

Since *S*_{12} = *D*_{12} + *f*_{+0} + *f*_{0+} + *f*_{00} and only *D*_{12} is observable, our approach is to find a lower bound for each of the expected values of the other three terms, i.e., E(*f*_{+0}), E(*f*_{0+}), and E(*f*_{00}). Assuming a multinomial model for each of the two sets of frequencies, we have

$$\mathrm{E}\left({f}_{00}\right)=\sum _{i=1}^{{S}_{12}}{(1-{\theta}_{i1})}^{{n}_{1}}{(1-{\theta}_{i2})}^{{n}_{2}},$$

(3.2a)

$$\mathrm{E}\left({f}_{+0}\right)=\sum _{i=1}^{{S}_{12}}[1-{(1-{\theta}_{i1})}^{{n}_{1}}]{(1-{\theta}_{i2})}^{{n}_{2}},$$

(3.2b)

$$\mathrm{E}\left({f}_{0+}\right)=\sum _{i=1}^{{S}_{12}}{(1-{\theta}_{i1})}^{{n}_{1}}[1-{(1-{\theta}_{i2})}^{{n}_{2}}].$$

(3.2c)

- A lower bound for E(*f*_{+0}): Note we have

$$\mathrm{E}\left({f}_{+1}\right)=\sum _{i=1}^{{S}_{12}}[1-{(1-{\theta}_{i1})}^{{n}_{1}}]{n}_{2}{\theta}_{i2}{(1-{\theta}_{i2})}^{{n}_{2}-1},$$

$$\mathrm{E}\left({f}_{+2}\right)=\sum _{i=1}^{{S}_{12}}[1-{(1-{\theta}_{i1})}^{{n}_{1}}][{n}_{2}({n}_{2}-1)/2]{\theta}_{i2}^{2}{(1-{\theta}_{i2})}^{{n}_{2}-2}.$$

The following Cauchy–Schwarz inequality

$$\left[\sum _{i=1}^{{S}_{12}}[1-{(1-{\theta}_{i1})}^{{n}_{1}}]{(1-{\theta}_{i2})}^{{n}_{2}}\right]\left[\sum _{i=1}^{{S}_{12}}[1-{(1-{\theta}_{i1})}^{{n}_{1}}]{\theta}_{i2}^{2}{(1-{\theta}_{i2})}^{{n}_{2}-2}\right]\ge {\left[\sum _{i=1}^{{S}_{12}}[1-{(1-{\theta}_{i1})}^{{n}_{1}}]{\theta}_{i2}{(1-{\theta}_{i2})}^{{n}_{2}-1}\right]}^{2},$$

leads to

$$\mathrm{E}\left({f}_{+0}\right)\ge \frac{({n}_{2}-1)}{{n}_{2}}\frac{{\left[\mathrm{E}\left({f}_{+1}\right)\right]}^{2}}{2\mathrm{E}\left({f}_{+2}\right)}.$$

(3.3)

The equality holds if and only if community II is homogeneous in species probabilities.

- Similarly, a lower bound for E(*f*_{0+}) is

$$\mathrm{E}\left({f}_{0+}\right)\ge \frac{({n}_{1}-1)}{{n}_{1}}\frac{{\left[\mathrm{E}\left({f}_{1+}\right)\right]}^{2}}{2\mathrm{E}\left({f}_{2+}\right)}.$$

(3.4)

The equality holds if and only if community I is homogeneous in species probabilities.

- A lower bound for E(*f*_{00}) is obtained by noting

$$\mathrm{E}\left({f}_{11}\right)=\sum _{i=1}^{{S}_{12}}{n}_{1}{\theta}_{i1}{(1-{\theta}_{i1})}^{{n}_{1}-1}{n}_{2}{\theta}_{i2}{(1-{\theta}_{i2})}^{{n}_{2}-1},$$

$$\mathrm{E}\left({f}_{22}\right)=\sum _{i=1}^{{S}_{12}}[{n}_{1}({n}_{1}-1)/2]{\theta}_{i1}^{2}{(1-{\theta}_{i1})}^{{n}_{1}-2}[{n}_{2}({n}_{2}-1)/2]{\theta}_{i2}^{2}{(1-{\theta}_{i2})}^{{n}_{2}-2}.$$

Again, a similar Cauchy–Schwarz inequality

$$\left[\sum _{i=1}^{{S}_{12}}{(1-{\theta}_{i1})}^{{n}_{1}}{(1-{\theta}_{i2})}^{{n}_{2}}\right]\left[\sum _{i=1}^{{S}_{12}}{\theta}_{i1}^{2}{(1-{\theta}_{i1})}^{{n}_{1}-2}{\theta}_{i2}^{2}{(1-{\theta}_{i2})}^{{n}_{2}-2}\right]\ge {\left[\sum _{i=1}^{{S}_{12}}{\theta}_{i1}{(1-{\theta}_{i1})}^{{n}_{1}-1}{\theta}_{i2}{(1-{\theta}_{i2})}^{{n}_{2}-1}\right]}^{2},$$

gives

$$\mathrm{E}\left({f}_{00}\right)\ge \frac{({n}_{1}-1)}{{n}_{1}}\frac{({n}_{2}-1)}{{n}_{2}}\frac{{\left[\mathrm{E}\left({f}_{11}\right)\right]}^{2}}{4\mathrm{E}\left({f}_{22}\right)}.$$

(3.5)

Combining the above three lower bounds, we thus have a lower bound for the shared species richness (let *K*_{i} = (*n*_{i} − 1)/*n*_{i}):

$${\widehat{S}}_{12}={D}_{12}+{K}_{1}\frac{{f}_{1+}^{2}}{2{f}_{2+}}+{K}_{2}\frac{{f}_{+1}^{2}}{2{f}_{+2}}+{K}_{1}{K}_{2}\frac{{f}_{11}^{2}}{4{f}_{22}}.$$

(3.6)

In many cases, the sample sizes *n*_{1} and *n*_{2} are large for abundance data; thus, the terms (*n*_{1} − 1)/*n*_{1} and (*n*_{2} − 1)/*n*_{2} can be dropped in the above formula. The estimator in Equation (3.6) can be regarded as an extension of the Chao1 estimator to two communities. When *f*_{2+} = 0 or *f*_{+2} = 0, a bias-corrected estimator is

$${\stackrel{~}{S}}_{12}={D}_{12}+{K}_{2}\frac{{f}_{+1}({f}_{+1}-1)}{2({f}_{+2}+1)}+{K}_{1}\frac{{f}_{1+}({f}_{1+}-1)}{2({f}_{2+}+1)}+{K}_{1}{K}_{2}\frac{{f}_{11}({f}_{11}-1)}{4({f}_{22}+1)}.$$

(3.7)

Note that only observed shared species are involved in the formulas of Equations (3.6) and (3.7); thus observed nonshared species play no role in our estimation, although any observed nonshared species could actually be a shared species.

Because the proposed estimator can be regarded as a function of the statistics (*D*_{12}, *f*_{11}, *f*_{22}, *f*_{1+}, *f*_{2+}, *f*_{+1}, *f*_{+2}), we obtain a variance estimator by using a standard asymptotic approach under a multinomial model. Then the estimated variance can be used to construct a confidence interval for the true parameter using a log-transformation (Chao 1987).
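The two-community bounds in Equations (3.6) and (3.7) can be computed from two aligned abundance vectors. The sketch below is illustrative only: the function name `shared_lower_bound`, the convention that position *i* in both vectors refers to the same species, and the decision to raise an error when a required doubleton count is zero are all assumptions, and no variance estimate is attempted.

```python
def shared_lower_bound(x1, x2, bias_corrected=False):
    """Lower bound for the number of species shared by two communities.

    x1[i], x2[i]: abundance of species i in samples I and II.
    """
    n1, n2 = sum(x1), sum(x2)
    k1, k2 = (n1 - 1) / n1, (n2 - 1) / n2
    # Only species observed in both samples enter the formulas.
    shared = [(a, b) for a, b in zip(x1, x2) if a >= 1 and b >= 1]
    d12 = len(shared)
    f1p = sum(1 for a, _ in shared if a == 1)            # f_{1+}
    f2p = sum(1 for a, _ in shared if a == 2)            # f_{2+}
    fp1 = sum(1 for _, b in shared if b == 1)            # f_{+1}
    fp2 = sum(1 for _, b in shared if b == 2)            # f_{+2}
    f11 = sum(1 for a, b in shared if a == 1 and b == 1)
    f22 = sum(1 for a, b in shared if a == 2 and b == 2)
    if bias_corrected:                                   # Equation (3.7)
        return (d12
                + k2 * fp1 * (fp1 - 1) / (2 * (fp2 + 1))
                + k1 * f1p * (f1p - 1) / (2 * (f2p + 1))
                + k1 * k2 * f11 * (f11 - 1) / (4 * (f22 + 1)))
    if 0 in (f2p, fp2, f22):
        raise ValueError("a doubleton count is zero; use bias_corrected=True")
    return (d12                                          # Equation (3.6)
            + k1 * f1p ** 2 / (2 * f2p)
            + k2 * fp1 ** 2 / (2 * fp2)
            + k1 * k2 * f11 ** 2 / (4 * f22))


# Hypothetical data: 9 species in total, 6 of them observed in both samples.
sample_1 = [1, 1, 2, 1, 2, 3, 5, 0, 1]
sample_2 = [1, 1, 2, 2, 1, 4, 0, 2, 0]
print(shared_lower_bound(sample_1, sample_2))
print(shared_lower_bound(sample_1, sample_2, bias_corrected=True))
```

In this example *D*_{12} = 6, *f*_{1+} = *f*_{+1} = 3, *f*_{2+} = *f*_{+2} = 2, *f*_{11} = 2, and *f*_{22} = 1; the three species seen in only one sample do not affect the result except through the sample sizes.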

The method developed for abundance data can be directly adapted to deal with the replicated incidence case. All notation and model formulation are similar to those in Section 3.1. Assume that there are *t*_{1} samples randomly taken from community I and *t*_{2} samples from community II. In each sample, only presence/absence data are recorded. The two sets of probabilities (*θ*_{11}, *θ*_{21}, … , *θ*_{S1,1}) and (*θ*_{12}, *θ*_{22}, … , *θ*_{S2,2}) in the incidence case represent species detection probabilities in any sample from communities I and II, respectively.

Let *X*_{i1} and *X*_{i2} denote the number of samples in which the *i*th species is detected in communities I and II, respectively. Let ${Q}_{jk}={\sum}_{i=1}^{{S}_{12}}I({X}_{i1}=j,{X}_{i2}=k)$ denote the number of *shared* species that are detected in *j* samples in community I and *k* samples in community II. Similarly, we can define *Q*_{j+} and *Q*_{+k}. By applying a method analogous to that in Section 3.1, it can be shown that the lower bound ${\widehat{S}}_{12}$ and the bias-corrected version ${\stackrel{~}{S}}_{12}$ for the number of shared species based on incidence counts have the same forms as in Equations (3.6) and (3.7), except that the sample sizes *n*_{1} and *n*_{2} should be replaced, respectively, by *t*_{1} and *t*_{2}, and abundance counts replaced by incidence counts. We remark that an approximate estimator of shared species richness was derived in Chao, Shen, and Hwang (2006) for both types of data based on the Laplace approximation formula, but that estimator cannot be theoretically verified to be a lower bound.

The approach proposed in Section 3 has an obvious extension to the case of more than two communities. We first describe the derivation for three communities. Extension to more than three communities is direct. Here a “shared” species is defined as a species that belongs to *all* communities. Assume that there are *S*_{123} species shared by three communities I, II, and III and a random sample is taken from each of the three communities. The three samples are called samples I, II, and III with sizes *n*_{1}, *n*_{2}, and *n*_{3}, respectively. Let *D*_{123} denote the observed shared species richness in the three samples. Then

$${S}_{123}={D}_{123}+{f}_{++0}+{f}_{+0+}+{f}_{0++}+{f}_{00+}+{f}_{0+0}+{f}_{+00}+{f}_{000},$$

(4.1)

where *f*_{++0} denotes the number of *shared* species that are observed in samples I and II but not in sample III, *f*_{000} denotes the number of *shared* species that are undetected in all three samples, and the other terms in Equation (4.1) are interpreted similarly.

- Based on a similar type of inequality as in Equations (3.3) and (3.4), we can get a lower bound for the expected value of each term of *f*_{++0} + *f*_{+0+} + *f*_{0++}, as shown in the second to fourth terms on the right-hand side of Equation (4.2).
- Based on a similar type of inequality as in Equation (3.5), we can get a lower bound for the expected value of each term of *f*_{00+} + *f*_{0+0} + *f*_{+00}, as shown in the fifth to seventh terms of Equation (4.2).
- Extending Equation (3.5), we have a lower bound for E(*f*_{000}), as shown in the last term of Equation (4.2).

Combining the above, we have a lower bound for *S*_{123} as follows:

$${\widehat{S}}_{123}={D}_{123}+{K}_{3}\frac{{f}_{++1}^{2}}{2{f}_{++2}}+{K}_{2}\frac{{f}_{+1+}^{2}}{2{f}_{+2+}}+{K}_{1}\frac{{f}_{1++}^{2}}{2{f}_{2++}}+{K}_{1}{K}_{2}\frac{{f}_{11+}^{2}}{4{f}_{22+}}+{K}_{1}{K}_{3}\frac{{f}_{1+1}^{2}}{4{f}_{2+2}}+{K}_{2}{K}_{3}\frac{{f}_{+11}^{2}}{4{f}_{+22}}+{K}_{1}{K}_{2}{K}_{3}\frac{{f}_{111}^{2}}{8{f}_{222}}.$$

(4.2)

An estimated variance can be obtained by an asymptotic method. Extending Equations (3.6) and (4.2) with self-explanatory notation generalization, we have a lower bound for four communities:

$${\widehat{S}}_{1234}={D}_{1234}+{K}_{1}\frac{{f}_{1+++}^{2}}{2{f}_{2+++}}+\cdots +{K}_{4}\frac{{f}_{+++1}^{2}}{2{f}_{+++2}}+{K}_{1}{K}_{2}\frac{{f}_{11++}^{2}}{4{f}_{22++}}+{K}_{1}{K}_{3}\frac{{f}_{1+1+}^{2}}{4{f}_{2+2+}}+\cdots +{K}_{3}{K}_{4}\frac{{f}_{++11}^{2}}{4{f}_{++22}}+{K}_{1}{K}_{2}{K}_{3}\frac{{f}_{111+}^{2}}{8{f}_{222+}}+{K}_{1}{K}_{2}{K}_{4}\frac{{f}_{11+1}^{2}}{8{f}_{22+2}}+\cdots +{K}_{2}{K}_{3}{K}_{4}\frac{{f}_{+111}^{2}}{8{f}_{+222}}+{K}_{1}{K}_{2}{K}_{3}{K}_{4}\frac{{f}_{1111}^{2}}{16{f}_{2222}}.$$

(4.3)

Thus, we have provided a unified approach to formulating lower bounds for any number of communities. However, the variance estimator becomes quite complicated as the number of communities increases. Currently, we have variance estimators only up to five communities.

Based on Equations (4.2) and (4.3), similar lower bounds for replicated incidence data can be obtained by replacing frequency counts by incidence counts and each sample size by the number of samples. Variance estimators are derived in an analogous way.
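The pattern in Equations (3.6), (4.2), and (4.3) is one term per non-empty subset of communities: for a subset of size *r*, the product of the corresponding *K*_{i}, times the squared count of shared species observed exactly once in every community of the subset, divided by 2^{r} times the count observed exactly twice in every community of the subset. The sketch below implements that pattern for any number of communities. It is an illustration rather than the authors' implementation: the names `shared_lower_bound_n`, `samples`, and `sizes` are hypothetical, and raising an error on a zero denominator is an assumed convention. For replicated incidence data, pass the numbers of samples through `sizes`.

```python
from itertools import combinations
from math import prod


def shared_lower_bound_n(samples, sizes=None):
    """Lower bound for the number of species shared by all N communities.

    samples: N equal-length lists; samples[j][i] is the count of species i
             in community j (abundance, or number of samples with detection).
    sizes:   sample sizes n_j (abundance data, default: column totals) or
             numbers of samples t_j (replicated incidence data).
    """
    if sizes is None:
        sizes = [sum(x) for x in samples]
    ks = [(m - 1) / m for m in sizes]
    # Species observed in every community's sample.
    shared = [row for row in zip(*samples) if all(c >= 1 for c in row)]
    estimate = float(len(shared))
    for r in range(1, len(samples) + 1):
        for subset in combinations(range(len(samples)), r):
            ones = sum(all(row[i] == 1 for i in subset) for row in shared)
            twos = sum(all(row[i] == 2 for i in subset) for row in shared)
            if twos == 0:
                raise ValueError(f"zero denominator for communities {subset}")
            estimate += prod(ks[i] for i in subset) * ones ** 2 / (2 ** r * twos)
    return estimate


# Hypothetical data for three communities, six species, all observed everywhere.
three = [[1, 1, 2, 1, 2, 3],
         [1, 1, 2, 2, 1, 3],
         [1, 1, 2, 1, 2, 3]]
print(shared_lower_bound_n(three))
```

With two communities the function reduces to Equation (3.6), and with three it reproduces the eight terms of Equation (4.2). The number of terms grows as 2^{N} − 1, so sparse data make zero denominators increasingly likely as *N* increases.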

A total of 51 soil samples were taken from three areas of Namibia; see Table 1 for a description of relevant data information. Generally, collections were made from a variety of soil and vegetation types of the respective area. About 10 small subsamples were taken from an area of about 100 m^{2} and mixed to a composite soil sample. In each soil sample, presence/absence of ciliate species was recorded. Species were determined by combining live observation, silver impregnation, and scanning electron microscopy. Detailed sampling locations, procedures, and species identification were described in Foissner (1999, 2006) and Foissner, Agatha, and Berger (2002). After presence/absence of soil ciliate species was recorded for each sample, the replicated incidence data were merged by species identity and a total of 331 species were recorded in our data. All data in EXCEL spreadsheets are available from the authors upon request.

We illustrate one-community species richness estimation for each area and shared species richness estimation for any two areas (three combinations) and for all three areas; see Table 2. We provide for each case a lower bound for species richness or shared species richness in Table 2. The communities considered in our applications are highly heterogeneous and thus we adopt the original form of estimators. That is, our estimates are calculated from Equations (2.4), (3.6), and (4.2) with sample size there being replaced by the number of samples and frequency counts being replaced by incidence counts. The bias-corrected formulas which are derived under a homogeneous case are not reported here. For each estimate, its associated SE (standard error) as well as the 95% confidence interval based on a log-transformation are also shown in Table 2. The percentage of undetected shared species with respect to the estimated minimum is given in the last column.

All estimates indicate that there is still a substantial fraction of undetected species and shared species in the current data. For the alpha diversity, on average, about 41% of species diversity is still undetected. This is consistent with the finding in Chao et al. (2006). For estimating shared species richness, the observed number of shared species substantially underestimates the true number of shared species. Our approach reveals the extent of underestimation by the observed number of shared species (on average 42% for any two areas and 48% for three areas) and provides helpful information for understanding community overlap of microorganisms.

Since our data analysis was based on replicated incidence data, we carried out a simulation study to investigate the performance of the proposed lower bounds for such kinds of data. We examined the shared species richness estimation for two and three communities. In each community, five types of species detection probabilities were considered: one homogeneous and four heterogeneous communities with 200 species in each. The five sets of species detection probabilities along with their average ($\overline{\theta}$) and coefficient of variation (CV) are given as follows:

- Type I: *θ*_{i} = 0.10, *i* = 1, … , 200 ($\overline{\theta}$ = 0.10, CV = 0.0),
- Type II: *θ*_{i} ~ Uniform(0, 1), *i* = 1, … , 200 ($\overline{\theta}$ = 0.50, CV = 0.58),
- Type III: *θ*_{i} ~ Beta(1, 2), *i* = 1, … , 200 ($\overline{\theta}$ = 0.33, CV = 0.71),
- Type IV: *θ*_{i} = 10/(*i* + 10), *i* = 1, … , 200 ($\overline{\theta}$ = 0.15, CV = 1.01),
- Type V: *θ*_{i} = 3/(*i* + 3), *i* = 1, … , 200 ($\overline{\theta}$ = 0.06, CV = 1.55).

Type I denotes a homogeneous case, i.e., all species have the same probability of being detected in any sample; Types II and III assume that the probabilities represent a random sample, respectively, from a uniform or a beta density. (That is, for each simulation trial, we generated a sample of size 200 as species probabilities.) Types IV and V are in the form of a truncated logarithmic series, which is widely used in modeling natural frequency data. It is also called Zipf’s law in linguistics and behavioral sciences. The value of CV characterizes the degree of heterogeneity among detection probabilities.
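The five sets of detection probabilities can be generated and summarized in a few lines. This is a sketch under stated assumptions: the seed is arbitrary, the helper name `mean_and_cv` is hypothetical, and the CV is computed with the sample standard deviation (divisor 199), which is an assumption about how the reported values were obtained. For Types II and III the summary varies from draw to draw around the theoretical values.

```python
import random
import statistics


def mean_and_cv(thetas):
    """Average and coefficient of variation of detection probabilities."""
    mean = statistics.fmean(thetas)
    return mean, statistics.stdev(thetas) / mean   # sample SD (assumption)


S = 200
rng = random.Random(1)  # arbitrary seed, for reproducibility only
types = {
    "I": [0.10] * S,
    "II": [rng.random() for _ in range(S)],             # Uniform(0, 1)
    "III": [rng.betavariate(1, 2) for _ in range(S)],   # Beta(1, 2)
    "IV": [10 / (i + 10) for i in range(1, S + 1)],
    "V": [3 / (i + 3) for i in range(1, S + 1)],
}
for name, thetas in types.items():
    mean, cv = mean_and_cv(thetas)
    print(f"Type {name}: mean = {mean:.2f}, CV = {cv:.2f}")
```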

We considered all 15 possible combination cases of any two communities: (I versus I), (I versus II), … , (V versus V) as our target communities. We assume that the first 120 species are the shared species. Thus *S*_{1} = *S*_{2} = 200 and *S*_{12} = 120. Table 3 presents the simulation results for the case of two communities.

Table 3. Simulation results for two communities (the true parameter *S*_{12} = 120; simulation trials = 2000). ${\widehat{S}}_{12}$: the original lower bound; ${\stackrel{~}{S}}_{12}$: the bias-corrected form.

For three communities, we considered 35 possible combination cases of any three communities: (I, I, I), (I, I, II), … , (V, V, V). We assumed that *S*_{1} = *S*_{2} = *S*_{3} = 200, *S*_{12} = *S*_{13} = *S*_{23} = 120, and *S*_{123} = 80. The overlap structure is described as follows: (a) the first 80 species in each community are shared by all three communities, (b) the last 40 species in each community are unique species, and (c) the species shared by communities I and II are the 81 ~ 120th species in community I and the 81 ~ 120th species in community II; the species shared by communities I and III are the 121 ~ 160th species in community I and the 81 ~ 120th species in community III; and the species shared by communities II and III are the 121 ~ 160th species in community II and the 121 ~ 160th species in community III. Table 4 presents the simulation results for the case of three communities.

Table 4. Simulation results for three communities (the true parameter *S*_{123} = 80; simulation trials = 2000). ${\widehat{S}}_{123}$: the original lower bound; ${\stackrel{~}{S}}_{123}$: the bias-corrected form.

For any fixed combination of communities, we generated 20 replicated incidence samples from each community according to a specified type of detection probabilities. Then for each generated dataset, the observed number of shared species was recorded; the original lower bound ${\widehat{S}}_{12}$ (or ${\widehat{S}}_{123}$) and the bias-corrected version ${\stackrel{~}{S}}_{12}$ (or ${\stackrel{~}{S}}_{123}$) as well as their SE estimates and associated 95% confidence intervals were obtained. The resulting averages in Tables 3 and 4 were based on 2000 simulation trials. The percentage of the 2000 simulated data sets in which the 95% confidence intervals covered the true parameter was recorded and is shown in the last column of each table.
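A reduced version of the two-community simulation can be sketched as follows. This is not the code used for Tables 3 and 4: the function names, the seed, and the smaller default number of trials are assumptions, trials in which a denominator of Equation (3.6) is zero are simply skipped, and no standard errors or confidence intervals are computed. Because a species observed in both samples must be a shared species, only the first 120 species of each community need to be simulated.

```python
import random


def detections(thetas, t, rng):
    """Number of the t samples in which each species is detected."""
    return [sum(rng.random() < th for _ in range(t)) for th in thetas]


def shared_bound(x1, x2, t1, t2):
    """Observed shared richness and incidence form of Equation (3.6).

    Returns (D12, None) when a required count of duplicates is zero.
    """
    k1, k2 = (t1 - 1) / t1, (t2 - 1) / t2
    both = [(a, b) for a, b in zip(x1, x2) if a >= 1 and b >= 1]
    q1p = sum(a == 1 for a, _ in both)
    q2p = sum(a == 2 for a, _ in both)
    qp1 = sum(b == 1 for _, b in both)
    qp2 = sum(b == 2 for _, b in both)
    q11 = sum(a == 1 and b == 1 for a, b in both)
    q22 = sum(a == 2 and b == 2 for a, b in both)
    if min(q2p, qp2, q22) == 0:
        return len(both), None
    est = (len(both) + k1 * q1p ** 2 / (2 * q2p) + k2 * qp1 ** 2 / (2 * qp2)
           + k1 * k2 * q11 ** 2 / (4 * q22))
    return len(both), est


def run(theta1, theta2, n_shared=120, t=20, trials=200, seed=0):
    """Average observed shared richness and average lower bound."""
    rng = random.Random(seed)
    d_sum = est_sum = 0.0
    used = 0
    for _ in range(trials):
        x1 = detections(theta1[:n_shared], t, rng)
        x2 = detections(theta2[:n_shared], t, rng)
        d12, est = shared_bound(x1, x2, t, t)
        if est is not None:
            d_sum, est_sum, used = d_sum + d12, est_sum + est, used + 1
    return d_sum / used, est_sum / used


# Case (I versus I): homogeneous detection probabilities of 0.10.
type_i = [0.10] * 200
print(run(type_i, type_i))
```

Replacing `type_i` with other probability vectors gives the remaining combinations; the gap between the two returned averages indicates how much of the undetected shared richness the bound recovers.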

From the two tables, the traditional approach of using the observed number of shared species as an estimator of shared species richness is clearly not appropriate. The observed number of shared species exhibits severe negative bias in all cases. When at least one set of detection probabilities is Type V (low average probability and high heterogeneity), the bias is substantial.

The performance of the lower bounds as estimators of shared species richness improves when more shared information is available. The magnitudes of bias, sample SE, and sample RMSE decrease as more shared species are observed. The bias-corrected bound is always lower than the original bound, but these two bounds are generally comparable with respect to RMSE. In terms of bias, the bias-corrected bound is useful when all communities are homogeneous, as in the case (I versus I) in Table 3, or when at least two communities are homogeneous, as in the three cases (I, I, I), (I, I, IV), and (I, I, V) in Table 4. This is expected because the bias-corrected form is derived under a homogeneous condition. Thus, except in the special case in which most communities are homogeneous, we suggest using the original lower bound. Since the CV of species detection probabilities measures the degree of heterogeneity, a CV estimator can be used to quantify the degree of heterogeneity present in data; see Chao et al. (2000).

When a sufficient amount of shared information is available (say, at least 70% of the shared species are observed), the lower bound in most cases is close to the true parameter. Thus it can be used as an estimator of shared species richness. When there are not sufficient shared data, our approach only provides a reliable lower bound. The magnitude of the downward bias mainly depends on the average and CV of the detection probabilities as well as the number of replicated samples. Further work is needed to determine more sophisticated guidelines about how large the samples should be to provide sufficient shared information.

Simulations also show that the estimated standard errors using the asymptotic method, although biased slightly downwards, are generally satisfactory when compared with the sample standard errors. The confidence interval based on the estimated SE for the original estimator performs reasonably well as most coverage probabilities are close to the anticipated nominal confidence coefficient of 95%.

Using the Cauchy–Schwarz inequality for the expected frequency counts based on abundance or replicated incidence data, we have developed a simple and useful lower bound for the number of species shared by multiple communities. The proposed lower bounds for abundance and replicated incidence data are natural extensions of the previous estimators used for a single community. Simulation results have shown that the performance of the lower bounds under several types of abundance distributions is generally satisfactory. The estimators discussed in this article will be featured in Program SPADE (Species Prediction And Diversity Estimation) following publication of this article (Chao and Shen 2003).

For estimating species richness in one community, we noted in Section 2 that our lower bound for the number of undetected species depends only on the numbers of singletons and doubletons (for abundance data) or of uniques and duplicates (for replicated incidence data). A similar advantage holds for two communities. For example, Equation (3.6) implies that the estimated number of undetected shared species for abundance data requires only the frequencies *f*_{1+}, *f*_{+1}, *f*_{2+}, *f*_{+2}, *f*_{11}, and *f*_{22}. As a result, exact frequencies are not needed for counts of three or more individuals in either community. Parallel conclusions are also valid for replicated incidence data and for more than two communities.
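To illustrate why exact counts of three or more are not needed, the following Python sketch tallies the six frequencies from two abundance vectors and checks that they are unchanged when every count is capped at 3. The data and function name are hypothetical. The definitions assumed here are that *f*_{1+} counts shared species that are singletons in sample 1 (with any positive count in sample 2), *f*_{11} counts shared species that are singletons in both samples, and similarly for the others. The snippet does not reproduce Equation (3.6) itself.

```python
def shared_frequency_counts(x, y):
    """Tally shared-species frequencies; x[i], y[i] are counts of species i."""
    shared = [(a, b) for a, b in zip(x, y) if a > 0 and b > 0]
    return {
        "observed_shared": len(shared),
        "f1+": sum(1 for a, b in shared if a == 1),
        "f2+": sum(1 for a, b in shared if a == 2),
        "f+1": sum(1 for a, b in shared if b == 1),
        "f+2": sum(1 for a, b in shared if b == 2),
        "f11": sum(1 for a, b in shared if a == 1 and b == 1),
        "f22": sum(1 for a, b in shared if a == 2 and b == 2),
    }


# Hypothetical abundance data for eight species in two samples.
x = [1, 2, 1, 5, 0, 2, 9, 1]
y = [1, 2, 4, 1, 3, 2, 0, 0]

counts = shared_frequency_counts(x, y)

# Capping every count at 3 ("three or more") leaves all tallies unchanged.
x_capped = [min(a, 3) for a in x]
y_capped = [min(b, 3) for b in y]
counts_capped = shared_frequency_counts(x_capped, y_capped)
print(counts)
print(counts == counts_capped)
```

In practice this means a field record of "3+" is as informative for these tallies as an exact count.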

One critical assumption of our sampling model for abundance data is that individuals are randomly selected *with* replacement from each of the target communities. Under this assumption, the species frequencies follow a multinomial distribution. In the case of sampling *without* replacement, however, the corresponding distribution becomes a generalized hypergeometric distribution, which is less mathematically tractable. In addition, the sampling fraction (i.e., the ratio of sample size to total population size) should then be considered in the model framework. Research on sampling without replacement is still ongoing. Also, for multiple incidence data, one restrictive assumption is that the species detection probability, although allowed to vary among species, is constant across all samples. This assumption may not be satisfied if samples are taken from areas where species occurrences are spatially aggregated.
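The distinction between the two sampling schemes can be sketched as follows. Drawing individuals with replacement from a finite community yields multinomial species frequencies, whereas drawing without replacement yields (generalized) hypergeometric frequencies that depend on the sampling fraction. The community composition, sample size, and function name below are hypothetical.

```python
import random
from collections import Counter


def draw_sample(community, n, replace, rng):
    """Draw n individuals and return the species frequencies in the sample."""
    # Expand the community into one entry per individual.
    individuals = [sp for sp, count in community.items() for _ in range(count)]
    if replace:
        drawn = rng.choices(individuals, k=n)   # with replacement: multinomial
    else:
        drawn = rng.sample(individuals, n)      # without replacement: hypergeometric
    return Counter(drawn)


rng = random.Random(1)  # fixed seed so the sketch is reproducible
community = {"A": 50, "B": 30, "C": 15, "D": 4, "E": 1}  # hypothetical community
n = 40

with_replacement = draw_sample(community, n, replace=True, rng=rng)
without_replacement = draw_sample(community, n, replace=False, rng=rng)
sampling_fraction = n / sum(community.values())
print(with_replacement, without_replacement, sampling_fraction)
```

Without replacement, a species' sample frequency can never exceed its community abundance, which is one reason the sampling fraction enters the model.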

In our proposed lower bounds, we did not consider relevant covariate information such as distance between communities and habitat types. Hillebrand et al. (2001) and Green et al. (2004) used species overlap information to assess the similarity of microbial assemblages as a function of geographic distance, and found a distance-decay relationship. Thus, communities that are similar (geographically close and of similar habitat) would generally have more overlap than those farther apart. How to incorporate covariate information in the estimation of shared species richness merits more research.

Boulinier et al. (1998) pointed out a simple analogy between the species replicated incidence data in a community and capture-recapture studies of a closed population. Thus, the estimation of species richness in a community based on replicated incidence data is equivalent to the estimation of the size of a population based on capture-recapture data. The analogy can be extended to the general case of multiple communities. That is, the estimation of shared species richness based on multiple incidence data from each community is equivalent to the estimation of the size of a shared population based on capture-recapture data from each population. Consequently, the proposed methodology for replicated incidence data can be directly applied to estimate the size of a shared population. This application and relevant topics are currently under investigation; see Chao, Pan, and Chiang (2008).
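The analogy can be made concrete with a small sketch. In the matrix below, rows may be read either as species and columns as replicated samples, or as individuals and columns as capture occasions; the same tallies of uniques and duplicates apply in both readings. The data are hypothetical. The bound computed is the classical single-population form of Chao (1987), observed count plus Q1²/(2·Q2), shown here as an assumption without any correction factor that the bounds in this article may include; it is not the shared-population bound itself.

```python
def incidence_lower_bound(matrix):
    """Uniques, duplicates, and a Chao (1987)-type lower bound from 0/1 rows."""
    detections = [sum(row) for row in matrix]
    observed = sum(1 for d in detections if d > 0)   # detected at least once
    q1 = sum(1 for d in detections if d == 1)        # uniques (caught once)
    q2 = sum(1 for d in detections if d == 2)        # duplicates (caught twice)
    if q2 == 0:
        raise ValueError("no duplicates: this simple form is undefined")
    return observed, q1, q2, observed + q1 ** 2 / (2 * q2)


# Hypothetical data: 7 rows (species or individuals), 4 columns (samples or occasions).
matrix = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],  # never detected, so invisible to the observer
]

result = incidence_lower_bound(matrix)
print(result)
```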

This work was supported by the Taiwan National Science Council (Project 96-2118-M007-001) to HYP and AC, and by the Austrian Science Foundation (FWF project P19699-B17) to WF. Part of the material is based on the Ph.D. work of the first author under the supervision of the second author. The authors thank the Editor, Associate Editor, and two reviewers for carefully reading the manuscript and providing very thoughtful comments and suggestions, which significantly improved the article.

- Bohannan BJM, Hughes J. New Approaches to Analyzing Microbial Biodiversity Data. Current Opinion in Microbiology. 2003;6:182–187.
- Boulinier T, Nichols JD, Sauer JR, Hines JE, Pollock KH. Estimating Species Richness: The Importance of Heterogeneity in Species Detectability. Ecology. 1998;79:1018–1028.
- Bunge J, Fitzpatrick M. Estimating the Number of Species: A Review. Journal of the American Statistical Association. 1993;88:364–373.
- Chao A. Nonparametric Estimation of the Number of Classes in a Population. Scandinavian Journal of Statistics. 1984;11:265–270.
- Chao A. Estimating the Population Size for Capture-Recapture Data With Unequal Catchability. Biometrics. 1987;43:783–791.
- Chao A. Estimating Population Size for Sparse Data in Capture-Recapture Experiments. Biometrics. 1989;45:427–438.
- Chao A. Species Estimation and Applications. In: Balakrishnan N, Read CB, Vidakovic B, editors. Encyclopedia of Statistical Sciences. 2nd ed. Vol. 12. Wiley; New York: 2005. pp. 7907–7916.
- Chao A, Shen TJ. Program SPADE (Species Prediction And Diversity Estimation) 2003. program and user’s guide at http://chao.stat.nthu.edu.tw.
- Chao A, Hwang W-H, Chen Y-C, Kuo C-Y. Estimating the Number of Shared Species in Two Communities. Statistica Sinica. 2000;10:227–246.
- Chao A, Li PC, Agatha S, Foissner W. A Statistical Approach to Estimate Soil Ciliate Diversity and Distribution Based on Data From Five Continents. Oikos. 2006;114:479–493.
- Chao A, Pan HY, Chiang SC. The Petersen–Lincoln Estimator and Its Extension to Estimate the Size of a Shared Population. Biometrical Journal. 2008;50:957–970.
- Chao A, Shen T-J, Hwang W-H. Application of Laplace’s Boundary-Mode Approximations to Estimate Species and Shared Species Richness. Australian and New Zealand Journal of Statistics. 2006;48:117–128.
- Colwell RK. EstimateS: Statistical Estimation of Species Richness and Shared Species From Samples. 2004. Version 7.5, user’s guide and application published at http://viceroy.eeb.uconn.edu/estimates.
- Colwell RK, Coddington JA. Estimating Terrestrial Biodiversity Through Extrapolation. Philosophical Transactions of the Royal Society of London B: Biological Sciences. 1994;345:101–118.
- Fisher RA, Corbet AS, Williams CB. The Relation Between the Number of Species and the Number of Individuals in a Random Sample of an Animal Population. Journal of Animal Ecology. 1943;12:42–58.
- Foissner W. Protist Diversity: Estimates of the Near Imponderable. Protist. 1999;150:363–368.
- Foissner W. Biogeography and Dispersal of Micro-Organisms: A Review Emphasizing Protists. Acta Protozoologica. 2006;45:111–136.
- Foissner W, Agatha S, Berger H. Soil Ciliates (Protozoa, Ciliophora) From Namibia (Southwest Africa), With Emphasis on Two Contrasting Environments, the Etosha Region and the Namib Desert. Denisia. 2002;5:1–1459.
- Green J, Holmes AJ, Westoby M, Oliver I, Briscoe D, Dangerfield M, Gillings M, Beattie AJ. Spatial Scaling of Microbial Eukaryote Diversity. Nature. 2004;432:747–753.
- Hillebrand H, Watermann F, Karez R, Berninger U-G. Differences in Species Richness Patterns Between Unicellular and Multicellular Organisms. Oecologia. 2001;126:114–124.
- Hughes JB, Hellmann JJ, Ricketts TH, Bohannan BJM. Counting the Uncountable: Statistical Approaches to Estimating Microbial Diversity. Applied and Environmental Microbiology. 2001;67:4399–4406.
- MacArthur RH. On the Relative Abundances of Bird Species. Proceedings of the National Academy of Sciences. 1957;43:293–295.
- Magurran AE. Measuring Biological Diversity. Blackwell; Oxford, U.K.: 2004.
- Schloss PD, Handelsman J. Introducing DOTUR, a Computer Program for Defining Operational Taxonomic Units and Estimating Species Richness. Applied and Environmental Microbiology. 2005;71:1501–1506.
- Seber GAF. The Estimation of Animal Abundance. 2nd ed. Griffin; London: 1982.
- Stach JEM, Maldonado LA, Masson DG, Ward AC, Goodfellow WM, Bull AT. Statistical Approaches for Estimating Actinobacterial Diversity in Marine Sediments. Applied and Environmental Microbiology. 2003;69:6189–6200.
- Turner W, Leitner W, Rosenzweig ML. WS2m: Software for Estimating Diversity. 1999. program and user’s guide at http://eebweb.arizona.edu/diversity.
- Walther BA, Morand S. Comparative Performance of Species Richness Estimation Methods. Parasitology. 1998;116:395–405.
