In biological and ecological statistical inference, it is practically useful to provide a lower bound for species richness in a community. Chao (1984, 1989) derived a nonparametric lower bound for species richness in a single community. However, there have been no lower bounds proposed in the literature for the number of species shared by multiple communities. Based on sample species abundance or replicated incidence records from each of the N communities, we derive in this article a nonparametric approach to constructing a lower bound for the number of species shared by N (N ≥ 2) communities. The approach is valid for all types of species abundance distributions (for abundance data) or species detection probabilities (for replicated incidence data). Variance estimators for the proposed lower bounds are obtained by using typical asymptotic theory. Simulation results are reported to examine the performance of the lower bounds. Replicated incidence data of ciliate species collected in three areas from Namibia, southwest Africa, are used for illustration. We also briefly discuss the application of the proposed method to estimate the size of a shared population (i.e., the number of individuals in the intersection of multiple populations) based on capture-recapture data from each population.
Species richness in a single community (alpha diversity) is a classic concept for characterizing community diversity. The estimation of species richness has been extensively discussed in the literature; see Seber (1982), Bunge and Fitzpatrick (1993), Colwell and Coddington (1994), and Chao (2005) for reviews. For multiple communities, the number of shared species plays an important role in describing community overlap and forms a basis for constructing various types of similarity indices or beta diversity. Compared with species richness in one community, the estimation of shared species richness in multiple communities has received relatively little attention. Although estimators for shared species richness in two communities have been proposed (e.g., Chao et al. 2000; Chao, Shen, and Hwang 2006), these methods have not been extended to more than two communities.
It is intuitively understood that, if there are many undetectable or “invisible” species in a hyper-diverse community, then it is impossible to obtain a good estimate of species richness. Therefore, it is practically useful to provide a lower bound for species richness. A nonparametric lower bound in a single community was derived by Chao (1984, 1989) for abundance-based data and for replicated incidence (i.e., presence or absence) data. The Chao (1984, 1989) lower bound has been applied in various disciplines. For example, microbiologists used it to infer species richness in microbial hyper-diverse communities (Hughes et al. 2001; Bohannan and Hughes 2003; Stach et al. 2003; Schloss and Handelsman 2005). However, there have been no lower bounds proposed for the number of species shared by multiple communities.
Part of this research was initiated by analyzing soil ciliate species data collected in three areas of Namibia, southwest Africa by Foissner and colleagues (Foissner, Agatha, and Berger 2002). See Table 1 and Section 5 for data and detailed analysis. Questions concerning the alpha, beta, and gamma diversity of microorganisms and their biogeographical distribution (ubiquity or endemicity) have generated extensive discussion in the literature. However, previous diversity analysis for soil ciliate species (Chao et al. 2006) was limited to alpha diversity in a single community (or area). In order to investigate the community overlap or beta diversity based on multiple-community data in Namibia, we were motivated to estimate the number of species shared by at least two communities (or areas).
When sample abundance or replicated incidence records are available from each of the N (N ≥ 2) communities, we propose in this article a unified approach to constructing a nonparametric lower bound for the number of species shared by N communities. The proposed method is nonparametric in the sense that it does not depend on assumptions about the species abundance distribution (for abundance data) or the species detection probabilities (for replicated incidence data). Variance estimators for the proposed lower bounds are also obtained.
Since a review of the details on deriving the lower bound of species richness in a single community (Chao 1984, 1989) would greatly help to extend the framework to multiple communities, we provide such a review in Section 2 separately for abundance data (in Section 2.1) and replicated incidence data (in Section 2.2). In Section 3, we develop a lower bound for the number of species shared by two communities. In Section 4, a unified approach is described for the case of more than two communities. In Section 5, the replicated incidence data for ciliate species that motivated this research are analyzed as an illustrative example. Section 6 reports a simulation study in order to examine the performance of the proposed method. Some concluding remarks and relevant discussion are provided in Section 7.
We first review the lower bound of species richness (Chao 1984, 1989) in a single community. Assume that there are S species indexed from 1 to S and that a fixed number n of individuals are independently observed in the community. Denote the species probabilities by (θ1, θ2, … , θS), where $\sum_{i=1}^{S} \theta_i = 1$. That is, θi is the probability that any randomly selected individual is classified to the ith species. Each probability is a combination of species abundance and individual detectability. If all individuals in the community have the same probability of being detected, then the species probabilities represent the true relative abundances.
Let Xi (species frequency) be the number of times, or individuals, that the ith species is observed in the sample, i = 1, 2, … , S. Only those species with Xi > 0 are observable in the sample. The species frequencies (X1, X2, … , XS) are assumed to follow a multinomial distribution with cell total n and probabilities (θ1, θ2, … , θS).
Let fk, k = 0, 1, … , n, (frequency counts) be the number of species represented exactly k times, or by k individuals, in the sample. That is, $f_k = \sum_{i=1}^{S} I(X_i = k)$, where I(A) is the usual indicator function, i.e., I(A) = 1 if the event A occurs, and 0 otherwise. Here, f0 denotes the number of undetected species in the sample. Thus, we have $\sum_{k=0}^{n} f_k = S$ and $\sum_{k=1}^{n} k f_k = n$. Let D denote the number of distinct species observed in the sample, that is, $D = \sum_{k=1}^{n} f_k$.
A parametric approach to estimating species richness is to assume that (θ1, θ2, … , θS) follows some type of distribution characterized by a few parameters. For example, Fisher, Corbet, and Williams (1943) assumed that $\theta_i = \lambda_i / \sum_{j=1}^{S} \lambda_j$, where (λ1, λ2, … , λS) are a random sample from a gamma distribution. MacArthur’s (1957) broken-stick model assumed that (λ1, λ2, … , λS) are a random sample from an exponential distribution. There are other types of abundance distributions; see Magurran (2004) for a review.
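Since S = D + f0, our estimating target becomes E(f0). Under the assumption that (θ1, θ2, … , θS) are fixed unknown parameters, the expected numbers of undetected species, singletons, and doubletons are, respectively,

$$E(f_0) = \sum_{i=1}^{S} (1-\theta_i)^{n}, \qquad E(f_1) = \sum_{i=1}^{S} n\,\theta_i (1-\theta_i)^{n-1}, \qquad E(f_2) = \sum_{i=1}^{S} \binom{n}{2} \theta_i^{2} (1-\theta_i)^{n-2}.$$

The Cauchy–Schwarz inequality

$$\left[\sum_{i=1}^{S} (1-\theta_i)^{n}\right] \left[\sum_{i=1}^{S} \theta_i^{2} (1-\theta_i)^{n-2}\right] \ge \left[\sum_{i=1}^{S} \theta_i (1-\theta_i)^{n-1}\right]^{2}$$

can then be applied to obtain a theoretical bound for E(f0):

$$E(f_0) \ge \frac{n-1}{n}\,\frac{[E(f_1)]^{2}}{2E(f_2)}. \tag{2.3}$$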
The inequality becomes an equality if and only if all probabilities are equal (a homogeneous case). If f2 > 0, we can replace the expected values by the observed data in Equation (2.3) and a lower bound for species richness becomes

$$\hat{S} = D + \frac{n-1}{n}\,\frac{f_1^{2}}{2 f_2}, \tag{2.4}$$

with the bound being achieved under a homogeneous community. We remark that instead of treating (θ1, θ2, … , θS) as fixed parameters, they can be modeled as random effects selected from an unknown distribution. Under a random-effect model, a parallel derivation results in the same estimator. A bias-corrected estimator in a homogeneous case turns out to be

$$\tilde{S} = D + \frac{n-1}{n}\,\frac{f_1 (f_1 - 1)}{2 (f_2 + 1)}.$$
The lower bound in Equation (2.4) was proposed by Chao (1984) using an alternative derivation. For abundance data, the sample size n is often large so that the term (n − 1)/n in the bound can be dropped and the estimator in Equation (2.4) is reduced to $\hat{S} = D + f_1^{2}/(2 f_2)$. This simplified estimator has been referred to as the Chao1 estimator in the biological and ecological literature (e.g., Colwell and Coddington 1994; Walther and Morand 1998; Hughes et al. 2001). It is also featured in several computer software packages including EstimateS (Colwell 2004), DOTUR (Schloss and Handelsman 2005), SPADE (Chao and Shen 2003), and WS2m (Turner, Leitner, and Rosenzweig 1999).
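Using a standard asymptotic approach, an approximate variance of the estimator in Equation (2.4) is

$$\widehat{\mathrm{var}}(\hat{S}) = f_2 \left[ \frac{K}{2} \left(\frac{f_1}{f_2}\right)^{2} + K^{2} \left(\frac{f_1}{f_2}\right)^{3} + \frac{K^{2}}{4} \left(\frac{f_1}{f_2}\right)^{4} \right],$$

where K = (n − 1)/n. When f2 = 0, we suggest using the bias-corrected form, and the lower bound becomes $\hat{S} = D + K f_1 (f_1 - 1)/2$. In this instance, the variance formula is modified to

$$\widehat{\mathrm{var}}(\hat{S}) = \frac{K f_1 (f_1 - 1)}{2} + \frac{K^{2} f_1 (2 f_1 - 1)^{2}}{4} - \frac{K^{2} f_1^{4}}{4 \hat{S}}.$$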
One advantage of using the Chao1 estimator is that the estimated number of undetected species depends only on the first two frequency counts, i.e., the numbers of singletons and doubletons. This implies that ecologists do not need to obtain the exact frequency of any species that has at least three individuals in the sample. The estimator is especially useful if counting the exact number of individuals for each species appearing in the sample requires substantial effort.
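As a minimal sketch of the computation described above, the following Python function evaluates the single-community lower bound D + K f1²/(2 f2), with K = (n − 1)/n, from a list of per-species sample counts, and switches to the form D + K f1(f1 − 1)/2 when f2 = 0. The function name `chao1_lower_bound` and the `keep_factor` switch are illustrative choices, not part of any cited software.

```python
from collections import Counter


def chao1_lower_bound(abundances, keep_factor=True):
    """Lower bound for species richness from per-species sample counts.

    abundances: iterable of non-negative integers, one per species.
    keep_factor: if False, the term (n - 1)/n is dropped (large-n form).
    """
    counts = [x for x in abundances if x > 0]
    n = sum(counts)              # sample size
    if n == 0:
        return 0.0               # nothing observed, nothing to extrapolate
    d = len(counts)              # observed richness D
    freq = Counter(counts)       # frequency counts f_k
    f1, f2 = freq[1], freq[2]    # singletons and doubletons
    k = (n - 1) / n if keep_factor else 1.0
    if f2 > 0:
        f0_hat = k * f1 ** 2 / (2 * f2)
    else:
        # bias-corrected form evaluated at f2 = 0
        f0_hat = k * f1 * (f1 - 1) / 2
    return d + f0_hat
```

Only the counts equal to 1 or 2 affect the estimated number of undetected species; any count of three or more contributes solely to D and n.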
In many microorganism surveys, only species presence/absence data can be collected because there are too many individuals to be counted. For example, in the ciliate species data (Section 5) and other microbial data, it is not possible to count exactly the number of individuals and thus only the presence/absence of each observed species was recorded. Accordingly, only replicated incidence data were available.
Assume that there are t samples and they are indexed 1, 2, … , t. We use the general term “sample” which could also refer to a team, occasion, transect line, a fixed period of time, or an investigator. The presence or absence of any species for these t samples is recorded to form a species-by-sample incidence matrix. In most applications, sufficient statistics from the species-by-sample incidence matrix are the incidence-based frequency counts (Q1, Q2, … , Qt), where Qk denotes the number of species that are detected in exactly k samples, k = 1, 2, … , t. Hence, Q1 represents the number of “unique” species (those that are detected in exactly one sample) and Q2 represents the number of “duplicate” species (those that are detected in exactly two samples).
Assume that the species detection probabilities, defined as the chance of encountering at least one individual of a given species in any sample, are (θ1, θ2, … , θS) and these probabilities are kept constant across the samples. We remark that, unlike the constraint $\sum_{i=1}^{S} \theta_i = 1$ in abundance data, the sum $\sum_{i=1}^{S} \theta_i$ may be greater than 1 for incidence-based data.
Parallel derivations to those in Section 2.1 can be made with n being replaced by t, and the counts (f1, f2, … , fn) replaced by (Q1, Q2, … , Qt). Therefore, an estimator based on t replicated incidence records for multiple samples has the form $\hat{S} = D + \frac{t-1}{t}\,\frac{Q_1^{2}}{2 Q_2}$, which is referred to in the literature as the Chao2 estimator. The number of samples t for incidence data may not be large, so we suggest retaining the term (t − 1)/t in the estimator. This estimator was originally derived by Chao (1987) for capture-recapture data as a lower bound. A bias-corrected form is $\tilde{S} = D + \frac{t-1}{t}\,\frac{Q_1 (Q_1 - 1)}{2 (Q_2 + 1)}$. See Chao and Shen (2003) for an approximate variance formula.
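The incidence-based version can be sketched in the same way. The function below takes a species-by-sample 0/1 matrix (a list of rows, one row per species), forms the counts of uniques Q1 and duplicates Q2, and returns D + ((t − 1)/t) Q1²/(2 Q2). The function name and the input layout are assumptions made for illustration, and the fallback for Q2 = 0 mirrors the f2 = 0 suggestion of Section 2.1 rather than a rule stated for incidence data.

```python
def chao2_lower_bound(incidence):
    """Lower bound for species richness from a species-by-sample 0/1 matrix.

    incidence: list of rows; incidence[i][j] is 1 if species i was
    detected in sample j and 0 otherwise. All rows have length t.
    """
    t = len(incidence[0])                          # number of samples
    detections = [sum(row) for row in incidence]   # samples per species
    d = sum(1 for c in detections if c > 0)        # observed richness D
    q1 = sum(1 for c in detections if c == 1)      # uniques
    q2 = sum(1 for c in detections if c == 2)      # duplicates
    k = (t - 1) / t                                # retained because t may be small
    if q2 > 0:
        q0_hat = k * q1 ** 2 / (2 * q2)
    else:
        # assumption: reuse the bias-corrected form at Q2 = 0
        q0_hat = k * q1 * (q1 - 1) / 2
    return d + q0_hat
```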
This section extends our approach to the estimation of the number of species shared by two communities. Assume that there are S1 species in community I and S2 species in community II. The species probabilities in communities I and II are denoted (θ11, θ21, … , θS1,1) and (θ12, θ22, … , θS2,2), respectively. Let the number of shared species be S12. Without loss of generality, we assume that the first S12 species are the shared species.
Two random samples (sample I with size n1 and sample II with size n2) are taken from communities I and II, respectively. Assume that D12 shared species are observed. Denote the observed frequencies in the two communities, respectively, by (X11, X21, … , XS1,1) and (X12, X22, … , XS2,2). Define for any two nonnegative integers j and k,

$$f_{jk} = \sum_{i=1}^{S_{12}} I(X_{i1} = j,\; X_{i2} = k).$$
That is, fjk denotes the number of shared species that are observed j times in sample I and k times in sample II. In particular, f11 denotes the number of shared species that are singletons in both samples, and f00 denotes the number of shared species that are undetected in both samples. Also, f+0 denotes the number of shared species that are observed in sample I but not observed in sample II, and a similar interpretation for f0+.
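Since S12 = D12 + f+0 + f0+ + f00 and only D12 is observable, our approach is to find a lower bound for each of the expected values of the other three terms, i.e., E(f+0), E(f0+), and E(f00). Assuming a multinomial model for each of the two sets of frequencies and independence of the two samples, we have

$$E(f_{00}) = \sum_{i=1}^{S_{12}} (1-\theta_{i1})^{n_1} (1-\theta_{i2})^{n_2},$$

$$E(f_{11}) = \sum_{i=1}^{S_{12}} n_1 n_2\, \theta_{i1}\theta_{i2} (1-\theta_{i1})^{n_1-1} (1-\theta_{i2})^{n_2-1},$$

$$E(f_{22}) = \sum_{i=1}^{S_{12}} \binom{n_1}{2}\binom{n_2}{2}\, \theta_{i1}^{2}\theta_{i2}^{2} (1-\theta_{i1})^{n_1-2} (1-\theta_{i2})^{n_2-2}.$$

With K1 = (n1 − 1)/n1 and K2 = (n2 − 1)/n2, the Cauchy–Schwarz inequality gives

$$E(f_{00}) \ge K_1 K_2\, \frac{[E(f_{11})]^{2}}{4E(f_{22})}.$$

Let fj+ denote the number of shared species that are observed j times in sample I and at least once in sample II, with a similar definition for f+k. Since

$$E(f_{+0}) = \sum_{i=1}^{S_{12}} \left[1-(1-\theta_{i1})^{n_1}\right] (1-\theta_{i2})^{n_2},$$

and E(f+1) and E(f+2) have the same form with $(1-\theta_{i2})^{n_2}$ replaced by $n_2\theta_{i2}(1-\theta_{i2})^{n_2-1}$ and $\binom{n_2}{2}\theta_{i2}^{2}(1-\theta_{i2})^{n_2-2}$, respectively, the same argument yields

$$E(f_{+0}) \ge K_2\, \frac{[E(f_{+1})]^{2}}{2E(f_{+2})}, \qquad E(f_{0+}) \ge K_1\, \frac{[E(f_{1+})]^{2}}{2E(f_{2+})}.$$

Replacing the expected values by the observed counts, we obtain a lower bound for the number of shared species:

$$\hat{S}_{12} = D_{12} + K_1 \frac{f_{1+}^{2}}{2 f_{2+}} + K_2 \frac{f_{+1}^{2}}{2 f_{+2}} + K_1 K_2 \frac{f_{11}^{2}}{4 f_{22}}. \tag{3.6}$$

As in the single-community case, a bias-corrected version is

$$\tilde{S}_{12} = D_{12} + K_1 \frac{f_{1+}(f_{1+}-1)}{2 (f_{2+}+1)} + K_2 \frac{f_{+1}(f_{+1}-1)}{2 (f_{+2}+1)} + K_1 K_2 \frac{f_{11}(f_{11}-1)}{4 (f_{22}+1)}. \tag{3.7}$$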
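The two-community bound can be sketched in code as follows. The function assumes the bound takes the form D12 + K1 f1+²/(2 f2+) + K2 f+1²/(2 f+2) + K1 K2 f11²/(4 f22), with Kj = (nj − 1)/nj, which is what the Cauchy–Schwarz argument gives for the three unobservable terms. The two count lists are assumed to be aligned over a common species list (zero where a species is absent from a sample). The names `shared_lower_bound_two` and `_term` are hypothetical, and the zero-denominator fallback is an assumption borrowed from the single-community suggestion for f2 = 0.

```python
def _term(ones, twos, factor, divisor):
    """One Cauchy-Schwarz term: factor * ones^2 / (divisor * twos)."""
    if twos > 0:
        return factor * ones ** 2 / (divisor * twos)
    # Assumption: when the denominator count is zero, use the
    # bias-corrected form ones*(ones-1)/(divisor*(twos+1)) at twos = 0.
    return factor * ones * (ones - 1) / divisor


def shared_lower_bound_two(x1, x2):
    """Lower bound for the number of species shared by two communities.

    x1[i], x2[i]: counts of species i in samples I and II, aligned over
    the same species list. Sample sizes are the totals of each list.
    """
    n1, n2 = sum(x1), sum(x2)
    k1, k2 = (n1 - 1) / n1, (n2 - 1) / n2
    pairs = list(zip(x1, x2))
    d12 = sum(1 for a, b in pairs if a > 0 and b > 0)   # observed shared
    f1p = sum(1 for a, b in pairs if a == 1 and b > 0)  # f_{1+}
    f2p = sum(1 for a, b in pairs if a == 2 and b > 0)  # f_{2+}
    fp1 = sum(1 for a, b in pairs if a > 0 and b == 1)  # f_{+1}
    fp2 = sum(1 for a, b in pairs if a > 0 and b == 2)  # f_{+2}
    f11 = sum(1 for a, b in pairs if a == 1 and b == 1)
    f22 = sum(1 for a, b in pairs if a == 2 and b == 2)
    return (d12
            + _term(f1p, f2p, k1, 2)         # bound for E(f_{0+})
            + _term(fp1, fp2, k2, 2)         # bound for E(f_{+0})
            + _term(f11, f22, k1 * k2, 4))   # bound for E(f_{00})
```

Species seen in only one of the two samples do not enter any of the six counts; they affect the result only through the sample sizes n1 and n2.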
The method developed for abundance data can be directly adapted to deal with the replicated incidence case. All notation and model formulation are similar to those in Section 3.1. Assume that there are t1 samples randomly taken from community I and t2 samples from community II. In each sample, only presence/absence data are recorded. The two sets of probabilities (θ11, θ21, … , θS1,1) and (θ12, θ22, … , θS2,2) in the incidence case represent species detection probabilities in any sample from communities I and II, respectively.
Let Xi1 and Xi2 denote the number of samples in which the ith species is detected in communities I and II, respectively. Let Qjk denote the number of shared species that are detected in j samples in community I and k samples in community II. Similarly, we can define Qj+ and Q+k. By applying a method analogous to that in Section 3.1, it can be shown that the lower bound and the bias-corrected version for the number of shared species based on incidence counts have the same forms as in Equations (3.6) and (3.7), except that the sample sizes n1 and n2 should be replaced, respectively, by t1 and t2, and abundance counts replaced by incidence counts. We remark that an approximate estimator of shared species richness was derived in Chao, Shen, and Hwang (2006) for both types of data based on the Laplace approximation formula, but that estimator cannot be theoretically verified to be a lower bound.
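The approach proposed in Section 3 has an obvious extension to the case of more than two communities. We first describe the derivation for three communities. Extension to more than three communities is direct. Here a “shared” species is defined as a species that belongs to all communities. Assume that there are S123 species shared by three communities I, II, and III and a random sample is taken from each of the three communities. The three samples are called samples I, II, and III with sizes n1, n2, and n3, respectively. Let D123 denote the observed shared species richness in the three samples. Then

$$S_{123} = D_{123} + f_{++0} + f_{+0+} + f_{0++} + f_{+00} + f_{0+0} + f_{00+} + f_{000}, \tag{4.1}$$

where f++0 denotes the number of shared species that are observed in samples I and II but not in sample III, f000 denotes the number of shared species that are undetected in all three samples, and a similar interpretation holds for the other terms in Equation (4.1).

Applying the Cauchy–Schwarz argument of Section 3 to the expected value of each of the seven unobservable terms, for example $E(f_{++0}) \ge K_3 [E(f_{++1})]^{2}/[2E(f_{++2})]$ and $E(f_{000}) \ge K_1 K_2 K_3 [E(f_{111})]^{2}/[8E(f_{222})]$ with $K_j = (n_j - 1)/n_j$, and combining the resulting bounds, we have a lower bound for S123 as follows:

$$\hat{S}_{123} = D_{123} + K_1 \frac{f_{1++}^{2}}{2 f_{2++}} + K_2 \frac{f_{+1+}^{2}}{2 f_{+2+}} + K_3 \frac{f_{++1}^{2}}{2 f_{++2}} + K_1 K_2 \frac{f_{11+}^{2}}{4 f_{22+}} + K_1 K_3 \frac{f_{1+1}^{2}}{4 f_{2+2}} + K_2 K_3 \frac{f_{+11}^{2}}{4 f_{+22}} + K_1 K_2 K_3 \frac{f_{111}^{2}}{8 f_{222}}. \tag{4.2}$$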
Thus, we have provided a unified approach to formulating lower bounds for any number of communities. However, the variance estimator becomes quite complicated as the number of communities grows. Currently, we have variance estimators only up to five communities.
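The pattern behind the multi-community bound can be written generically. In the sketch below, every nonempty subset U of the N communities contributes one term: the product of Kj over j in U, times the square of the number of shared species seen exactly once in every sample of U (and at least once elsewhere), divided by 2^|U| times the corresponding count of species seen exactly twice. This reading of the bound is an assumption based on applying the Cauchy–Schwarz argument term by term; the function name `shared_lower_bound` and the choice to raise an error on a zero denominator are illustrative.

```python
from itertools import combinations


def shared_lower_bound(samples):
    """Lower bound for the number of species shared by N communities.

    samples: list of N equal-length count lists aligned over a common
    species list; samples[j][i] is the count of species i in sample j.
    """
    n_comm = len(samples)
    k = [(sum(s) - 1) / sum(s) for s in samples]    # K_j = (n_j - 1)/n_j
    species = list(zip(*samples))                   # one tuple per species
    bound = sum(1 for xs in species if all(x > 0 for x in xs))  # observed shared
    for r in range(1, n_comm + 1):
        for subset in combinations(range(n_comm), r):
            rest = [j for j in range(n_comm) if j not in subset]

            def count(level):
                # species seen `level` times in each sample of `subset`
                # and at least once in every remaining sample
                return sum(1 for xs in species
                           if all(xs[j] == level for j in subset)
                           and all(xs[j] > 0 for j in rest))

            ones, twos = count(1), count(2)
            if twos == 0:
                raise ValueError("zero denominator count for subset %r" % (subset,))
            factor = 1.0
            for j in subset:
                factor *= k[j]
            bound += factor * ones ** 2 / (2 ** r * twos)
    return bound
```

For N communities the loop visits 2^N − 1 subsets, so the point estimate stays cheap for small N even though the variance expressions grow quickly.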
Based on Equations (4.2) and (4.3), similar lower bounds for replicated incidence data can be obtained by replacing frequency counts by incidence counts and each sample size by the number of samples. Variance estimators are derived in an analogous way.
A total of 51 soil samples were taken from three areas of Namibia; see Table 1 for a description of relevant data information. Generally, collections were made from a variety of soil and vegetation types of the respective area. About 10 small subsamples were taken from an area of about 100 m2 and mixed to a composite soil sample. In each soil sample, presence/absence of ciliate species was recorded. Species were determined by combining live observation, silver impregnation, and scanning electron microscopy. Detailed sampling locations, procedures, and species identification were described in Foissner (1999, 2006) and Foissner, Agatha, and Berger (2002). After presence/absence of soil ciliate species was recorded for each sample, the replicated incidence data were merged by species identity and a total of 331 species were recorded in our data. All data in EXCEL spreadsheets are available from the authors upon request.
We illustrate one-community species richness estimation for each area and shared species richness estimation for any two areas (three combinations) and for all three areas; see Table 2. We provide for each case a lower bound for species richness or shared species richness in Table 2. The communities considered in our applications are highly heterogeneous and thus we adopt the original form of estimators. That is, our estimates are calculated from Equations (2.4), (3.6), and (4.2) with sample size there being replaced by the number of samples and frequency counts being replaced by incidence counts. The bias-corrected formulas which are derived under a homogeneous case are not reported here. For each estimate, its associated SE (standard error) as well as the 95% confidence interval based on a log-transformation are also shown in Table 2. The percentage of undetected shared species with respect to the estimated minimum is given in the last column.
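The text does not spell out how the log-transformed interval is built. One common construction, assumed here rather than taken from this article, treats the estimated number of undetected species (estimate minus observed count) as approximately log-normal, which keeps the lower limit above the observed count. The function name and arguments are illustrative, and the numbers in the usage line are hypothetical, not values from Table 2.

```python
import math


def log_transform_ci(estimate, observed, se, z=1.96):
    """Approximate confidence interval whose lower limit exceeds `observed`.

    Assumption: (estimate - observed) is treated as log-normal, giving
    [observed + T/R, observed + T*R] with T = estimate - observed and
    R = exp(z * sqrt(log(1 + (se/T)^2))).
    """
    t = estimate - observed
    if t <= 0:
        # no estimated undetected species: the interval collapses
        return (float(observed), float(observed))
    r = math.exp(z * math.sqrt(math.log(1 + (se / t) ** 2)))
    return (observed + t / r, observed + t * r)


# hypothetical inputs for illustration only
lower, upper = log_transform_ci(estimate=120.0, observed=100, se=10.0)
```

Unlike a symmetric interval, this one cannot fall below the number of species actually observed, which is the natural floor for a richness estimate.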
All estimates indicate that there is still a substantial fraction of undetected species and shared species in the current data. For the alpha diversity, on average, about 41% of species diversity is still undetected. This is consistent with the finding in Chao et al. (2006). For estimating shared species richness, the observed number of shared species substantially underestimates the true number of shared species. Our approach reveals the extent of underestimation by the observed number of shared species (on average 42% for any two areas and 48% for three areas) and provides helpful information for understanding community overlap of microorganisms.
Since our data analysis was based on replicated incidence data, we carried out a simulation study to investigate the performance of the proposed lower bounds for such kinds of data. We examined the shared species richness estimation for two and three communities. In each community, five types of species detection probabilities were considered: one homogeneous and four heterogeneous communities with 200 species in each. The five sets of species detection probabilities along with their average ($\bar{\theta}$) and coefficient of variation (CV) are given as follows:
Type I denotes a homogeneous case, i.e., all species have the same probability of being detected in any sample; Types II and III assume that the probabilities represent a random sample, respectively, from a uniform or a beta density. (That is, for each simulation trial, we generated a sample of size 200 as species probabilities.) Types IV and V are in the form of a truncated logarithmic series, which is widely used in modeling natural frequency data. It is also called Zipf’s law in linguistics and behavioral sciences. The value of CV characterizes the degree of heterogeneity among detection probabilities.
We considered all 15 possible combination cases of any two communities: (I versus I), (I versus II), … , (V versus V) as our target communities. We assume that the first 120 species are the shared species. Thus S1 = S2 = 200 and S12 = 120. Table 3 presents the simulation results for the case of two communities.
For three communities, we considered 35 possible combination cases of any three communities: (I, I, I), (I, I, II), … , (V, V, V). We assumed that S1 = S2 = S3 = 200, S12 = S13 = S23 = 120, and S123 = 80. The overlap structure is described as follows: (a) the first 80 species in each community are shared by all three communities, (b) the last 40 species in each community are unique species, and (c) the species shared by communities I and II are the 81 ~ 120th species in community I and the 81 ~ 120th species in community II; the species shared by communities I and III are the 121 ~ 160th species in community I and the 81 ~ 120th species in community III; and the species shared by communities II and III are the 121 ~ 160th species in community II and the 121 ~ 160th species in community III. Table 4 presents the simulation results for the case of three communities.
For any fixed combination of communities, we generated 20 replicated incidence samples from each community according to a specified type of detection probabilities. Then for each generated dataset, the observed number of shared species was recorded; the original lower bound and the bias-corrected version, as well as their SE estimates and associated 95% confidence intervals, were obtained. The resulting averages in Tables 3 and 4 were based on 2000 simulation trials. The percentage of the 2000 simulated datasets in which the 95% confidence interval covered the true parameter was recorded and is shown in the last column of each table.
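A stripped-down version of one simulation cell can be sketched as follows. It covers only the two-community homogeneous case, with a hypothetical detection probability `p = 0.1` (the actual probability sets are not reproduced here) and fewer trials than the 2000 used in the article. It compares the average observed number of shared species with the average lower bound. Non-shared species are omitted because, for incidence data, they affect neither the shared counts nor the factor (t − 1)/t. The zero-denominator fallback is an assumption, as is every function name.

```python
import random


def detections(prob, t, rng):
    """Number of the t samples in which a species is detected."""
    return sum(1 for _ in range(t) if rng.random() < prob)


def term(ones, twos, factor, divisor):
    if twos > 0:
        return factor * ones ** 2 / (divisor * twos)
    # assumption: bias-corrected form when the denominator count is zero
    return factor * ones * (ones - 1) / divisor


def shared_bound_incidence(y1, y2, t1, t2):
    """Two-community lower bound with sample sizes replaced by t1, t2."""
    k1, k2 = (t1 - 1) / t1, (t2 - 1) / t2
    pairs = list(zip(y1, y2))
    d12 = sum(1 for a, b in pairs if a > 0 and b > 0)
    q1p = sum(1 for a, b in pairs if a == 1 and b > 0)
    q2p = sum(1 for a, b in pairs if a == 2 and b > 0)
    qp1 = sum(1 for a, b in pairs if a > 0 and b == 1)
    qp2 = sum(1 for a, b in pairs if a > 0 and b == 2)
    q11 = sum(1 for a, b in pairs if a == 1 and b == 1)
    q22 = sum(1 for a, b in pairs if a == 2 and b == 2)
    bound = (d12 + term(q1p, q2p, k1, 2) + term(qp1, qp2, k2, 2)
             + term(q11, q22, k1 * k2, 4))
    return d12, bound


def simulate(trials=200, s12=120, t=20, p=0.1, seed=1):
    """Average observed shared richness and average lower bound."""
    rng = random.Random(seed)
    obs_sum = bound_sum = 0.0
    for _ in range(trials):
        y1 = [detections(p, t, rng) for _ in range(s12)]
        y2 = [detections(p, t, rng) for _ in range(s12)]
        d12, bound = shared_bound_incidence(y1, y2, t, t)
        obs_sum += d12
        bound_sum += bound
    return obs_sum / trials, bound_sum / trials
```

Calling `simulate()` returns the pair (mean observed shared species, mean lower bound); heterogeneous cases would replace the constant `p` with a per-species list of probabilities.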
From the two tables, the traditional approach of using the observed number of shared species as an estimator of shared species richness is clearly not appropriate. The observed number of shared species exhibits severe negative bias in all cases. When at least one set of detection probabilities is Type V (low average probability and high heterogeneity), the bias is especially substantial.
The performance of the lower bounds as estimators of shared species richness improves when more shared information is available. The magnitudes of bias, sample SE, and sample RMSE decrease as more shared species are observed. The bias-corrected bound is always lower than the original bound, but these two bounds are generally comparable with respect to RMSE. In terms of bias, the bias-corrected bound is useful when all communities are homogeneous, as in the case (I versus I) in Table 3, or when at least two communities are homogeneous, as in the three cases (I, I, I), (I, I, IV), and (I, I, V) in Table 4. This is expected because the bias-corrected form is derived under a homogeneous condition. Thus, except in the special case that most communities are homogeneous, we suggest using the original lower bound. Since the CV of species detection probabilities measures the degree of heterogeneity, a CV estimator can be used to quantify the degree of heterogeneity present in data; see Chao et al. (2000).
When a sufficient amount of shared information is available (say, at least 70% of the shared species are observed), the lower bound in most cases is close to the true parameter. Thus it can be used as an estimator of shared species richness. When there are not sufficient shared data, our approach only provides a reliable lower bound. The magnitude of downwards bias mainly depends on the average and CV of the detection probabilities as well as the number of replicated samples. Further work is needed to determine more sophisticated guidelines about how large the samples should be to provide sufficient shared information.
Simulations also show that the estimated standard errors using the asymptotic method, although biased slightly downwards, are generally satisfactory when compared with the sample standard errors. The confidence interval based on the estimated SE for the original estimator performs reasonably well as most coverage probabilities are close to the anticipated nominal confidence coefficient of 95%.
Using the Cauchy–Schwarz inequality for the expected frequency counts based on abundance or replicated incidence data, we have developed a simple and useful lower bound for the number of species shared by multiple communities. The proposed lower bounds for abundance and replicated incidence data are natural extensions of the previous estimators used for a single community. Simulation results have shown that the performance of the lower bounds under several types of abundance distributions is generally satisfactory. The estimators discussed in this article will be featured in Program SPADE (Species Prediction And Diversity Estimation) following publication of this article (Chao and Shen 2003).
For estimating species richness in one community, we have discussed in Section 2 that our lower bound for the undetected species is in terms of the number of singletons and doubletons (for abundance data) or of uniques and duplicates (for replicated incidence data). A similar advantage holds for the case of two communities. For example, Equation (3.6) implies that the estimated number of undetected shared species for abundance data requires only the frequencies f1+, f+1, f2+, f+2, f11, and f22. As a result, having the exact species frequency is not necessary for species that have at least three individuals in either of the two communities. Parallel conclusions are also valid for replicated incidence data and for more than two communities.
One critical assumption of our sampling model for abundance data is that individuals are randomly selected with replacement from each of the target communities. Under this assumption, the species frequencies follow a multinomial distribution. However, in the case of sampling without replacement, the corresponding distribution becomes a generalized hypergeometric distribution, which is less mathematically tractable. Besides, the sampling fraction (i.e., the ratio of sample size to total population size) should be considered in the model framework. Research on sampling without replacement is ongoing. Also, for multiple incidence data, one restrictive assumption is that the species detection probability, although allowed to vary among species, is constant across all samples. This assumption may not be satisfied if samples are taken from areas where species occurrences are spatially aggregated.
In our proposed lower bounds, we did not consider relevant covariate information such as distance between communities and habitat types. Hillebrand et al. (2001) and Green et al. (2004) used species overlap information to assess the similarity of microbes as a function of geographic distance. These authors discovered the distance-decay relationship for microbial assemblages. Thus, communities that are similar (geographically close and of similar habitat) would generally have more overlap than those farther apart. How to incorporate covariate information in the estimation of shared species richness merits more research.
Boulinier et al. (1998) pointed out a simple analogy between the species replicated incidence data in a community and capture-recapture studies of a closed population. Thus, the estimation of species richness in a community based on replicated incidence data is equivalent to the estimation of the size of a population based on capture-recapture data. The analogy can be extended to the general case of multiple communities. That is, the estimation of shared species richness based on multiple incidence data from each community is equivalent to the estimation of the size of a shared population based on capture-recapture data from each population. Consequently, the proposed methodology for replicated incidence data can be directly applied to estimate the size of a shared population. This application and relevant topics are currently under investigation; see Chao, Pan, and Chiang (2008).
This work was supported by the Taiwan National Science Council (Project 96-2118-M007-001) to HYP and AC, and by the Austrian Science Foundation (FWF project P19699-B17) to WF. Part of the material is based on the Ph.D. work of the first author under the supervision of the second author. The authors thank the Editor, Associate Editor, and two reviewers for carefully reading the manuscript and providing very thoughtful comments and suggestions, which significantly improved the article.