The number of species detected in a sample, or the number of organisms discerned at any given phylogenetic level, is strongly affected by the number of sequences analyzed (Schloss and Handelsman, 2005). Estimates of OTUs increase with the number of sequences, and a plot of OTUs versus the number of sequences yields a rarefaction curve that approaches a maximum. To estimate the true maximum value at any phylogenetic level, it is necessary either to model and extrapolate from the rarefaction curve or to use nonparametric methods that estimate the true OTU richness by taking into account the population structure. The nonparametric methods provide estimates that also vary with sample size, so their approach to a richness maximum must also be modeled and extrapolated. The results obtained from three methods are presented and they converge.
The effect of the sequencing effort on the estimation of the number of OTUs.
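As an illustration of the rarefaction curve described above, the following minimal sketch computes the expected number of OTUs in a random subsample drawn without replacement, using the standard analytical rarefaction formula. It is not the software used in this study; the function name `rarefaction_curve` and the example counts are hypothetical.

```python
from math import comb


def rarefaction_curve(otu_counts, depths):
    """Expected number of OTUs seen at each subsampling depth.

    otu_counts: number of sequences assigned to each OTU.
    depths: subsample sizes (each between 0 and the total sequence count).
    Uses E[S_n] = sum_i (1 - C(N - N_i, n) / C(N, n)).
    """
    total = sum(otu_counts)
    curve = []
    for n in depths:
        if not 0 <= n <= total:
            raise ValueError("depth must lie between 0 and the total count")
        all_draws = comb(total, n)
        # For each OTU, 1 minus the probability that it is missed entirely.
        expected = sum(1 - comb(total - c, n) / all_draws for c in otu_counts)
        curve.append(expected)
    return curve


# Hypothetical counts for a small sample, not data from this study.
example_counts = [50, 30, 10, 5, 2, 1, 1, 1]
for depth, otus in zip([10, 50, 100], rarefaction_curve(example_counts, [10, 50, 100])):
    print(depth, round(otus, 2))
```

Plotting the returned values against the depths gives the rarefaction curve; the flatter the curve at the full sequencing depth, the closer the sample is to the richness maximum.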
To estimate the ability of the richness estimators used here to predict the number of species and genera in a sample, a collection of 2702 known 16S rRNA genes from 2410 species and 685 genera of bacteria from the Ribosomal Database Project II was aligned using the same section of the gene sequenced from the soil samples. At 0% dissimilarity, the nonparametric ACE and Chao1 estimators overestimated the number of species, while the rarefaction estimator underestimated it. At 5% dissimilarity, the rarefaction estimator accurately predicted the number of genera.
Ability of three richness estimators to predict number of species in a sample
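To make the nonparametric approach concrete, the sketch below implements the bias-corrected form of the Chao1 estimator, which adds to the observed richness a term based on the numbers of singleton and doubleton OTUs. This is a generic textbook form offered as an assumption; the text does not state which variant was used, and ACE is not covered here. The function name `chao1` is hypothetical.

```python
from collections import Counter


def chao1(otu_counts):
    """Bias-corrected Chao1 richness estimate.

    S_chao1 = S_obs + F1 * (F1 - 1) / (2 * (F2 + 1)),
    where F1 and F2 are the numbers of OTUs seen exactly once and twice.
    """
    observed = [c for c in otu_counts if c > 0]
    s_obs = len(observed)
    frequency = Counter(observed)
    f1, f2 = frequency[1], frequency[2]
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


# Hypothetical counts: many rare OTUs push the estimate above S_obs.
print(chao1([40, 12, 5, 2, 2, 1, 1, 1, 1]))
```

Because the correction term depends only on the rarest OTUs, the estimate itself changes with sequencing depth, which is why its approach to a maximum also has to be modeled.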
Small subunit rRNA gene fragments of 800 bp were amplified from DNA isolated from each of the soils, using an amount of DNA equivalent to that found in a gram of soil. This fragment size is optimal for 454 pyrosequencing library construction. Over 149 000 sequences with an average length of 103 nucleotides were obtained. The number of OTUs present in each sample was determined after defining an OTU at five levels of phylogenetic resolution, or sequence similarity (Figures and ). At the highest level of resolution (0% dissimilarity), the maximum number of OTUs in any one soil with any of the three estimators used was just under 52 000. Data for 0%, 3%, 5%, 10% and 20% dissimilarity are presented so that the reader can choose a level of discrimination of interest (Table S2).
Rarefaction curves depicting the effect of % dissimilarity on the number of OTUs identified. Note the comparatively high species richness of the agricultural samples, whereas the Canadian forest soil has very high phylum richness.
Estimated number of OTUs for each sample using parametric (rarefaction) and nonparametric estimators (Chao1 and ACE) compared to the observed OTUs resolved from the sequences.
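The idea of defining an OTU at a given % dissimilarity can be sketched as follows. This example assumes pre-aligned sequences of equal length, measures dissimilarity as the fraction of mismatched positions, and uses a simple greedy assignment to the first OTU representative within the threshold. The clustering algorithm is an assumption chosen for brevity, not the procedure used in this study, and the names `dissimilarity` and `greedy_otus` are hypothetical.

```python
def dissimilarity(a, b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned, non-empty and equal in length")
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches / len(a)


def greedy_otus(sequences, threshold):
    """Group sequences into OTUs at the given dissimilarity threshold.

    Each sequence joins the first OTU whose representative (its founding
    sequence) is within the threshold; otherwise it founds a new OTU.
    """
    representatives = []
    members = []
    for seq in sequences:
        for i, rep in enumerate(representatives):
            if dissimilarity(seq, rep) <= threshold:
                members[i].append(seq)
                break
        else:
            representatives.append(seq)
            members.append([seq])
    return members


# Toy aligned reads (hypothetical): the OTU count falls as the threshold rises.
reads = ["AAAAAAAAAA", "AAAAAAAAAT", "AAAAATTTTT", "TTTTTTTTTT"]
for level in (0.0, 0.15, 0.5):
    print(level, len(greedy_otus(reads, level)))
```

The per-OTU sizes (`len(group)` for each group) are the counts that rarefaction and the nonparametric estimators take as input, so repeating the clustering at each threshold yields one richness estimate per phylogenetic level.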
The total number of OTUs obtained and the maximum number estimated by three different methods are shown for five different phylogenetic levels (Tables S2). A supplementary table includes the standard deviations about these estimates. One of the estimates is parametric and based on rarefaction curves, while the other two, Chao1 and ACE, are nonparametric (Schloss and Handelsman, 2006). The estimates improve as the definition of an OTU declines in resolution from species, through genera, to something approaching the level of phyla. These data suggest that the original DNA reassociation estimates of 2000–10 000 species per gram were underestimates (Schloss and Handelsman, 2006; Torsvik et al., 1990), but these estimates do not approach the higher maximum proposed recently (Gans et al., 2005).
Using these OTU richness estimates, we calculated the number of sequences required to reach 90% and 95% of the maximum number of OTUs at each % dissimilarity (Table S3). As a result, we disagree with the statement that ‘the survey size required for accurate analysis of soil communities is impractically large’ (Gans et al., 2005). In fact, with the latest improvement in throughput in 454 pyrosequencing, this survey can be completed in less than 1 day of operation of the Roche Genome Sequencer FLX system. The maximum number of sequences required to identify 90% of the 52 000 OTUs is less than 713 000. Thus, using the methods here, 95% of all OTUs can be identified with over 10-fold fewer sequences than the number of species suggested recently (Gans et al., 2005). Further, we present a number of new observations, beyond the estimates of OTU richness.
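The calculation of the sequencing effort needed to reach a given fraction of the richness maximum can be illustrated with a saturating curve fitted to rarefaction data. The sketch below assumes a Michaelis-Menten form, S(n) = S_max · n / (K + n), fitted through its double-reciprocal linearization; the text does not state which model was used, so both the model and the names `fit_saturation` and `sequences_for_fraction` are assumptions.

```python
def fit_saturation(depths, otus):
    """Fit S(n) = s_max * n / (k + n) to rarefaction points.

    Uses ordinary least squares on the linearized form
    1/S = 1/s_max + (k / s_max) * (1/n). Returns (s_max, k).
    """
    xs = [1 / n for n in depths]
    ys = [1 / s for s in otus]
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    s_max = 1 / intercept
    k = slope * s_max
    return s_max, k


def sequences_for_fraction(k, fraction):
    """Depth n at which S(n) equals fraction * s_max: n = k * f / (1 - f)."""
    if not 0 < fraction < 1:
        raise ValueError("fraction must lie strictly between 0 and 1")
    return k * fraction / (1 - fraction)


# Hypothetical rarefaction points, not values from this study.
depths = [1000, 5000, 10000, 50000]
otus = [900, 3300, 5000, 8300]
s_max, k = fit_saturation(depths, otus)
print(round(s_max), round(sequences_for_fraction(k, 0.90)), round(sequences_for_fraction(k, 0.95)))
```

Under this model the effort needed for 95% of the maximum is 19K, against 9K for 90%, so the final few percent of OTUs cost roughly as much sequencing as everything before them. The double-reciprocal fit also weights shallow depths heavily, and a nonlinear fit would be preferable for real data.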
With over 40% of the total bacterial sequences, the Proteobacteria represented the dominant phylum in each soil. The Betaproteobacteria were the dominant class among the Proteobacteria in all soils except Brazil, where the Gammaproteobacteria were dominant. With 15%–25% of the bacterial sequences in each sample, the second most abundant phylum in all four soils was the Bacteroidetes. Other prominent phyla were the Acidobacteria, the Actinobacteria and the Firmicutes. Among the other phyla, the Gemmatimonadetes represented 3.5% of the sequences in the Canadian sample, while in Florida and Brazil the Nitrospira represented 3% and 2% of total sequences, respectively. In Illinois and Canada, the Verrucomicrobia were represented by more than 2% of the sequences. In Illinois, nearly 4% of the sequences were in the TM7 phylum. Approximately 6%–12% of the sequences from each sample remained unclassified. Many unusual bacterial sequences were found that were at least 10% dissimilar to all previously described groups (Figure S1).
Figure 4 Relative abundance of phyla and proteobacterial classes for each soil library, in which 16S sequences were classified according to the nearest neighbor in the Greengenes database (http://greengenes.lbl.gov).
The primers used in this work also amplify the 16S rRNA gene from the Archaea (Figure S2). In the agricultural soils, the Crenarchaeota represented a substantial proportion, 4.6%–12.5%, of the total sequences. Of the Crenarchaeota observed in each of the agricultural soils, about one-third of the sequences are closely related to the ammonia-oxidizing archaea. While 16S rRNA gene sequence alone cannot describe the physiology of an organism, it is not unreasonable to expect a high number of ammonia-oxidizing archaea in these soils given recent work (Treusch et al., 2005; Leininger et al., 2006). The surprising result was the relative dearth of crenarchaeotal sequences in the Canadian sample, with just five of 53 251 sequences being in this group. These five sequences were all related to the ammonia-oxidizing Archaea, and their low number may be attributable to the low pH of this soil. However, the Brazilian soil had an even lower pH, and yet over 4% of the total sequences from Brazil were crenarchaeotal. Further work is needed to confirm this observation of a low proportion of crenarchaeotal sequences in forest soils. One possible cause is that nitrogen is more limiting in forest soils than in fertilized agricultural soils. The low number of crenarchaeotal sequences in Canada is not compensated for by ammonia-oxidizing bacterial sequences: very few sequences from the genera responsible for ammonia oxidation in bacteria were found in each soil.
Number of sequences classified to be within the four ammonia oxidizing bacterial genera
The patterns of the rarefaction curves for the Canadian forest sample are very different from those of the agricultural soils (Figures and ). In contrast to the situation with archaeal diversity, bacterial diversity is much higher in the forest sample than in the others. In the Canadian sample, OTU abundance is particularly high at high levels of dissimilarity, while in the other samples the diversity is low at high levels of dissimilarity. Conversely, at low levels of dissimilarity the OTU abundance of the Canadian sample is lower than that of the other samples. This result is interpreted as the boreal forest sample being very phylum rich while the agricultural samples are phylum poor.