DOTUR assigns sequences rapidly and systematically to OTUs by using all possible distances. In both clone libraries that we analyzed, DOTUR assigned sequences to OTUs more accurately and consistently than had previous methods. DOTUR also assists in assessing the completeness of a sequencing effort and the reliability of richness estimates.
Analysis with DOTUR indicates that it is not possible to state with confidence that the richness in the Amazonian library differs from that in the improved Scottish soil library. However, given the relative dearth of sequences in each of these libraries compared to the estimated species richness in 1 g of soil, which is expected to be in the thousands of species (30), further sequencing might indicate a difference in richness between the libraries. In addition, the application of methods described elsewhere may demonstrate that while the richness of these two libraries is similar, their phylogenetic composition is different (20). Finally, a connection between species richness or community composition and ecological mechanisms remains to be determined. It is possible that two communities could have considerably different membership yet conduct similar biological processes.
The inclusion of the Sargasso Sea metagenome sequence has provided an interesting application of DOTUR for describing richness, comparing species definitions used for genes with phylogenetic information, and evaluating the level of sampling necessary to have confidence in an estimate. Venter et al. (31) found 143 different 16S rRNA species and 428 different rpoB species, and we found 114 and 303 different 16S rRNA and rpoB species, respectively. This difference may be explained by the fact that they restricted their analysis to those sequences that overlapped by at least 40 bp, while we required all sequences to overlap by the same 300 bp. Regardless of this difference in method, they predicted a minimum of nearly 1,000 species by using rpoB sequences and we predicted a richness of 1,040 species by using their 6% difference species definition, suggesting that the methods produce comparable results.
Since DOTUR compares multiple OTU definitions simultaneously, we were able to compare various species-level definitions by using 16S rRNA and rpoB gene sequences. Assuming that a 3% difference in 16S rRNA sequence is a valid definition of a species, a protein coding sequence species definition would then be near 20%. We found similar results with protein coding sequences other than rpoB that have been used as phylogenetic anchors (data not shown). A value of 20% is more consistent than 6% with previous definitions of species using protein coding sequences. For example, a difference of 30% in DNA-DNA hybridization analysis is used to differentiate between species. Our use of DOTUR accounts for differences in the rate of evolution for these two genes. One potential concern is that any estimates made by using 16S rRNA sequences are inflated, since bacteria are known to have multiple copies of this gene in their genomes. Although it is predicted that most slow-growing bacteria that dominate the environment have, on average, close to 1 copy per genome (15), multiple copies from a single genome would have to be more than 3% different to have an effect on our analysis. If intragenomic variability were greater than 3%, the number of 16S rRNA OTUs would decrease, resulting in an even lower species definition for protein coding sequences. Any distance level that is selected to differentiate species will be arbitrary and consequently controversial, but it will serve as a useful benchmark for future analyses.
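The comparison of species definitions across genes can be sketched as a lookup: take the richness obtained for a reference gene at its accepted cutoff, then find the cutoff for a second gene whose richness is closest to that value. This is only an illustration of the reasoning, not a description of the procedure actually used. The function name `matching_cutoff` and every number in the example are hypothetical placeholders and are not taken from any of the libraries discussed here.

```python
def matching_cutoff(target_richness, richness_by_cutoff):
    """Return the distance cutoff whose richness is closest to the target.

    richness_by_cutoff maps a distance cutoff (e.g. 0.03 for 3%) to the
    richness obtained for one gene when OTUs are defined at that cutoff.
    """
    return min(
        richness_by_cutoff,
        key=lambda cutoff: abs(richness_by_cutoff[cutoff] - target_richness),
    )


# Hypothetical placeholder values, invented for illustration only.
reference_gene = {0.01: 90, 0.03: 60, 0.05: 40}
other_gene = {0.03: 150, 0.06: 120, 0.10: 65, 0.20: 30}

# Richness of the reference gene at its accepted species-level cutoff (3%).
target = reference_gene[0.03]
equivalent = matching_cutoff(target, other_gene)
```

With these placeholder values, `equivalent` is `0.10`, the cutoff at which the second gene's richness (65) lies closest to the reference value (60). Because richness is tabulated only at discrete cutoffs, the result is an approximation whose resolution depends on how finely the cutoffs are spaced.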
When the number of different OTUs observed is less than twice the square root of the total richness, the Chao1 richness estimator is strongly correlated with the sequencing effort. If we assume that there are roughly 4,000 species-level OTUs in a gram of soil (30) and 150 in a milliliter of seawater (7), then at least 125 and 17 different OTUs, respectively, would need to be sampled before the correlation between richness and sequencing effort begins to decrease. However, we do not know how many sequences are required to reach the condition in which there is no correlation between sequencing effort and richness. In soil samples, we demonstrated that 137 sequences were insufficient to estimate richness reliably when distances of 3% were used to define an OTU. Using the Sargasso Sea samples, which are thought to contain one-tenth the richness of soil (7), we found that a total of 690 sequences was almost sufficient to obtain an accurate estimate of species richness and was sufficient to estimate richness when a 20% difference was used to define an OTU. It is likely that at least 10,000 sequences would be necessary to approach an estimate of the true species richness in soil. To evaluate sampling progress, we suggest tracking the richness estimation collector's curve and sampling until there are no instantaneous 2.5% changes in richness over 300 sequences.
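The dependence of Chao1 on sequencing effort, and the suggested stopping rule, can be illustrated with a minimal sketch. It assumes the bias-corrected form of Chao1, S = S_obs + n1(n1 - 1) / (2(n2 + 1)), where n1 and n2 are the numbers of OTUs observed exactly once and exactly twice. It also assumes one particular reading of the stopping rule: no step between consecutive points of the collector's curve changes the estimate by more than 2.5% within the most recent 300 sequences. The function names, the `window` and `tolerance` parameters, and that reading of the rule are assumptions made for illustration, not part of DOTUR itself.

```python
from collections import Counter


def chao1(otu_labels):
    """Bias-corrected Chao1 estimate from one OTU label per sequence."""
    abundances = Counter(otu_labels)
    s_obs = len(abundances)
    n1 = sum(1 for count in abundances.values() if count == 1)  # singletons
    n2 = sum(1 for count in abundances.values() if count == 2)  # doubletons
    return s_obs + n1 * (n1 - 1) / (2 * (n2 + 1))


def collectors_curve(otu_labels):
    """Chao1 estimate after each successive sequence, in sampling order."""
    return [chao1(otu_labels[:i]) for i in range(1, len(otu_labels) + 1)]


def sampling_has_stabilized(curve, window=300, tolerance=0.025):
    """Assumed reading of the stopping rule.

    Returns True when none of the last `window` steps of the curve changed
    the estimate by more than `tolerance` (as a fraction of the previous
    value). Returns False if fewer than `window` steps are available.
    """
    if len(curve) <= window:
        return False
    recent = curve[-(window + 1):]
    for previous, current in zip(recent, recent[1:]):
        if abs(current - previous) / previous > tolerance:
            return False
    return True
```

When every observed OTU is a singleton, for example `chao1(list(range(10)))`, the estimate is 55.0: it grows roughly with the square of the number of OTUs observed, which is why the estimate tracks sequencing effort early in sampling. A stricter reading of the rule would require the estimate to stay within a 2.5% band across the whole window rather than at each step; that variant replaces the per-step comparison with a comparison of the window's minimum and maximum.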
When the Scottish and Amazonian soil sequences were reported, sequencing was quite expensive and laborious. With present technology, which is both less expensive and largely automated, we have the opportunity to generate and sequence large 16S rRNA gene libraries that may be sufficient in size to provide accurate estimates and comparisons of richness even in species-rich environments, such as soil.