Understanding the evolution of biological diversity remains a central goal for biologists. By far the greatest challenge lies in characterizing the diversity of prokaryotes. Molecular surveys have revealed a bewildering diversity of prokaryotes, perhaps many billions of species of which 99 per cent are unculturable (Acinas et al. 2004; Venter et al. 2004; Schloss & Handelsman 2006). Yet, prokaryotes harbour a wide diversity of biochemical processes and play a fundamental role in ecosystem function. Many problems of extreme importance to human society hinge on understanding prokaryotic diversity and how it will respond to change.
Enormous progress has been made in prokaryotic systematics over recent years. After ongoing debate over the definition of bacterial species, there is now growing consensus on the range of processes that can cause divergence and the origin of independently evolving units of bacterial diversity. In general, geographical isolation and ecological divergence can cause independent evolution (Cohan 2001; Barraclough et al. 2003), leading to discrete units of genetic and phenotypic variation qualitatively similar to species in sexual eukaryotes (reviewed in Hanage et al. 2006; Achtman & Wagner 2008). Specific models have been developed to investigate how bacterial diversity patterns are shaped by niche specialization and periodic selection (likely to be especially important when effective population sizes are large and recombination rates are low; Koeppel et al. 2008) and by reproductive isolation (defined as limits to free recombination among separate groups; Fraser et al. 2007).
In addition, the increasing availability of molecular data has provided new opportunities for testing these ideas. Multi-locus data are now used widely to delimit bacterial species. Whole genome sequencing is becoming practical on the scale required for species delimitation, i.e. multiple samples from each of several putative species. However, these approaches remain unfeasible for environmental surveys of unculturable bacteria. Environmental shotgun sequencing projects have used shotgun assembly to infer whole genomes (Venter et al. 2004), but robust assembly of individual genomes down to the level of different closely related clones within populations is likely to be difficult from such data. Therefore, broad-scale studies of unculturable prokaryotic diversity still rely on single-locus data, notably the 16S ribosomal RNA (rRNA) gene. The ribosomal database project (RDP) online database (Cole et al. 2007) contained over 700 000 sequences in December 2008. Relying on a single locus has limitations: variation in 16S rRNA might not reflect variation at other loci, especially if there is extensive horizontal gene transfer, and 16S rRNA cannot be used to reveal functional diversity among isolates directly. However, 16S rRNA remains the common currency for studying broad-scale diversity patterns.
The traditional approach for estimating bacterial diversity has been to use a threshold of sequence divergence: individuals separated by more than 3 per cent pairwise distance in 16S rRNA were treated as separate species (Schloss & Handelsman 2006). Recent work has recommended 1 per cent as more appropriate for the species level (Acinas et al. 2004). However, there are two problems with assuming a universal threshold. First, substitution rates vary among lineages, meaning that X per cent in one group may not reflect the same period of divergence as X per cent in another group. Second, levels of variation within and between species are expected to vary widely among lineages, for example owing to differences in effective population size or speciation rates (Achtman & Wagner 2008). An alternative is to test directly for genetic clusters indicative of independent evolution. Acinas et al. (2004) used a graphical approach to demonstrate the presence of genetic clusters in bacterial 16S rDNA from the ocean and varied the level of threshold to infer an appropriate cut-off for the delimitation of those clusters. However, statistical methods are needed that allow species limits to be optimized based on evolutionary predictions and that provide a framework for hypothesis testing and estimation of species richness.
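To make the threshold approach concrete, the following minimal sketch groups aligned sequences whenever their pairwise distance does not exceed a chosen cut-off, and shows how the number of units changes between 3 and 1 per cent. It is an illustration only, not the procedure used in the studies cited above: the uncorrected p-distance, the single-linkage grouping rule, the function names `p_distance` and `threshold_clusters`, and the toy alignment are all assumptions introduced here.

```python
from __future__ import annotations

from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Uncorrected proportion of differing sites between two aligned
    sequences; columns with a gap ('-') in either sequence are skipped."""
    sites = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not sites:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in sites) / len(sites)


def threshold_clusters(seqs: dict[str, str], threshold: float) -> list[set[str]]:
    """Group sequences by single linkage (an assumed rule): two sequences
    share a cluster if a chain of pairs, each within `threshold`, joins them."""
    parent = {name: name for name in seqs}

    def find(x: str) -> str:
        # Follow parent links to the cluster representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(seqs, 2):
        if p_distance(seqs[a], seqs[b]) <= threshold:
            parent[find(a)] = find(b)

    clusters: dict[str, set[str]] = {}
    for name in seqs:
        clusters.setdefault(find(name), set()).add(name)
    return list(clusters.values())


# Hypothetical toy alignment of 100 sites (not real 16S data).
toy_alignment = {
    "s1": "A" * 100,
    "s2": "CC" + "A" * 98,       # 2 per cent from s1
    "s3": "G" * 10 + "A" * 90,   # 10 per cent from s1 and from s2
}

for cutoff in (0.03, 0.01):
    n_units = len(threshold_clusters(toy_alignment, cutoff))
    print(f"cut-off {cutoff:.0%}: {n_units} cluster(s)")
```

In this toy case the 3 per cent cut-off merges s1 and s2 into one unit (two clusters in total), whereas the 1 per cent cut-off separates all three sequences. The estimate of diversity therefore depends directly on the chosen threshold, which is the sensitivity that motivates a model-based alternative.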
Recent studies have developed evolutionary models for bacterial species delimitation, but they have either focused on multi-locus data (Fraser et al. 2007), required additional ecological data (Hunt et al. 2008), or are impractical for analysing thousands of sequences (Koeppel et al. 2008, electronic supplementary material). These limitations make them unsuited for the extreme case considered here: only single-locus data available for large taxonomic samples. Here, we apply a new method, previously applied only to a few eukaryote clades (Pons et al. 2006; Fontaneto et al. 2007), to the delimitation of 16S rDNA sequences from the RDP database. The method considers a general evolutionary model of the effects of independent evolution, whatever its causes, on the shape of a gene tree for a single locus. It requires no information except a gene tree, making it particularly suitable for 16S rDNA data from broad surveys. We use the model to test whether significant genetic clusters are present in 16S rDNA from the RDP database and, if so, whether estimates of numbers of evolutionarily significant units (ESUs) match those using threshold values of 3 or 1 per cent.