Inability to accurately determine which sequences are present in a sample, and hence the abundances of rare taxa, greatly inhibits our ability to infer important ecological parameters such as rank-abundance curves. Ironically, the portion of the rank-abundance curve that can be inferred, i.e. that of the common taxa, provides a solution to the conundrum of the expense of denoising. Empirical rank-abundance curves, especially from human-associated samples, tend to be dominated by a relatively small number of abundant taxa. Given this feature of actual microbial communities, performing all-on-all comparisons for clustering is exceedingly inefficient: instead, a subset of reads suffices to identify the common OTUs, and reads matching those OTUs can then be iteratively removed by recruitment to an existing cluster. Consequently, we can rapidly determine the OTUs that are most likely to be abundant, concentrate initially on comparing reads to the small number of abundant OTUs (removing matches from the analysis), and then cluster only the leftover reads representing more divergent sequences.
We can thus reduce the total number of sequence comparisons using empirical features of the abundance distribution of real datasets as follows. First, we apply a fast pre-filter that removes reads that are strict prefixes of other reads, and compute an initial sequence distribution. We then sort the prefix clusters in descending order of abundance, and use this initial distribution to cluster similar reads, comparing each additional unclustered read to the most abundant clusters first because we expect the abundant clusters to yield a larger number of erroneous near-matching reads due to their numerical dominance alone. For a more detailed description of the algorithm, see Supplementary Methods. A similar method of pre-clustering on the sequence level and subsequent sequence clustering along the abundance distribution has been proposed recently11.
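The two steps described above (prefix pre-filtering, then abundance-ordered greedy clustering) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the method described here operates on flowgrams, whereas the sketch works on plain nucleotide strings. The function names, the naive mismatch-fraction distance, and the `threshold` value are all assumptions chosen for illustration only.

```python
from collections import Counter


def prefix_dereplicate(reads):
    """Collapse each read that is a prefix of a longer (or identical) read.

    Returns a Counter mapping each retained representative read to the
    number of reads it absorbed. Naive O(n * k) scan; a trie would be
    faster but is omitted to keep the sketch short.
    """
    counts = Counter()
    representatives = []
    # Longest reads first: a read can only be a prefix of a read already kept.
    for read in sorted(reads, key=len, reverse=True):
        for rep in representatives:
            if rep.startswith(read):
                counts[rep] += 1
                break
        else:
            representatives.append(read)
            counts[read] = 1
    return counts


def mismatch_fraction(a, b):
    """Hypothetical stand-in distance: mismatches over the shared length."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x != y for x, y in zip(a, b)) / n


def greedy_cluster(counts, threshold=0.15):
    """Greedy clustering along the abundance distribution.

    Representatives are visited in descending abundance, and each one is
    compared against existing clusters in the order they were created,
    i.e. against the most abundant clusters first.
    """
    clusters = []  # each entry is [centroid, total_reads]
    for seq, n in counts.most_common():
        for cluster in clusters:
            if mismatch_fraction(seq, cluster[0]) <= threshold:
                cluster[1] += n  # recruit to the existing cluster
                break
        else:
            clusters.append([seq, n])  # start a new cluster
    return [(centroid, size) for centroid, size in clusters]


# Toy input (invented for illustration, not data from the study).
toy_reads = (
    ["ACGTACGTAC"] * 5      # abundant sequence
    + ["ACGTACG"] * 2       # strict prefixes of the abundant sequence
    + ["ACGTACGTAT"]        # one-mismatch variant, e.g. a sequencing error
    + ["TTTTGGGGCC"] * 2    # a divergent sequence
)
toy_clusters = greedy_cluster(prefix_dereplicate(toy_reads))
```

Because most reads are recruited by the first few abundant clusters, the inner loop usually exits early, which is where the saving over all-on-all comparison comes from in this sketch.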
The method introduced here is a major improvement over previous flowgram-based denoising routines10 in terms of compute resources, yet retains the advantage that singletons are not discarded entirely, allowing exploration of the rare biosphere12. Previously, a mid-size 24-core cluster was needed to analyze a small dataset of around 40,000 sequences in around 10 hours. Our method allows the same dataset to be denoised in less than an hour on a single laptop computer (Table S1). We can also denoise full 454 runs with 500,000 sequences on a mid-size cluster in 1 day. We can thus address questions in community ecology that were previously intractable.
Applying these new methods to the most comprehensive survey of human-associated body habitats yet performed4, we find that denoising produces a substantial decrease in the diversity both at the OTU level and in terms of the phylogenetic diversity (the total branch length associated with each sample on a phylogenetic tree14). However, the results from the non-denoised (but filtered) and denoised data are highly correlated (r² = 0.97, P < 10⁻³⁰⁰ for phylogenetic diversity), suggesting that relative results concerning diversity within each sample are robust to the types of errors introduced by pyrosequencing. Interestingly, in spite of this high correlation, denoising changes the relative order of OTU richness of individual body habitats. Although the gut exhibits the highest OTU richness without denoising, it falls back into the middle ranks after denoising. This holds true for both Chao1 estimates and the phylogenetic diversity. The drastic reduction after denoising might be an effect of the sequence composition of the dominant OTUs in the gut (see Supplementary Methods for a more detailed discussion).
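To see why richness estimates such as Chao1 are sensitive to sequencing noise, consider the bias-corrected Chao1 estimator, S_obs + F1(F1 − 1) / (2(F2 + 1)), where F1 and F2 are the numbers of singleton and doubleton OTUs. The sketch below is a minimal illustration; the function name and the toy counts are invented for this example and are not taken from the Body Habitat data.

```python
def chao1(otu_counts):
    """Bias-corrected Chao1 richness estimate from per-OTU read counts."""
    observed = sum(1 for c in otu_counts if c > 0)
    singletons = sum(1 for c in otu_counts if c == 1)
    doubletons = sum(1 for c in otu_counts if c == 2)
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))


# Toy community (illustrative only): six OTUs, no singletons.
clean_counts = [500, 300, 150, 40, 8, 2]

# The same community plus 20 spurious singleton OTUs, standing in for
# erroneous reads that were not merged into their parent OTU.
noisy_counts = clean_counts + [1] * 20

clean_estimate = chao1(clean_counts)   # 6 + 0 = 6.0
noisy_estimate = chao1(noisy_counts)   # 26 + 20*19/(2*2) = 121.0
```

Because the estimator grows roughly with the square of the singleton count, a modest number of error-derived singletons can dominate the estimate, which is consistent with the large reductions in estimated richness reported after denoising.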
Figure 1. Comparisons of non-denoised data (a–c) to denoised data (d–f) for alpha diversity for the Body Habitat study, and comparisons of beta diversity (g–h). Rarefaction plots of the “Body Habitat” study4 show a 3 to 4 …
Similarly, when clustering the samples using UniFrac, the non-denoised and denoised reads produce very similar patterns, reinforcing the point that errors introduced into each sample by noise or chimeras have little effect on beta diversity because they inflate the distances among all samples rather than introducing artifactual similarities between specific pairs of samples15.
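The argument that sample-specific noise inflates distances rather than creating spurious similarity can be illustrated with a small sketch of phylogenetic diversity and unweighted UniFrac. The tree encoding (a dictionary mapping each node to its parent and branch length), the function names, and the toy tree are assumptions made for this example. The sketch measures phylogenetic diversity as the total branch length on the paths from a sample's tips to the root, which is one common convention and may differ from the one used in the study.

```python
def covered_branches(tips, tree):
    """Nodes whose branch to the parent lies on a path from a tip to the root."""
    covered = set()
    for tip in tips:
        node = tip
        # Stop at the root, or as soon as we reach an already covered path.
        while tree[node][0] is not None and node not in covered:
            covered.add(node)
            node = tree[node][0]
    return covered


def phylogenetic_diversity(tips, tree):
    """Total branch length connecting the sample's tips to the root."""
    return sum(tree[n][1] for n in covered_branches(tips, tree))


def unweighted_unifrac(tips_a, tips_b, tree):
    """Fraction of observed branch length that is unique to one sample."""
    a = covered_branches(tips_a, tree)
    b = covered_branches(tips_b, tree)
    shared_or_unique = sum(tree[n][1] for n in a | b)
    unique = sum(tree[n][1] for n in a ^ b)
    return unique / shared_or_unique


# Toy tree (illustrative only): node -> (parent, branch length to parent).
toy_tree = {
    "root": (None, 0.0),
    "X": ("root", 1.0), "Y": ("root", 1.0),
    "t1": ("X", 0.5), "t2": ("X", 0.5),
    "t3": ("Y", 0.5), "t4": ("Y", 0.5),
    # Short branches standing in for error-derived sequences, one per sample.
    "e1": ("X", 0.25), "e2": ("X", 0.25),
}

sample_a, sample_b = {"t1", "t2"}, {"t2", "t3"}
clean_distance = unweighted_unifrac(sample_a, sample_b, toy_tree)
noisy_distance = unweighted_unifrac(sample_a | {"e1"}, sample_b | {"e2"}, toy_tree)
```

In this toy case each spurious tip adds branch length that is unique to its own sample, so the noisy distance (0.625) is larger than the clean one (about 0.571). A single pair of samples does not establish the general claim, but it shows the mechanism by which independent errors push samples apart rather than together.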
We conclude that the availability of these new methods will make more accurate assessments of alpha diversity available to a wide range of researchers (especially in conjunction with improved chimera-checking methods such as ChimeraSlayer, http://microbiomeutil.sourceforge.net/), and will greatly improve our understanding of microbial communities in habitats with scales ranging from global to extremely personal. The efficiency of the new techniques and the fact that they can change conclusions about the relative diversity in different habitats suggest that they should be applied routinely in all pyrosequencing studies where estimates of diversity within each sample are the goal.