The potential of high-throughput sequencing as a tool for exploring biological diversity is great, but so too are the challenges that arise in its analysis. These technologies have made possible the characterization of very rare genotypes in heterogeneous populations of DNA at low cost. But when applied to a metagenomic sample, the resulting raw data consist of an unknown mixture of genotypes that are convolved with errors introduced during amplification and sequencing.
There are two broad approaches to high-throughput sequencing of metagenomes: in amplicon sequencing (also called gene-centric metagenomics), a pool of DNA for sequencing is produced by using PCR to amplify all the variant sequences in a sample that begin and end with a chosen pair of primers [1], frequently targeting hypervariable regions of the 16S ribosomal RNA gene [4]; in de novo genome assembly, total DNA is sequenced without amplification and reads are clustered into “species bins”, each providing the material for genome assembly by shotgun methods (see Table 1 in [5] for a list of such studies).
By trading off a broad survey of gene content for greater sequencing depth at the sampled loci, amplicon sequencing has the potential to detect the rarest members of the sampled community, but errors interfere more profoundly. Unlike genome assembly projects, where one needs only to determine the consensus base at each locus or decide whether a SNP is present in a population, the space of possible distributions of sample genotypes and frequencies is effectively infinite. As a result, ambiguities in genome projects can usually be resolved by increasing the amount of data, whereas under amplicon sequencing, increasing depth (as much as 10^6 reads in recent studies [6]) increases the number of both real and error-containing sequences, only making the challenge of distinguishing minority variants from errors greater. Greater depth therefore calls for progressively more sophisticated methods of analysis.
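To make the scaling concrete, consider a toy calculation in Python (the 0.5% per-base error rate, 250 bp read length, and 3-mismatch radius are assumed values chosen for illustration, not measurements): the number of reads carrying enough errors to escape any fixed error-correction radius grows linearly with depth.

    from math import comb

    # Assumed, illustrative parameters: per-base error rate and read length.
    p, L = 0.005, 250

    def prob_at_least_k_errors(k):
        """P(a read carries >= k errors), binomial over its L bases."""
        return 1.0 - sum(comb(L, i) * p**i * (1 - p)**(L - i) for i in range(k))

    for depth in (10**3, 10**6):
        # Expected reads with >= 4 errors, i.e. escaping a 3-mismatch radius.
        expected = depth * prob_at_least_k_errors(4)
        print(f"depth {depth:>7}: ~{expected:,.0f} reads beyond a 3-mismatch radius")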
The analysis of amplicon sequence data typically begins with the construction of OTUs (operational taxonomic units), clusters of sequences that are within a cutoff in Hamming distance of one another. OTUs serve to collapse the complete set of sequences into a smaller collection of representative sequences – one for each OTU – and corresponding abundances based on the number of reads falling within each cluster. OTUs were developed as a tool for classifying microbial species, but have also been repurposed to the task of correcting errors; the sequences within an OTU are typically interpreted as a taxonomic grouping, without specifying whether the variation within an OTU represents errors or real diversity on a finer scale than that chosen to define the OTU. If the scale of the noise is smaller than that of the clusters, then the construction of OTUs will appropriately group error-containing sequences together with their true genotype. However, as sequencing depth increases, low-probability errors outside the OTU radius will start to appear, and will be incorrectly assigned to their own OTUs. Early studies using this approach on high-throughput metagenome data sets reported large numbers of low-abundance, previously unobserved genotypes that were collectively dubbed the rare biosphere [7, 8]. Later, analyses of control data sets indicated that the diversity estimates in such studies tend to be highly inflated [9] and that results may lack reproducibility [10]. The dual purpose of OTUs for correcting errors and for taxonomic grouping is appropriate when diversity is being sampled at a coarse level, e.g. the frequency of different phyla. However, when probing finer-scale diversity, OTU methods have intrinsically high false positive and false negative rates: they both overestimate diversity when there exist errors larger than the OTU-defining cutoff and cannot resolve real diversity at a scale finer than that (arbitrary) cutoff.
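As a concrete illustration of this construction, the sketch below performs fixed-radius OTU clustering over unique sequences of equal length; the function names and the 3-mismatch cutoff are our own choices for illustration, not those of any particular published tool.

    def hamming(a, b):
        """Mismatch count between two equal-length sequences."""
        return sum(x != y for x, y in zip(a, b))

    def build_otus(reads, cutoff=3):
        """Fixed-radius OTU construction over unique sequences.

        `reads` maps each unique sequence to its read count. A sequence
        joins the first OTU whose representative lies within `cutoff`
        mismatches; otherwise it founds a new OTU. Returns a dict mapping
        each representative sequence to its summed abundance."""
        otus = {}
        for seq, n in reads.items():
            for rep in otus:
                if hamming(seq, rep) <= cutoff:
                    otus[rep] += n
                    break
            else:
                otus[seq] = n
        return otus

    # A sequence one mismatch from its parent is absorbed into the parent's
    # OTU; a sequence four mismatches away incorrectly founds its own.
    counts = {"ACGTACGT": 100, "ACGTACGA": 5, "TTTTTCGT": 2}
    print(build_otus(counts))  # {'ACGTACGT': 105, 'TTTTTCGT': 2}

Note that the single cutoff plays both roles at once: it absorbs error sequences and erases any real variation finer than the cutoff, which is exactly the tension described above.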
In response, a variety of approaches to disentangling errors from actual genetic variation have recently been proposed [11]. These include multiple rounds of OTU clustering with different hierarchical methods [11], utilizing sequence abundance information implicitly by starting new clusters with common sequences [12], and replacing OTU clustering with an Expectation-Maximization (EM) approach [13]. Accuracy has steadily improved, but all methods still fall short of maximizing the information acquired from metagenome data sets.
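Of these, the abundance-seeded strategy changes only the order in which sequences are visited: processing unique sequences from most to least abundant makes the common sequences, which are most likely to be real genotypes, the cluster seeds. Continuing the illustrative sketch above:

    def build_otus_by_abundance(reads, cutoff=3):
        """As build_otus above, but common sequences seed the clusters."""
        # build_otus visits sequences in insertion order, so sorting by
        # count first makes each cluster's seed its most abundant member.
        ranked = dict(sorted(reads.items(), key=lambda kv: -kv[1]))
        return build_otus(ranked, cutoff)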
We believe that the way forward is to model the error process and evaluate the validity of individual sequences in the context of the full metagenomic data set, crucially including the abundances (number of reads) corresponding to each sequence. Major progress in this direction has been made recently by Quince et al. [13, 14]. In the specific context of pyrosequencing, often used for metagenomics, strings of the same nucleotide (homopolymers) are problematic, and Quince et al. incorporated a model of the distribution of homopolymer light intensities into an EM algorithm, PyroNoise, which infers the homopolymer lengths of sequencing reads [13]. Later, Quince et al. introduced AmpliconNoise, an extension of PyroNoise that includes rates of single-nucleotide substitution errors obtained from training data [14]. These methods were shown to infer the underlying sample genotypes more accurately than other approaches, demonstrating the worth of explicitly modeling errors. However, the methods of Quince et al. have several shortcomings that we would like to rectify: (i) as the size of sequence data sets grows, AmpliconNoise becomes too slow to use in many applications; (ii) estimation of error rates relies on the existence of training data specific to the PCR and sequencing chemistries used; (iii) differentiation of fine-scale diversity is limited because read abundances are not fully utilized when calculating the distance between sequences and clusters; (iv) the parameters that determine how conservative the algorithm is in inferring real diversity are ad hoc and cannot be tuned without experiment-specific training data.
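To illustrate what an explicit, abundance-aware error model can look like, the toy sketch below scores whether the observed abundance of a sequence is plausible purely as errors from a candidate genotype; the substitution probabilities, the Poisson approximation, and all names here are our own illustrative assumptions, far simpler than the trained models of PyroNoise and AmpliconNoise.

    from math import exp

    # Toy substitution model (assumed, untrained values): probability that a
    # genotype base b is read as base a.
    SUB = {b: {a: (0.994 if a == b else 0.002) for a in "ACGT"} for b in "ACGT"}

    def p_read_given_genotype(read, genotype):
        """P(read | genotype) under independent per-base substitutions."""
        p = 1.0
        for a, b in zip(read, genotype):
            p *= SUB[b][a]
        return p

    def p_abundance_as_errors(read, n_reads, genotype, total_reads):
        """P(seeing >= n_reads copies of `read` purely as errors from
        `genotype`), via a Poisson approximation with mean
        total_reads * P(read | genotype). A tiny value is evidence that
        `read` reflects real diversity rather than error."""
        lam = total_reads * p_read_given_genotype(read, genotype)
        term, cdf = exp(-lam), 0.0
        for i in range(n_reads):  # P(X >= n) = 1 - P(X <= n - 1)
            cdf += term
            term *= lam / (i + 1)
        return 1.0 - cdf

The crucial point is that the abundance n_reads enters the test directly: a sequence seen once is easily explained as error, while the same sequence seen a hundred times may not be.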
We build on the error-modeling approach pioneered in AmpliconNoise by developing a novel algorithm, DADA, that denoises metagenomic amplicon sequence data and addresses the concerns raised above [15]. We start with a parametric statistical model of substitution errors. We incorporate this error model into a divisive hierarchical clustering algorithm that groups error-containing reads into clusters consistent with being derived from a single sample genotype. Finally, we couple this clustering algorithm with the inference of the error parameters from the clustered data, and perform each step in alternation until both converge. This method is presented below, and is shown to outperform previous methods in both speed and accuracy on several control data sets.
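A minimal sketch of this alternation follows, under our own simplifications (greedy abundance-seeded clustering with a radius derived from the current error rate stands in for the divisive hierarchical procedure and its statistical test; all names are illustrative, and this is not the DADA implementation itself):

    def hamming(a, b):
        """Mismatch count between two equal-length sequences."""
        return sum(x != y for x, y in zip(a, b))

    def assign_to_seeds(reads, radius):
        """Abundance-seeded greedy clustering: representative -> {seq: count}."""
        clusters = {}
        for seq, n in sorted(reads.items(), key=lambda kv: -kv[1]):
            for rep in clusters:
                if hamming(seq, rep) <= radius:
                    clusters[rep][seq] = n
                    break
            else:
                clusters[seq] = {seq: n}
        return clusters

    def denoise(reads, err=0.01, tol=1e-6, max_iter=20):
        """Alternate clustering and error-rate estimation to convergence.

        `reads` maps unique, equal-length sequences to read counts. The
        clustering radius tracks the current error-rate estimate, and the
        rate is then re-estimated from within-cluster mismatches."""
        L = len(next(iter(reads)))
        for _ in range(max_iter):
            radius = max(1, round(3 * err * L))  # ~3x the expected errors/read
            clusters = assign_to_seeds(reads, radius)
            mismatches = sum(hamming(seq, rep) * n
                             for rep, members in clusters.items()
                             for seq, n in members.items())
            bases = L * sum(reads.values())
            new_err = max(mismatches / bases, 1e-9)
            if abs(new_err - err) < tol:
                break
            err = new_err
        return clusters, err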