Metagenomic projects (e.g. [9]) often survey both genomic DNA and 16S rRNA. The latter is used to estimate microbial diversity, which is often described quantitatively in terms of operational taxonomic units (OTUs). Because of read-length limitations, it is not practical to sequence the full length of the 16S rRNA gene (~1.5 kb), so 16S rRNA studies often use individual variable regions (V1–V9) or sections that cover a few variable regions (e.g. V1–V3 and V3–V5). Pyrosequencing of 16S rRNA amplicons has been the dominant approach in rRNA studies. Finding OTUs from 16S rRNA tags can be readily addressed by clustering. Conventionally, tags with ≥97% identity are placed in the same OTU at the species level. CD-HIT [29] and DOTUR [49] were often used for OTU clustering in early studies.
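To make the 97% convention concrete, the following is a minimal sketch of greedy incremental clustering. It is not the CD-HIT or DOTUR implementation: the function names `identity` and `greedy_otus` are hypothetical, and the sketch assumes equal-length, pre-aligned tags so that identity reduces to the fraction of matching positions.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two tags.

    Assumption of this sketch: tags are equal-length and pre-aligned.
    Real tools compute identity from an alignment.
    """
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal-length, pre-aligned tags")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(tags, threshold=0.97):
    """Greedy incremental clustering.

    Each tag joins the first existing representative it matches at
    >= threshold identity; otherwise it founds a new OTU and becomes
    that OTU's representative.
    """
    otus = []  # list of (representative, members)
    for tag in tags:
        for rep, members in otus:
            if identity(rep, tag) >= threshold:
                members.append(tag)
                break
        else:
            # No representative was close enough: start a new OTU.
            otus.append((tag, [tag]))
    return otus
```

Because the result depends on input order, greedy methods usually process tags sorted by length or abundance; the sketch leaves that ordering to the caller.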
However, a major problem in OTU analysis is that directly clustering the raw rRNA reads, or even the high-quality reads, often greatly overestimates the diversity. A recent review [50] analyzed a list of methods and discussed solving this problem at the level of the clustering algorithm. That article suggested using average linkage-based hierarchical clustering methods such as ESPRIT [51], instead of greedy incremental methods such as CD-HIT [29] and Uclust [39], for OTU clustering.
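For contrast with the greedy approach, here is a minimal sketch of average-linkage agglomerative clustering. It is not ESPRIT; the names `distance` and `average_linkage` and the 0.03 cutoff (the complement of 97% identity) are assumptions for illustration, and the sketch again assumes equal-length, pre-aligned tags. Its cost grows quickly with the number of tags, so it is only meant to show the merging rule.

```python
from itertools import combinations


def distance(a, b):
    """1 - identity; assumes equal-length, pre-aligned tags (sketch only)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def average_linkage(tags, cutoff=0.03):
    """Repeatedly merge the two clusters with the smallest average
    pairwise distance until that distance exceeds the cutoff."""
    clusters = [[t] for t in tags]
    while len(clusters) > 1:
        best = None  # (average distance, index i, index j)
        for i, j in combinations(range(len(clusters)), 2):
            pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
            d = sum(distance(a, b) for a, b in pairs) / len(pairs)
            if best is None or d < best[0]:
                best = (d, i, j)
        d, i, j = best
        if d > cutoff:
            break
        # j > i, so deleting clusters[j] does not shift index i.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Unlike the greedy scheme, a merge here depends on all members of both clusters rather than on a single representative, so the result does not depend on input order.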
In the meantime, many other studies [52–56] found that the single biggest cause of the overestimation is sequence errors, or noise, so new methods such as SLP [52], PyroNoise [54], Denoiser [55] and AmpliconNoise [56] focus on identifying and removing sequence noise. All these methods find sequence errors by clustering analysis and are based on the principle that a high-abundance cluster can recruit small clusters and singletons, which have more sequence errors. SLP clusters the actual rRNA tags, and the rest of the methods cluster the original flowgram data. Currently, the best-performing method among them is AmpliconNoise [56], which has been benchmarked on several commonly used Mock data sets; these data sets are artificial mixtures of 16S rRNA clones at different abundance levels from a number of known species.
Although the speed of AmpliconNoise is considerably improved over its predecessor (PyroNoise), it is still quite computationally intensive. Recently, CD-HIT-OTU was introduced to the CD-HIT package. CD-HIT-OTU also uses a multi-step clustering method to remove reads with sequence errors and achieves results comparable with AmpliconNoise. However, as CD-HIT-OTU clusters sequences instead of flowgram data and inherits unique heuristics from CD-HIT, it is orders of magnitude faster than AmpliconNoise and other methods such as Denoiser. The accompanying table lists the performance of CD-HIT-OTU, AmpliconNoise and Denoiser (implemented in QIIME [57]) on clustering the Mock benchmark data sets [56] at the 97% identity level.
Accuracy and speed for OTU identification
CD-HIT-OTU has the following steps: (i) Raw reads with ambiguous base calls are removed. Reads are also removed if their 5′-ends do not match the user-provided primer sequence or a consensus built from the first k bases at the 5′-end of all reads (k = 6 by default, adjustable by users). For long reads, the portion of the 3′-tail that extends beyond the median read length is also trimmed off. (ii) Processed reads are clustered at 100% identity using CD-HIT-DUP. At this step, the reads from a unique rRNA template form one large primary cluster (which contains the error-free reads) and some small clusters, which contain reads with sequence errors. (iii) The representative sequences from step (ii) are sorted by abundance and then clustered by CD-HIT-EST at a threshold that allows up to two mismatches. For example, 200-bp reads are clustered at 99.0% identity, so that small clusters are recruited into their primary clusters. (iv) Let x be the median size of the small clusters recruited into the most abundant primary cluster with two mismatches. Clusters smaller than x are dominated by reads with more than two errors from the most abundant template, so these clusters are removed. Herein, x is often very small (2 or 3), so that rare species are still kept in the analysis. (v) The remaining representative sequences from step (ii) are clustered into OTUs using CD-HIT-EST (parameters: -c 0.97 -n 10 -l 11 -p 1 -d 0 -g 1). Herein, option ‘-c 0.97’ means 97% identity. (vi) The non-representative tags are recruited into the OTUs using CD-HIT-EST-2D (parameters: -c 0.97 -n 10 -l 11 -p 1 -d 0 -g 1).
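The noise-filtering logic of steps (ii) to (iv) can be sketched as follows. This is an illustrative approximation, not the CD-HIT-OTU code: the function names are hypothetical, mismatches are counted position by position on equal-length reads instead of through CD-HIT-EST alignments, and the fallback of x = 1 when nothing is recruited is an assumption of the sketch.

```python
from collections import Counter
from statistics import median


def identity_threshold(read_length, max_mismatches=2):
    """Identity cutoff that tolerates max_mismatches over one read,
    e.g. 200-bp reads with two mismatches give 0.99."""
    return (read_length - max_mismatches) / read_length


def mismatches(a, b):
    """Count differing positions; assumes equal-length, pre-aligned reads."""
    return sum(x != y for x, y in zip(a, b))


def filter_noise(reads, max_mismatches=2):
    """Return (kept, x): the 100%-identity clusters that survive the
    size cutoff x, as a dict of sequence -> read count."""
    # Step (ii): cluster at 100% identity (dereplicate).
    counts = Counter(reads)

    # Step (iii): visit unique sequences from most to least abundant and
    # recruit each one into the first primary within max_mismatches.
    ordered = [seq for seq, _ in counts.most_common()]
    primaries = {}  # primary sequence -> recruited small-cluster sequences
    for seq in ordered:
        for p in primaries:
            if len(p) == len(seq) and mismatches(p, seq) <= max_mismatches:
                primaries[p].append(seq)
                break
        else:
            primaries[seq] = []

    # Step (iv): x is the median size of the small clusters recruited by
    # the most abundant primary cluster (assumed 1 if none were recruited).
    sizes = [counts[s] for s in primaries[ordered[0]]]
    x = median(sizes) if sizes else 1

    # Clusters smaller than x are treated as noise and removed.
    kept = {seq: n for seq, n in counts.items() if n >= x}
    return kept, x
```

The surviving sequences would then go on to the 97% OTU clustering of step (v); a low-abundance sequence that is far from every primary cluster is kept as long as its size is at least x, which is how rare species survive the filter.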
The ultra-high speed of CD-HIT-OTU allows clustering of multi-million rRNA tags pooled from a series of related samples. Such clustering can significantly increase the accuracy of OTU identification, because tags shared by different samples validate each other. Clustering pooled samples may identify very rare OTUs, which may be missed if individual samples are processed independently. We applied CD-HIT-OTU to two pooled data sets, Human_gut_V6 [48] and Human_body_V2 [44]; these include 33 gut samples from obese and lean twin families and 815 samples from different body sites, respectively. CD-HIT-OTU took only a few minutes for these two data sets.
OTU analysis for pooled human gut and human body samples
In this analysis, we found that clustering the pooled samples identified 19–80 more rare OTUs than clustering individual samples for the 33 human gut data sets. For the 815 human body data sets, clustering pooled samples found up to 50 more rare OTUs. Clustering pooled samples also provides a very straightforward way to define a ‘core microbiome’ and to compare the diversity and composition of samples. For example, we calculated NAT50 for each sample. Herein, NAT50 is a diversity indicator we defined: the number of most abundant taxonomic groups that together cover 50% of the population. Figure 1 shows that obese samples have less diversity than lean samples. Please note that the abundance of OTUs is the abundance of rRNA genes and may not be the abundance of species, because the rRNA copy numbers are unknown. However, rRNA gene abundance largely correlates with species abundance. The full results for human gut and human body are also available as examples with the CD-HIT-OTU software, which is available from http://weizhongli-lab.org/cd-hit-otu. CD-HIT-OTU is also available as a web server within WebMGA [43], a collection of web servers for metagenomic data analysis.
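The NAT50 indicator defined above can be computed from a sample's OTU abundances as in the sketch below. The function name `nat` and the choice to treat "covering" as cumulative abundance greater than or equal to the target fraction are assumptions; the text does not specify how ties at the boundary are handled.

```python
def nat(abundances, fraction=0.5):
    """Number of most abundant groups whose combined abundance covers
    at least `fraction` of the total (fraction=0.5 gives NAT50)."""
    total = sum(abundances)
    if total <= 0:
        raise ValueError("abundances must sum to a positive value")
    covered = 0
    # Walk the groups from most to least abundant, accumulating coverage.
    for k, n in enumerate(sorted(abundances, reverse=True), start=1):
        covered += n
        if covered >= fraction * total:
            return k
    return len(abundances)


# Hypothetical sample with four OTUs (read counts per OTU).
sample = [40, 30, 20, 10]
nat50 = nat(sample, 0.5)   # 40 < 50, 40 + 30 >= 50, so two groups
nat99 = nat(sample, 0.99)  # all four groups are needed
```

A lower NAT value at a given fraction means that a few groups dominate the sample, which corresponds to lower diversity.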
Figure 1: Distribution of microbial diversity measured by NATs (NAT20, NAT50, NAT80 and NAT99) for 33 human gut samples. The x-axis is the NAT category. The y-axis is the NAT value. Samples are colored by sample type (obese, overweight or lean).