Home | About | Journals | Submit | Contact Us | Français |

**|**mSphere**|**v.2(2); Mar-Apr 2017**|**PMC5343174

Formats

Article sections

Authors

Related links

mSphere. 2017 Mar-Apr; 2(2): e00073-17.

Published online 2017 March 8. doi: 10.1128/mSphereDirect.00073-17

PMCID: PMC5343174

Ecological and Evolutionary Science

Katherine McMahon, Editor

Katherine McMahon, University of Wisconsin;

Reviewed by Mick Watson, External solicited reviewer and Katherine Pollard, External solicited reviewer

Mick Watson, Roslin Institute, University of Edinburgh;

Address correspondence to Patrick D. Schloss, Email: ude.hcimu@ssolhcsp.

Received 2017 February 9; Accepted 2017 February 15.

Copyright © 2017 Westcott and Schloss.

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International license.

This article has been cited by other articles in PMC.

Assignment of 16S rRNA gene sequences to operational taxonomic units (OTUs) is a computational bottleneck in the process of analyzing microbial communities. Although this has been an active area of research, it has been difficult to overcome the time and memory demands while improving the quality of the OTU assignments. Here, we developed a new OTU assignment algorithm that iteratively reassigns sequences to new OTUs to optimize the Matthews correlation coefficient (MCC), a measure of the quality of OTU assignments. To assess the new algorithm, OptiClust, we compared it to 10 other algorithms using 16S rRNA gene sequences from two simulated and four natural communities. Using the OptiClust algorithm, the MCC values averaged 15.2 and 16.5% higher than the OTUs generated when we used the average neighbor and distance-based greedy clustering with VSEARCH, respectively. Furthermore, on average, OptiClust was 94.6 times faster than the average neighbor algorithm and just as fast as distance-based greedy clustering with VSEARCH. An empirical analysis of the efficiency of the algorithms showed that the time and memory required to perform the algorithm scaled quadratically with the number of unique sequences in the data set. The significant improvement in the quality of the OTU assignments over previously existing methods will significantly enhance downstream analysis by limiting the splitting of similar sequences into separate OTUs and merging of dissimilar sequences into the same OTU. The development of the OptiClust algorithm represents a significant advance that is likely to have numerous other applications.

**IMPORTANCE** The analysis of microbial communities from diverse environments using 16S rRNA gene sequencing has expanded our knowledge of the biogeography of microorganisms. An important step in this analysis is the assignment of sequences into taxonomic groups based on their similarity to sequences in a database or based on their similarity to each other, irrespective of a database. In this study, we present a new algorithm for the latter approach. The algorithm, OptiClust, seeks to optimize a metric of assignment quality by shuffling sequences between taxonomic groups. We found that OptiClust produces more robust assignments and does so in a rapid and memory-efficient manner. This advance will allow for a more robust analysis of microbial communities and the factors that shape them.

**Podcast**: A podcast concerning this article is available.

Amplicon-based sequencing has provided incredible insights into Earth’s microbial biodiversity (1, 2). It has become common for studies to include sequencing millions of 16S rRNA gene sequences across hundreds of samples (3, 4). This is a sequencing depth 3 to 4 orders of magnitude greater than was previously achieved using Sanger sequencing (5, 6). The increased sequencing depth has revealed novel taxonomic diversity that is not adequately represented in reference databases (1, 3). However, the advance has forced reengineering of methods to overcome the rate- and memory-limiting steps in computational pipelines that process raw sequences through the generation of tables containing the number of sequences in different taxa for each sample (7,–10). A critical component of these pipelines has been the assignment of amplicon sequences to taxonomic units that are defined either based on similarity to a reference or operationally based on the similarity of the sequences to each other within the data set (11, 12).

A growing number of algorithms have been developed to cluster sequences into operational taxonomic units (OTUs). These algorithms can be classified into three general categories. The first category of algorithms has been termed closed reference or phylotyping (13, 14). Sequences are compared to a reference collection and clustered based on the reference sequences to which they are similar. This approach is fast; however, the method struggles when a sequence is similar to multiple reference sequences that may have different taxonomies and when it is not similar to sequences in the reference (15). The second category of algorithms has been called *de novo* because they assign sequences to OTUs without the use of a reference (14). These include hierarchical algorithms such as nearest, furthest, and average neighbor (16) and algorithms that employ heuristics such as abundance- or distance-based greedy clustering (AGC or DGC, respectively) as implemented in USEARCH (17) or VSEARCH (18), Sumaclust and OTUCLUST (19), and Swarm (20). *De novo* methods are agglomerative and tend to be more computationally intense. It has proven difficult to know which method generates the best assignments. A third category of algorithm is open-reference clustering, which is a hybrid approach (3, 14). Here, sequences are assigned to OTUs using closed-reference clustering, and sequences that are not within a threshold of a reference sequence are then clustered using a *de novo* approach. This category blends the strengths and weaknesses of the other method and adds the complication that closed-reference and *de novo* clustering use different OTU definitions. These three categories of algorithms take different approaches to handling large data sets to minimize the time and memory requirements while attempting to assign sequences to meaningful OTUs.

Several metrics have emerged for assessing the quality of OTU assignment algorithms. These have included the time and memory required to run the algorithm (3, 20,–22), agreement between OTU assignments and the sequences’ taxonomy (20, 22,–32), sensitivity of an algorithm to stochastic processes (33), the number of OTUs generated by the algorithm (23, 34), and the ability to regenerate the assignments made by other algorithms (3, 35). Unfortunately, these methods fail to directly quantify the quality of the OTU assignments. An algorithm may complete with minimal time and memory requirements or generate an idealized number of OTUs, but the composition of the OTUs could be incorrect. These metrics also tend to be subjective. For instance, a method may appear to recapitulate the taxonomy of a synthetic community with known taxonomic structure but do a poor job when applied to real communities with poorly defined taxonomic structure or for sequences that are prone to misclassification. As an alternative, we developed an approach to objectively benchmark the clustering quality of OTU assignments (13, 15, 36). This approach counts the number of true positives (TPs), true negatives (TNs), false positives (FPs), and false negatives (FNs) based on the pairwise distances. Sequence pairs that are within the user-specified threshold and are clustered together represent TPs, and those in different OTUs are FNs. Those sequence pairs that have a distance larger than the threshold and are not clustered in the same OTU are TNs, and those in the same OTU are FPs. These values can be synthesized into a single correlation coefficient, the Matthews correlation coefficient (MCC), which measures the correlation between observed and predicted classifications and is robust to cases where there is an uneven distribution across the confusion matrix (37). Consistently, the average neighbor algorithm was identified as among the best or as the best algorithm. Other hierarchical algorithms, such as furthest and nearest neighbor, which do not permit the formation of FPs or FNs, respectively, fared significantly worse. The distance-based greedy clustering as implemented in VSEARCH has also performed well. The computational resources required to complete the average neighbor algorithm can be significant for large data sets, and so there is a need for an algorithm that efficiently produces consistently high-quality OTU assignments.

These benchmarking efforts have assessed the quality of the clusters after the completion of the algorithm. In the current study, we developed and benchmarked a new *de novo* clustering algorithm that uses real-time calculation of the MCC to direct the progress of the clustering. The result is the OptiClust algorithm, which produces significantly better sequence assignments while making efficient use of computational resources.

The OptiClust algorithm uses the pairs of sequences that are within a desired threshold of each other (e.g., 0.03), a list of all sequence names in the data set, and the metric that should be used to assess clustering quality. A detailed description of the algorithm is provided for a toy data set in the supplemental material. Briefly, the algorithm starts by placing each sequence either within its own OTU or into a single OTU. The algorithm proceeds by interrogating each sequence and recalculating the metric for the cases where the sequence stays in its current OTU, is moved to each of the other OTUs, or is moved into a new OTU. The location that results in the best clustering quality indicates whether the sequence should remain in its current OTU or be moved to a different or new OTU. Each iteration consists of interrogating every sequence in the data set. Although numerous options are available for optimizing the clusters and for assessing the quality of the clusters within the mothur-based implementation of the algorithm (e.g., sensitivity, specificity, accuracy, F1 score, etc.), the default metric for optimization and assessment is MCC because it includes all four parameters from the confusion matrix (see Fig. S1 and Table S1 in the supplemental material). The algorithm continues until the optimization metric stabilizes or until it reaches a defined stopping criterion.

10.1128/mSphereDirect.00073-17.1#### TEXT S1

Worked example of how OptiClust algorithm clusters sequences into OTUs. Download TEXT S1, PDF file, 1.6 MB.

Copyright © 2017 Westcott and Schloss.

This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.

10.1128/mSphereDirect.00073-17.2#### FIG S1

The OptiClust algorithm is able to effectively cluster sequences into OTUs by minimizing or maximizing numerous metrics. Plot of MCC (A), number of OTUs (B), and execution times (C) for the comparison of output from the OptiClust algorithm when minimizing or maximizing a variety of parameters when applied to four natural and two synthetic data sets. Within mothur, OTU assignments can also be made using other metrics, including minimizing false positives and maximizing the specificity, positive predictive value, and true negatives; however, these all resulted in sequences being assigned to separate OTUs, which resulted in no false positives and the maximum number of true negatives. The error bars indicate the range of values observed for 10 replicates. Download FIG S1, EPS file, 0.8 MB.

Copyright © 2017 Westcott and Schloss.

This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.

10.1128/mSphereDirect.00073-17.4#### TABLE S1

Summary of the average number of true positives, true negatives, false positives, false negatives, and the resulting Matthews correlation coefficient for each of the clustering methods that were analyzed in this study for each of the six data sets. Blank values indicate that those conditions could not be completed in 50 h with 45 GB of RAM. Download TABLE S1, PDF file, 0.02 MB.

Copyright © 2017 Westcott and Schloss.

This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.

To evaluate the OptiClust algorithm and compare its performance to other algorithms, we utilized six data sets including two synthetic communities and four previously published large data sets generated from soil, marine, human, and murine samples (Table 1). When we seeded the OptiClust algorithm with each sequence in a separate OTU and ran the algorithm until complete convergence, the MCC values averaged 15.2 and 16.5% higher than the OTUs using average neighbor and distance-based greedy clustering (DGC) with VSEARCH, respectively (Fig. 1; Table S1). The number of OTUs formed by the various methods was negatively correlated with their MCC value (ρ = −0.47; *P* < 0.001). The OptiClust algorithm was considerably faster than the hierarchical algorithms and somewhat slower than the heuristic-based algorithms. Across the six data sets, the OptiClust algorithm was 94.6 times faster than average neighbor and just as fast as DGC with VSEARCH. The human data set was a challenge for a number of the algorithms. OTUCLUST and Sumaclust were unable to cluster the human data set in less than 50 h, and the average neighbor algorithm required more than 45 GB of RAM. The USEARCH-based methods were unable to cluster the human data using the 32-bit free version of the software that limits the amount of RAM to approximately 3.5 GB. These data demonstrate that OptiClust generated significantly more robust OTU assignments than existing methods across a diverse collection of data sets with performance that was comparable to popular methods.

Description of data sets used to evaluate the OptiClust algorithm and compare its performance to other algorithms^{a}

By default, the mothur-based implementation of the algorithm stops when the optimization metric changes by less than 0.0001; however, this can be altered by the user. This implementation also allows the user to stop the algorithm if a maximum number of iterations is exceeded. By default, mothur uses a maximum value of 100 iterations. The justification for allowing incomplete convergence was based on the observation that numerous iterations are performed that extend the time required to complete the clustering with minimal improvement in clustering (Fig. S2). We evaluated the results of clustering to partial convergence (i.e., a change in the MCC value that was less than 0.0001) or until complete convergence of the MCC value (i.e., until it did not change between iterations) when seeding the algorithm with each sequence in a separate OTU (Fig. 1). The small difference in MCC values between the output from partial and complete convergence resulted in a difference in the median number of OTUs that ranged between 1.5 and 17.0 OTUs. This represented a difference of less than 0.15%. Among the four natural data sets, between 3 and 6 iterations were needed to achieve partial convergence and between 8 and 12 iterations were needed to reach full convergence. The additional steps required between 1.4 and 1.7 times longer to complete the algorithm. These results suggest that achieving full convergence of the optimization metric adds computational effort; however, considering that full convergence took between 2 and 17 min, the extra effort was relatively small. Although the mothur default setting is partial convergence, the remainder of our analysis used complete convergence to be more conservative.

10.1128/mSphereDirect.00073-17.3#### FIG S2

Copyright © 2017 Westcott and Schloss.This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.

The OptiClust algorithm rapidly converges to optimize the Matthews correlation coefficient. The six data sets were clustered into OTUs using the OptiClust algorithm seeking to maximize the Matthews correlation coefficient. This was repeated 10 times for each data set. Download FIG S2, EPS file, 0.3 MB.

By default, the mothur implementation of the OptiClust algorithm starts with each sequence in a separate OTU. An alternative approach is to start with all of the sequences in a single OTU. We found that the MCC values for clusters generated when seeding OptiClust with the sequences as a single OTU were between 0% and 11.5% lower than when seeding the algorithm with sequences in separate OTUs (Fig. 1). Interestingly, with the exception of the human data set (0.2% more OTUs), the number of OTUs was as much as 7.0% lower (mice) than when the algorithm was seeded with sequences in separate OTUs. Finally, the time required to cluster the data when the algorithm was seeded with a single OTU was between 1.5 and 2.9 times longer than if sequences were seeded as separate OTUs. This analysis demonstrates that seeding the algorithm with sequences as separate OTUs resulted in the best OTU assignments in the shortest period of time.

One concern that many have with *de novo* clustering algorithms is that their output is sensitive to the initial order of the sequences because each algorithm must break ties where a sequence could be assigned to multiple OTUs. An additional concern specific to the OptiClust algorithm is that it may stabilize at a local optimum. To evaluate these concerns, we compared the results obtained using 10 randomizations of the order that sequences were given to the algorithm. The median coefficient of variation across the six data sets for MCC values obtained from the replicate clusterings using OptiClust was 0.1% (Fig. 1). We also measured the coefficient of variation for the number of OTUs across the six data sets for each method. The median coefficient of variation for the number of OTUs generated using OptiClust was 0.1%. Confirming our previous results (15), all of the methods that we tested were stable to stochastic processes. Of the methods that involved randomization, the coefficient of variation for MCC values was considerably smaller with OptiClust than with the other methods, and the coefficient of variation for the number of OTUs was comparable to the other methods. The variation observed in clustering quality suggested that the algorithm does not appear to converge to a locally optimum MCC value. More importantly, the random variation does yield output of a similarly high quality.

Although not as important as the quality of clustering, the amount of time and memory required to assign sequences to OTUs is a legitimate concern. We observed that the time required to complete the OptiClust algorithm (Fig. 1C) paralleled the number of pairwise distances that were smaller than 0.03 (Table 1). To further evaluate how the speed and memory usage scaled with the number of sequences in the data set, we measured the time required and maximum RAM usage to cluster 20, 40, 60, 80, and 100% of the unique sequences from each of the natural data sets using the OptiClust algorithm (Fig. 2). Within each iteration of the algorithm, each sequence is compared to every other sequence and each comparison requires a recalculation of the confusion matrix. This would result in a worst-case algorithmic complexity on the order of *n*^{3}, where *n* is the number of unique sequences. Because the algorithm needs to keep track of only the sequence pairs that are within the threshold of each other, it is likely that the implementation of the algorithm is more efficient. To empirically determine the algorithmic complexity, we fitted a power law function to the data in Fig. 2A. We observed power coefficients between 1.7 and 2.5 for the marine and human data sets, respectively. The algorithm requires storing a matrix that contains the pairs of sequences that are close to each other as well as a matrix that indicates which sequences are clustered together. The memory required to store these matrices is on the order of *n*^{2}, where *n* is the number of unique sequences. In fact, when we fitted a power law function to the data in Fig. 2B, the power coefficients were 1.9. Using the four natural community data sets, doubling the number of sequences in a data set would increase the time required to cluster the data by 4- to 8-fold and increase the RAM required by 4-fold. It is possible that future improvements to the implementation of the algorithm could improve this performance.

We previously described a heuristic to accelerate OTU assignments where sequences were first classified to taxonomic groups and within each taxon sequences were assigned to OTUs using the average neighbor clustering algorithm (13). This method is similar to open-reference clustering except that in our approach all sequences are subjected to *de novo* clustering following classification, whereas in open-reference clustering only those sequences that cannot be classified are subjected to *de novo* clustering. Our cluster splitting approach accelerated the clustering and reduced the memory requirements because the number of unique sequences was effectively reduced by splitting sequences across taxonomic groups. Furthermore, because sequences in different taxonomic groups are assumed to belong to different OTUs, they are independent, which permits parallelization and additional reduction in computation time. Reduction in clustering quality is encountered in this approach if there are errors in classification or if two sequences within the desired threshold belong to different taxonomic groups. It is expected that these errors would increase as the taxonomic level goes from kingdom to genus. To characterize the clustering quality, we classified each sequence at each taxonomic level and calculated the MCC values using OptiClust, average neighbor, and DGC with VSEARCH when splitting at each taxonomic level (Fig. 3). For each method, the MCC values decreased as the taxonomic resolution increased; however, the decrease in MCC was not as large as the difference between clustering methods. As the resolution of the taxonomic levels increased, the clustering quality remained high, relative to clusters formed from the entire data set (i.e., kingdom level). The MCC values when splitting the data sets at the class and genus levels were within 98.0 and 93.0%, respectively, of the MCC values obtained from the entire data set. These decreases in MCC value resulted in the formation of as many as 4.7 and 22.5% more OTUs, respectively, than were observed from the entire data set. These errors were due to the generation of additional false negatives due to splitting similar sequences into different taxonomic groups. For the data sets included in the current analysis, the use of the cluster splitting heuristic was probably not worth the loss in clustering quality. However, as data sets become larger, it may be necessary to use the heuristic to cluster the data into OTUs.

Myriad methods have been proposed for assigning 16S rRNA gene sequences to OTUs. Each claims improved performance based on speed, memory usage, representation of taxonomic information, and number of OTUs. Each of these metrics is subjective and does not actually indicate the quality of the clustering. This led us to propose using the MCC as a metric for assessing the quality of clustering, *post hoc*. Here, we described a new clustering method that seeks to optimize clustering based on an objective criterion that measures clustering quality in real time. In the OptiClust algorithm, clustering is driven by optimizing a metric that assesses whether any two sequences should be grouped into the same OTU. The result is clusters that are significantly more robust and is efficient in the time and memory required to cluster the sequences into OTUs. This makes it more tractable to analyze large data sets without sacrificing clustering quality, as was previously necessary using heuristic methods.

The cluster optimization procedure is dependent on the metric that is chosen for optimization. We employed the MCC because it includes the four values from a confusion matrix. Other algorithms, such as the furthest neighbor and nearest neighbor algorithms, minimize the number of FPs and FNs, respectively; however, these suffer because the numbers of FNs and FPs are not controlled, respectively (13, 16). Alternatively, one could optimize based on the sensitivity, specificity, or accuracy, which is each based on two values from the confusion matrix, or could optimize based on the F1 score, which is based on three values from the confusion matrix. Because these metrics do not balance all four parameters equally, it is likely that one parameter will dominate in the optimization procedure. For example, optimizing for sensitivity could lead to a large number of FPs. A higher number of FPs increases the number of OTUs, while a higher number of FNs collapses OTUs together. It is difficult to know which is worse since community richness and diversity are linked to the number of OTUs. In addition, increasing the number of FNs would overstate the differences between communities while increasing the number of FPs would overstate their similarity. Therefore, it is important to jointly minimize the number of FPs and FNs. With this in mind, we decided to optimize utilizing the MCC. It is possible that other metrics that balance the four parameters could be developed and employed for optimization of the clustering.

The OptiClust algorithm is relatively simple. For each sequence, it effectively asks whether the MCC value will increase if the sequence is moved to a different OTU, including creating a new OTU. If the value does not change, it remains in the current OTU. The algorithm repeats until the MCC value stabilizes. Assuming that the algorithm is seeded with each sequence in a separate OTU, it does not appear that the algorithm converges to a local optimum. Furthermore, execution of the algorithm with different random number generator seeds produces OTU assignments of consistently high quality. Future improvements to the implementation of the algorithm could provide optimization to further improve its speed and susceptibility to find a local optimum. Users are encouraged to repeat the OTU assignment several times to confirm that they have found the best OTU assignments.

Our previous MCC-based analysis of clustering algorithms indicated that the average neighbor algorithm consistently produced the best OTU assignments, with the DGC-based method using USEARCH also producing robust OTU assignments. The challenge in using the average neighbor algorithm is that it requires a large amount of RAM and is computationally demanding. This led to the development of a splitting approach that divides the clustering across distinct taxonomic groups (13). The improved performance provided by the OptiClust algorithm likely makes such splitting unnecessary for most current data sets. We have demonstrated that although the OTU assignments made at the genus level are still better than those of other methods, the quality is not as good as that found without splitting. The loss of quality is likely due to misclassification because of limitations in the clustering algorithms and reference databases. The practical significance of such small differences in clustering quality remain to be determined; however, based on the current analysis, it does appear that the number of OTUs is artificially inflated. Regardless, the best clustering quality should be pursued given the available computer resources.

The time and memory required to execute the OptiClust algorithm scaled proportionally to the number of unique sequences raised to the second power. The power for the time requirement is affected by the similarity of the sequences in the data set, with data sets containing more-similar sequences having a higher power. Also, the number of unique sequences is the basis for the amount of both time and memory required to complete the algorithm. Both the similarity of sequences and the number of unique sequences can be driven by the sequencing error, since any errors will increase the number of unique sequences and these sequences will be closely related to the perfect sequence. This underscores the importance of reducing the noise in the sequence data (7). If sequencing errors are not remediated and are relatively randomly distributed, then it is likely that the algorithm will require an unnecessary amount of time and RAM to complete.

The rapid expansion in sequencing capacity has demanded that the algorithms used to assign 16S rRNA gene sequences to OTUs be efficient while maintaining robust assignments. Although database-based approaches have been proposed to facilitate this analysis, they are limited by their limited coverage of bacterial taxonomy and by the inconsistent process used to name taxa. The ability to assign sequences to OTUs using an algorithm that optimizes clustering by directly measuring quality will significantly enhance downstream analysis. The development of the OptiClust algorithm represents a significant advance that is likely to have numerous other applications.

To evaluate the OptiClust and the other algorithms, we created two synthetic sequence collections and four sequence collections generated from previously published studies. The V4 region of the 16S rRNA gene was used from all data sets because it is a popular region that can be fully sequenced with 2-fold coverage using the commonly used MiSeq sequencer from Illumina (7). The method for generating the simulated data sets followed the approach used by Kopylova et al. (34) and Schloss (36). Briefly, we randomly selected 10,000 unique V4 fragments from 16S rRNA gene sequences that were unique from the SILVA nonredundant database (38). A community with an even relative abundance profile was generated by specifying that each sequence had a frequency of 100 reads. A community with a staggered relative abundance profile was generated by specifying that the abundance of each sequence was a randomly drawn integer sampled from a uniform distribution between 1 and 200. Sequence collections collected from human feces (39), murine feces (40), soil (41), and seawater (42) were used to characterize the algorithms’ performance with natural communities. These sequence collections were all generated using paired 150- or 250-nucleotide (nt) reads of the V4 region. We reprocessed all of the reads using a common analysis pipeline that included quality score-based error correction (7), alignment against a SILVA reference database (38, 43), screening for chimeras using UCHIME (9), and classification using a naive Bayesian classifier with the RDP training set requiring an 80% confidence score (10).

In addition to the OptiClust algorithm, we evaluated 10 different *de novo* clustering algorithms. These included three hierarchical algorithms, average neighbor, nearest neighbor, and furthest neighbor, which are implemented in mothur (v.1.39.0) (11). Seven heuristic methods were also used, including abundance-based greedy clustering (AGC) and distance-based greedy clustering (DGC) as implemented in USEARCH (v.6.1) (17) and VSEARCH (v.2.3.3) (18), OTUCLUST (v.0.1) (19), and Sumaclust (v.1.0.20) and Swarm (v.2.1.9) (20). With the exception of Swarm, each of these methods uses distance-based thresholds to report OTU assignments. We also evaluated the ability of OptiClust to optimize to metrics other than MCC. These included accuracy, F1 score, negative predictive value, positive predictive value, false discovery rate, sensitivity, specificity, the sum of TPs and TNs, the sum of FPs and FNs, and the number of FNs, FPs, TNs, and TPs (see Fig. S1 and Table S1 in the supplemental material).

We evaluated the quality of the sequence clustering, the reproducibility of the clustering, the speed of clustering, and the amount of memory required to complete the clustering. To assess the quality of the clusters generated by each method, we counted the cells within a confusion matrix that indicated how well the clusterings represented the distances between the pair of sequences (13). Pairs of sequences that were in the same OTU and had a distance less than 3% were true positives (TPs), those that were in different OTUs and had a distance greater than 3% were true negatives (TNs), those that were in the same OTU and had a distance greater than 3% were false positives (FPs), and those that were in different OTUs and had a distance less than 3% were false negatives (FNs). To synthesize the matrix into a single metric, we used the Matthews correlation coefficient using the sens.spec command in mothur using the following equation:

$$\text{MCC}=\frac{\text{TP}\times \text{TN}-\text{FP}\times \text{FN}}{\sqrt{\left(\text{TP}+\text{FP}\right)\left(\text{TP}+\text{FN}\right)\left(\text{TN}+\text{FP}\right)\left(\text{TN}+\text{FN}\right)}}$$

To assess the reproducibility of the algorithms, we randomized the starting order of each sequence collection 10 times and ran each algorithm on each randomized collection. We then measured the MCC for each randomization and quantified their percent coefficient of variation (% CV; 100 times the ratio of the standard deviation to the mean).

To assess how the memory and time requirements scaled with the number of sequences included in each sequence collection, we randomly subsampled 20, 40, 60, or 80% of the unique sequences in each collection. We obtained 10 subsamples at each depth for each data set and ran each collection (*n =* 50 = 5 sequencing depths × 10 replicates) through each of the algorithms. We used the timeout script to quantify the maximum RAM used and the amount of time required to process each sequence collection (https://github.com/pshved/timeout). We limited each algorithm to 45 GB of RAM and 50 h using a single processor.

The workflow utilized commands in GNU make (v.3.81), GNU bash (v.4.1.2), mothur (v.1.39.0) (11), and R (v.3.3.2) (44). Within R, we utilized the wesanderson (v.0.3.2) (45), dplyr (v.0.5.0) (46), tidyr (v.0.6.0) (47), cowplot (v.0.6.3) (48), and ggplot2 (v.2.2.0.9000) (49) packages. A reproducible version of the manuscript and analysis is available at https://github.com/SchlossLab/Westcott_OptiClust_mSphere_2017.

This work was supported through funding from the National Institutes of Health to P.D.S. (P30DK034933).

S.L.W. designed, implemented, and evaluated the algorithm. P.D.S. designed and evaluated the algorithm. Both authors wrote and edited the manuscript.

This paper was submitted via the mSphereDirect™ pathway.

Mick Watson, Roslin Institute, University of Edinburgh.

Katherine Pollard, Gladstone Institutes.

1. Schloss PD, Girard RA, Martin T, Edwards J, Thrash JC
2016.
Status of the archaeal and bacterial census: an update. mBio
7:e00201-16. doi:.10.1128/mBio.00201-16 [PMC free article] [PubMed] [Cross Ref]

2. Locey KJ, Lennon JT
2016.
Scaling laws predict global microbial diversity. Proc Natl Acad Sci U S A
113:5970–5975. doi:.10.1073/pnas.1521291113 [PubMed] [Cross Ref]

3. Rideout JR, He Y, Navas-Molina JA, Walters WA, Ursell LK, Gibbons SM, Chase J, McDonald D, Gonzalez A, Robbins-Pianka A, Clemente JC, Gilbert JA, Huse SM, Zhou HW, Knight R, Caporaso JG
2014.
Subsampled open-reference clustering creates consistent, comprehensive OTU definitions and scales to billions of sequences. PeerJ
2:e545. doi:.10.7717/peerj.545 [PMC free article] [PubMed] [Cross Ref]

4. The Human Microbiome Project Consortium
2012.
Structure, function and diversity of the healthy human microbiome. Nature
486:207–214. doi:.10.1038/nature11234 [PMC free article] [PubMed] [Cross Ref]

5. Eckburg PB, Bik EM, Bernstein CN, Purdom E, Dethlefsen L, Sargent M, Gill SR, Nelson KE, Relman DA
2005.
Diversity of the human intestinal microbial flora. Science
308:1635–1638. doi:.10.1126/science.1110591 [PMC free article] [PubMed] [Cross Ref]

6. Elshahed MS, Youssef NH, Spain AM, Sheik C, Najar FZ, Sukharnikov LO, Roe BA, Davis JP, Schloss PD, Bailey VL, Krumholz LR
2008.
Novelty and uniqueness patterns of rare members of the soil biosphere. Appl Environ Microbiol
74:5422–5428. doi:.10.1128/aem.00410-08 [PMC free article] [PubMed] [Cross Ref]

7. Kozich JJ, Westcott SL, Baxter NT, Highlander SK, Schloss PD
2013.
Development of a dual-index sequencing strategy and curation pipeline for analyzing amplicon sequence data on the MiSeq Illumina sequencing platform. Appl Environ Microbiol
79:5112–5120. doi:.10.1128/aem.01043-13 [PMC free article] [PubMed] [Cross Ref]

8. Schloss PD.
2009.
A high-throughput DNA sequence aligner for microbial ecology studies. PLoS One
4:e8230. doi:.10.1371/journal.pone.0008230 [PMC free article] [PubMed] [Cross Ref]

9. Edgar RC, Haas BJ, Clemente JC, Quince C, Knight R
2011.
UCHIME improves sensitivity and speed of chimera detection. Bioinformatics
27:2194–2200. doi:.10.1093/bioinformatics/btr381 [PMC free article] [PubMed] [Cross Ref]

10. Wang Q, Garrity GM, Tiedje JM, Cole JR
2007.
Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microbiol
73:5261–5267. doi:.10.1128/aem.00062-07 [PMC free article] [PubMed] [Cross Ref]

11. Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, Lesniewski RA, Oakley BB, Parks DH, Robinson CJ, Sahl JW, Stres B, Thallinger GG, Horn DJV, Weber CF
2009.
Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol
75:7537–7541. doi:.10.1128/aem.01541-09 [PMC free article] [PubMed] [Cross Ref]

12. Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello EK, Fierer N, Peña AG, Goodrich JK, Gordon JI, Huttley GA, Kelley ST, Knights D, Koenig JE, Ley RE, Lozupone CA, McDonald D, Muegge BD, Pirrung M, Reeder J, Sevinsky JR, Turnbaugh PJ, Walters WA, Widmann J, Yatsunenko T, Zaneveld J, Knight R
2010.
QIIME allows analysis of high-throughput community sequencing data. Nat Methods
7:335–336. doi:.10.1038/nmeth.f.303 [PMC free article] [PubMed] [Cross Ref]

13. Schloss PD, Westcott SL
2011.
Assessing and improving methods used in operational taxonomic unit-based approaches for 16S rRNA gene sequence analysis. Appl Environ Microbiol
77:3219–3226. doi:.10.1128/aem.02810-10 [PMC free article] [PubMed] [Cross Ref]

14. Navas-Molina JA, Peralta-Sánchez JM, González A, McMurdie PJ, Vázquez-Baeza Y, Xu Z, Ursell LK, Lauber C, Zhou H, Song SJ, Huntley J, Ackermann GL, Berg-Lyons D, Holmes S, Caporaso JG, Knight R
2013.
Advancing our understanding of the human microbiome using QIIME. Methods Enzymol
531:371–444. doi:.10.1016/B978-0-12-407863-5.00019-8 [PMC free article] [PubMed] [Cross Ref]

15. Westcott SL, Schloss PD
2015.
De novo clustering methods outperform reference-based methods for assigning 16S rRNA gene sequences to operational taxonomic units. PeerJ
3:e1487. doi:.10.7717/peerj.1487 [PMC free article] [PubMed] [Cross Ref]

16. Schloss PD, Handelsman J
2005.
Introducing DOTUR, a computer program for defining operational taxonomic units and estimating species richness. Appl Environ Microbiol
71:1501–1506. doi:.10.1128/AEM.71.3.1501-1506.2005 [PMC free article] [PubMed] [Cross Ref]

17. Edgar RC.
2010.
Search and clustering orders of magnitude faster than BLAST. Bioinformatics
26:2460–2461. doi:.10.1093/bioinformatics/btq461 [PubMed] [Cross Ref]

18. Rognes T, Flouri T, Nichols B, Quince C, Mahé F
2016.
VSEARCH: a versatile open source tool for metagenomics. PeerJ
4:e2584. doi:.10.7717/peerj.2584 [PMC free article] [PubMed] [Cross Ref]

19. Albanese D, Fontana P, De Filippo C, Cavalieri D, Donati C
2015.
MICCA: a complete and accurate software for taxonomic profiling of metagenomic data. Sci Rep
5:9743. doi:.10.1038/srep09743 [PMC free article] [PubMed] [Cross Ref]

20. Mahé F, Rognes T, Quince C, de Vargas C, Dunthorn M
2014.
Swarm: robust and fast clustering method for amplicon-based studies. PeerJ
2:e593. doi:.10.7717/peerj.593 [PMC free article] [PubMed] [Cross Ref]

21. Sun Y, Cai Y, Liu L, Yu F, Farrell ML, McKendree W, Farmerie W
2009.
ESPRIT: estimating species richness using large collections of 16S rRNA pyrosequences. Nucleic Acids Res
37:e76. doi:.10.1093/nar/gkp285 [PMC free article] [PubMed] [Cross Ref]

22. Cai Y, Sun Y
2011.
ESPRIT-tree: hierarchical clustering analysis of millions of 16S rRNA pyrosequences in quasilinear computational time. Nucleic Acids Res
39:e95. doi:.10.1093/nar/gkr349 [PMC free article] [PubMed] [Cross Ref]

23. Edgar RC.
2013.
UPARSE: highly accurate OTU sequences from microbial amplicon reads. Nat Methods
10:996–998. doi:.10.1038/nmeth.2604 [PubMed] [Cross Ref]

24. Mahé F, Rognes T, Quince C, de Vargas C, Dunthorn M
2015.
Swarm v2: highly-scalable and high-resolution amplicon clustering. PeerJ
3:e1420. doi:.10.7717/peerj.1420 [PMC free article] [PubMed] [Cross Ref]

25. Barriuso J, Valverde JR, Mellado RP
2011.
Estimation of bacterial diversity using next generation sequencing of 16S rDNA: a comparison of different workflows. BMC Bioinformatics
12:473. doi:.10.1186/1471-2105-12-473 [PMC free article] [PubMed] [Cross Ref]

26. Bonder MJ, Abeln S, Zaura E, Brandt BW
2012.
Comparing clustering and pre-processing in taxonomy analysis. Bioinformatics
28:2891–2897. doi:.10.1093/bioinformatics/bts552 [PubMed] [Cross Ref]

27. Chen W, Zhang CK, Cheng Y, Zhang S, Zhao H
2013.
A comparison of methods for clustering 16S rRNA sequences into OTUs. PLoS One
8:e70837. doi:.10.1371/journal.pone.0070837 [PMC free article] [PubMed] [Cross Ref]

28. Huse SM, Welch DM, Morrison HG, Sogin ML
2010.
Ironing out the wrinkles in the rare biosphere through improved OTU clustering. Environ Microbiol
12:1889–1898. doi:.10.1111/j.1462-2920.2010.02193.x [PMC free article] [PubMed] [Cross Ref]

29. May A, Abeln S, Crielaard W, Heringa J, Brandt BW
2014.
Unraveling the outcome of 16S rDNA-based taxonomy analysis through mock data and simulations. Bioinformatics
30:1530–1538. doi:.10.1093/bioinformatics/btu085 [PubMed] [Cross Ref]

30. Sun Y, Cai Y, Huse SM, Knight R, Farmerie WG, Wang X, Mai V
2012.
A large-scale benchmark study of existing algorithms for taxonomy-independent microbial community analysis. Brief Bioinform
13:107–121. doi:.10.1093/bib/bbr009 [PMC free article] [PubMed] [Cross Ref]

31. White JR, Navlakha S, Nagarajan N, Ghodsi MR, Kingsford C, Pop M
2010.
Alignment and clustering of phylogenetic markers—implications for microbial diversity studies. BMC Bioinformatics
11:152. doi:.10.1186/1471-2105-11-152 [PMC free article] [PubMed] [Cross Ref]

32. Al-Ghalith GA, Montassier E, Ward HN, Knights D
2016.
NINJA-OPS: fast accurate marker gene alignment using concatenated ribosomes. PLoS Comput Biol
12:e1004658. doi:.10.1371/journal.pcbi.1004658 [PMC free article] [PubMed] [Cross Ref]

33. He Y, Caporaso JG, Jiang XT, Sheng HF, Huse SM, Rideout JR, Edgar RC, Kopylova E, Walters WA, Knight R, Zhou HW
2015.
Stability of operational taxonomic units: an important but neglected property for analyzing microbial diversity. Microbiome
3:20. doi:.10.1186/s40168-015-0081-x [PMC free article] [PubMed] [Cross Ref]

34. Kopylova E, Navas-Molina JA, Mercier C, Xu ZZ, Mahé F, He Y, Zhou H-W, Rognes T, Caporaso JG, Knight R
2016.
Open-source sequence clustering methods improve the state of the art. mSystems
1:e00003-15. doi:.10.1128/msystems.00003-15 [PMC free article] [PubMed] [Cross Ref]

35. Schmidt TS, Rodrigues JF, von Mering C
2015.
Limits to robustness and reproducibility in the demarcation of operational taxonomic units. Environ Microbiol
17:1689–1706. doi:.10.1111/1462-2920.12610 [PubMed] [Cross Ref]

36. Schloss PD.
2016.
Application of a database-independent approach to assess the quality of operational taxonomic unit picking methods. mSystems
1:e00027-16. doi:.10.1128/msystems.00027-16 [PMC free article] [PubMed] [Cross Ref]

37. Matthews BW.
1975.
Comparison of the predicted and observed secondary structure of t4 phage lysozyme. Biochim Biophys Acta
405:442–451. doi:.10.1016/0005-2795(75)90109-9 [PubMed] [Cross Ref]

38. Pruesse E, Quast C, Knittel K, Fuchs BM, Ludwig W, Peplies J, Glöckner FO
2007.
SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res
35:7188–7196. doi:.10.1093/nar/gkm864 [PMC free article] [PubMed] [Cross Ref]

39. Baxter NT, Ruffin MT, Rogers MAM, Schloss PD
2016.
Microbiota-based model improves the sensitivity of fecal immunochemical test for detecting colonic lesions. Genome Med
8:37. doi:.10.1186/s13073-016-0290-3 [PMC free article] [PubMed] [Cross Ref]

40. Schloss PD, Schubert AM, Zackular JP, Iverson KD, Young VB, Petrosino JF
2012.
Stabilization of the murine gut microbiome following weaning. Gut Microbes
3:383–393. doi:.10.4161/gmic.21008 [PMC free article] [PubMed] [Cross Ref]

41. Johnston ER, Rodriguez-R LM, Luo C, Yuan MM, Wu L, He Z, Schuur EA, Luo Y, Tiedje JM, Zhou J, Konstantinidis KT
2016.
Metagenomics reveals pervasive bacterial populations and reduced community diversity across the Alaska tundra ecosystem. Front Microbiol
7:579. doi:.10.3389/fmicb.2016.00579 [PMC free article] [PubMed] [Cross Ref]

42. Henson MW, Pitre DM, Weckhorst JL, Lanclos VC, Webber AT, Thrash JC
2016.
Artificial seawater media facilitate cultivating members of the microbial majority from the Gulf of Mexico. mSphere
1:e00028-16. doi:.10.1128/msphere.00028-16 [PMC free article] [PubMed] [Cross Ref]

43. Schloss PD.
2010.
The effects of alignment quality, distance calculation method, sequence filtering, and region on the analysis of 16S rRNA gene-based studies. PLoS Comput Biol
6:e1000844. doi:.10.1371/journal.pcbi.1000844 [PMC free article] [PubMed] [Cross Ref]

44. R Core Team
2015.
R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

45. Ram K, Wickham H
2015.
wesanderson: a Wes Anderson palette generator. https://cran.r-project.org/web/packages/wesanderson/index.html.

46. Wickham H, Francois R
2016.
dplyr: a grammar of data manipulation. https://cran.r-project.org/web/packages/dplyr/index.html.

47. Wickham H.
2016.
tidyr: easily tidy data with “spread()” and “gather()” functions. https://cran.r-project.org/web/packages/tidyr/index.html.

48. Wilke CO.
2016.
cowplot: streamlined plot theme and plot annotations for ‘ggplot2’. https://cran.r-project.org/web/packages/cowplot/index.html.

49. Wickham H.
2009.
ggplot2: elegant graphics for data analysis. Springer-Verlag, New York, NY;
http://ggplot2.org/.

Articles from mSphere are provided here courtesy of **American Society for Microbiology (ASM)**