Existing TIA algorithms generally consist of two major modules: (i) computing pairwise distances between sequence pairs and (ii) grouping sequences into OTUs at various distance levels. In some algorithms (e.g. ESPRIT [6
], mothur [9
], MUSCLE+DOTUR [19
]) the two steps are performed sequentially, whereas in others (e.g. ESPRIT-Tree [8
], CD-HIT [7
] and UCLUST [10
]) they are performed simultaneously. In this article, we mainly focus on sequence data sets where the number of sequences is on the order of 106
(about the number currently obtained in a single 454 Titanium run). The algorithms implemented sequentially generally cannot handle such massive data because, as an intermediate step, they compute a distance matrix, the size of which is proportional to the square of the number of sequences. However, computational efficiency is not the only consideration. Many existing algorithms are tradeoffs between computational efficiency and accuracy. We present below a brief description of how each algorithm works. summarizes some existing algorithms commonly used by the microbial community for TIA.
Some existing algorithms commonly used by the microbiology community for TIA
One of the most commonly used methods in the first step (pairwise distance calculations) is multiple sequence alignment (MSA), which incorporates information about sequence homology into the distance calculation [19
]. The optimal MSA is an NP-complete problem (i.e. it has been proven to be a ‘hard’ problem that cannot be solved in reasonable time for large numbers of sequences). Although significant improvement has been made in the last decade to reduce computational complexity of MSA (e.g. MUSCLE [21
] and MAFFT [22
]), it remains computationally intractable to perform MSA on millions of sequences. In addition, the use of MSA to align hypervariable regions of 16
S rRNA gene has not yet been well justified. MSA was originally designed to infer homologous segments of input sequences with the underlying assumption that input sequences share recognizable evolutionary similarities at the level of the primary sequences. This assumption may not hold for 16
S rRNA-based studies that target hypervariable regions of rRNA genes from highly diverse microbial communities. In a simulation study presented below using real 16
S rRNA sequences derived from a seawater sample, we observed that many sequences are from distantly related OTUs with large genetic distances. MSA aims to minimize the sum of pairwise alignment scores. In order to align a large number of highly diverse sequences, the alignment quality of closely related sequences is sacrificed, leading to a severe overestimation of genetic distances and microbial diversities. Moreover, the harder MSA tries to minimize the sum of pairwise scores by aligning unrelated sequences, the larger distances between closely related sequences can be. By definition, the optimal MSA algorithm yields the lowest score. This suggests that MSA is not suitable for analyzing hypervariable regions of rRNA genes to estimate microbial diversity, even if the optimal MSA could be performed.
One possible way to address these issues with MSA is to align input sequences against a prealigned reference database. The representative methods include RDP/Pyro [5
], NAST [23
] and a profile-based algorithm included in the well-known mothur pipeline [9
]. Because a reference database can be maintained off-line, the computational complexity of these algorithms grows only linearly with respect to the number of input sequences. However, unlike generic MSA algorithms, these algorithms share a problem with those used for taxonomy-dependent analysis: their performance depends heavily on the completeness of a reference database, and also on the quality of the alignment of the sequences in that database. Since most bacterial genomes have not been sequenced yet, a large proportion of input sequences from unknown microorganisms may not be able to find significant hits and can only be aligned to distantly related reference sequences, leading to inaccurate estimates of pairwise distances. We performed a simulation study that suggests that although fixed aligners work well for known sequences, they perform poorly for novel sequences that are not closely related to those already represented in a reference database.
CD-HIT, UCLUST and ESPRIT use pairwise sequence alignment (PSA) to obtain optimal pairwise alignments of 16
S rRNA sequences and to compute similarities between sequence pairs. Whereas many MSA algorithms (e.g. MUSCLE) can only be deployed using a single processor, PSA allows for parallel computing and provides much more flexibilities in algorithm design. Moreover, simulation studies have shown that by eliminating heuristics in sequence comparison, PSA provides a much more accurate estimate of microbial richness than MSA [6
]. A frequent criticism of PSA is that currently implemented methods do not take RNA secondary structure information into consideration. However, in the benchmark study presented below, we found that, compared to profile-based MSA algorithms, excluding secondary-structure information does not have a significant impact on clustering performance.
Existing algorithms can also be categorized based on the clustering method they use to group sequences into OTUs at various levels of sequence identity. The two most commonly used methods are hierarchical clustering (HC) and greedy heuristic clustering (GHC). HC is a classic unsupervised learning technique that has been used in numerous biomedical applications [25
]. The major drawback of HC is its high computational and space complexity. The standard implementation has an O(N2)
complexity, where N
is the number of input sequences. DOTUR is probably the first published HC algorithm widely used by the microbiology community [11
]. DOTUR, however, cannot handle the extremely large data sets available today. The main reason it does not scale is that it needs to load a distance matrix into memory before clustering. Given one million reads, a full distance matrix can be as large as 7500
GB. Even if we remove duplicate sequences and sequence pairs that have a pairwise distance larger than a specified value (say 0.1), the resulting distance matrix in a sparse format can still be several hundred gigabytes in size, which is too large to be directly loaded into the memory of most computers. To address this issue, we recently developed a new clustering algorithm, referred to as Hcluster, within the ESPRIT framework to handle large-scale complete-linkage and single-linkage HC operations. Unlike conventional methods, Hcluster groups sequences into OTUs on the fly (i.e. reading one distance at a time), whereas keeping track of linkage information. A large-scale experiment was conducted that showed that Hcluster works well with one million reads [14
]. Hcluster has already been incorporated in the mothur pipeline. One limitation of Hcluster is that one still needs to generate a full or sparse distance matrix before clustering. Although ESPRIT uses complete linkage as the default option, it provides a separate function that allows users to perform average linkage, but its memory footprint is much higher than Hcluster. ESPRIT uses k
mer statistics (i.e. frequencies of ‘words’ of a specified length in the sequence) to remove unnecessary sequence alignments. However, its space and computational complexities remain O(N2).
Greedy heuristic clustering (e.g. CD-HIT [7
] and UCLUST [10
]) processes input sequences one at a time, avoiding the expensive step of comparing all pairs of sequences. Given a predefined threshold, an input sequence is either assigned to an existing cluster if the distance between the sequence and a seed (the sequence representing that cluster) is smaller than the threshold, or becomes a new seed for a new cluster otherwise. Consequently, the computational complexity of greedy heuristic clustering is O(MN)
, where M
is the number of seeds. Usually M<<N
, and hence greedy heuristic clustering is computationally much more efficient than HC. CD-HIT and UCLUST are the only two algorithms that we are aware of that can process millions of reads on a desktop computer. Mathematically speaking, greedy heuristic clustering partitions the input space into a set of closed balls where the distances between sequences and their associated seeds are smaller than a predefined threshold. However, there is no guarantee that the true clustering structure can be recovered from such partitions. Also, because sequences are sequentially processed, adjacent clusters may be overlapped. That is, the distance between a sequence and its assigned seed is not necessarily smaller than the distance between that sequence and any other seed (i.e. if the search were continued, a better match might be found). CD-HIT was originally designed to reduce the size of a large database to speed up a database search [7
]. However, it is not designed for uncovering clustering structures, and the performance for that purpose has not yet been benchmarked. In the study presented below, we show that although CD-HIT and UCLUST run several orders of magnitude faster than a HC algorithm, their ability to group sequences into the correct clusters is much worse.
We recently developed a new online learning-based algorithm, referred to as ESPRIT-Tree [8
], which simultaneously addressed the space and computational issues with conventional HC algorithms. The basic idea is to partition an input sequence space into a set of subspaces using a partition tree constructed using a pseudo-metric, then to recursively refine a clustering structure in these subspaces. As with CD-HIT and UCLUST, ESPRIT-Tree does not need to generate a distance matrix. All of the operations are executed on the fly, and the distances are computed only when they are needed. ESPRIT-Tree achieves a similar accuracy to the standard average linkage HC algorithm but with a computational complexity comparable to CD-HIT and UCLUST.