As expected, Simrank was able to search bio-polymer databases in less time than local alignment search tools. Simrank was 10X to 928X faster than the BLAST tools in finding similarities among DNA, RNA and proteins. The rapid delivery of results is enabled by the simplistic calculation requiring no bottleneck alignment steps. Since SSAHA2 employs a hybrid strategy of building pair-wise alignments but only against records achieving significant k-mer identities, it was expected to exhibit speeds between BLAST and Simrank. This prediction was observed in Figure -top where Simrank is shown to be only 1.5X to 158X faster than SSAHA2 when tested against public DNA and RNA datasets. SSAHA2 was unable to search protein databases. Simrank and BLAST lagged behind megaBLAST and SSAHA2 when searching shuffled DNA sequences (i.e. synthetic dataset), but were able to find distant relationships missed by the others. SSAHA2 and megaBLAST require larger seeds to elicit alignments and thus searches terminated quickly. Conversely, Simrank and BLAST examined each 7-mer in each query requiring more compute time but enabling distant similarity reporting.
The method of hit count measurement displayed in Figure -bottom presents serious drawbacks. Similarity scales across the tools are not strictly equivalent (as noted in Figure and in "Usage Considerations"), therefore, a 90% match has not the same meaning in Simrank as it may have in the context of an alignment-based score. Comparison of different scales with a uniform threshold does not convey the true sensitivity of Simrank. In order to more directly address the question of sensitivity, a test was conducted to determine the ability of Simrank to find homologues with 97% identity, a popular cutoff for Operational Taxonomic Unit (OTU) boundaries used in molecular microbial ecology [
20]. Figure demonstrates the capacity of Simrank's similarity measure to find appropriate database subjects with a reasonable number of false positives and false negatives despite the difference in scoring scales. This approach allows calibration of Simrank and definition of appropriate thresholds. For example, to find query-subject pairs with 97% full-alignment identity within the 16S dataset, one could utilize a Simrank k-mer size of 8 and score threshold of 84.6% to realize a true positive rate of 95.00% with a corresponding false positive rate of just 00.05%. This means that Simrank matches with over 84.6% 8-mer identity will cover 95% of the BLAST hits but will also match a very small number of strings not found by BLAST.
Although not included in the Figure , we observed that BLAST and SSAHA2 database formatting procedures are faster than Simrank's. For this reason we suggest using BLAST or SSAHA2 for exploratory sequence comparison since trial-and-error databases can be created and destroyed rapidly, but to select Simrank for persistent datasets where various queries will be compared to a fixed set of sequences. Consequently, the Greengenes web service [
13] utilizes Simrank as the search engine for sequence comparison and taxonomic classification of arbitrary user sequences against a reference data set.
Simrank can run in stand-alone mode or as a PERL module within a simple or complex pipeline. The components are modular so various phases of a pipeline can separately encode databases, initialize search factories in memory, and/or process queries as batches or data streams. Simrank accepts user parameters to filter results by depth and/or percent similarity. This is an advantage in high-throughput environments over BLAST, for instance, since post-processing filtering scripts are not needed.
Simrank may allow recovery of useful information from error-laden sequences. A current problem in the popular pyrosequencing technique is the reporting of long homopolymers not verifiable by traditional sequencing techniques [
31]. Simrank eliminates the effect of sequence discrepancies arising solely from homopolymer exaggeration. For instance, a run of 7 consecutive A's can be recorded as one unique 6mer. Thus, if the only polymorphism differentiating two query sequences is the length of an unsubstantiated homopolymer, their Simrank scores against a database will be equivalent.
While this manuscript was under review, another k-mer leveraging software package, UCLUST/SEARCH [
32] was published. Although it is not open-source and requires a paid license for 64-bit versions or commercial use, it does have potential to be highly useful for rapid k-mer searches as well as sequence alignments.
Usage considerations
From observations summarized in Figure , it is advised that Simrank is not suitable for searching randomly shuffled DNA, marginally suitable for matching proteins or strings of highly variable content such as group I self-splicing introns where similarity is limited to only two short spans [
33]. Simrank is well-suited for searching variants of full-length homologous strings such as 16S rRNA genes, partial-length homologous strings such as those created by Roche-454 sequencing technology, and variants of eukaryotic internal transcribed spacer regions.
Simrank similarity scores are not equivalent to alignment percent similarities. For example, Figure displays differences in similarity scores observed when a single DNA sequence collection [
26] is compared to a reference database using Simrank versus the alignment-based F84 scoring distance [
27]. Alignment identities of 90% can produce Simrank identities of 55-70%, and conversely, Simrank identities of 90% can produce alignment identities of 93-99%. The differences are caused by two factors. First, one sequence may contain repetitive k-mers at disjointed positions leading to a perceived increase in similarity, and second the spatial distribution of mismatches can lead to divergence of Simrank and BLAST scoring. For example if every 1 in 7 bases are mismatched in a pair-wise alignment, then Simrank using 7-mers would report a 0% similarity where BLAST would conclude 86% similarity. Thus, tuning k-mer length to the expected frequency of mismatches may result in application-adapted search sensitivity.
Levels of significance for hits to protein sequences should be established based on known reference sets. Protein strings are generally shorter than gene strings and their similarity patterns are often single conserved amino acid positions separated by one or two variable positions. The search for 4-mer similarities within the GP120 protein dataset revealed this difficulty. The BLASTp alignment procedure, although 28X slower, was nearly twice as sensitive compared to Simrank.
Furthermore, since each k-mer is compared across sequences without regard to their relative position in the sequences, Simrank is insensitive to continuous and non-continuous patterns within the sequence such as sites of potential secondary structure. As with all inter-sequence comparisons, search results decline in significance when comparing a very short versus a long sequence. Users can set lower length limits to avoid misleading match pairs.
As noted in Table , the language search comparison encountered 61 unique characters in the institute names but the complexity was reduced to 46 characters for BLAST and SSAHA2. BLAST and megaBLAST were able to find twice as many matches than Simrank but the significance of these hits are questionable since BLAST's local alignments allow one word such as "University" to produce high-scoring pairs. Of the tools, only Simrank tested the entire string for similarity.
Simrank search results across databases composed of strings with repetitive elements can be refined by setting the k-mer length to exceed the repeat length. Any repetitive k-mers within a string are counted only once since only the unique counts are used to create the quotient. In this case, Simrank percent similarity scores would be inflated relative to BLAST.
Future work
Common tasks in molecular microbial ecology may be facilitated with Simrank. Applications include dataset de-replication, sequence clustering, and rapid classification. In upcoming versions, we plan to provide options to reduce database file sizes and memory requirements for constrained alphabets. For instance non-ambiguous DNA can be encoded with 2 bits for each base instead of 8. To further increase speed during batch queries, a non-redundant strategy will be made available allowing a pre-screen of the batch to identify all unique k-mers before reading offset arrays from disk. This will prevent common k-mers from inducing repetitive file reads. Because strings within biological query sets can often contain similar k-mers, we estimate a >5-fold speed increase. To increase the ability to filter hits from a large databases of various length strings, a significance score can be added which considers the likelihood of a percent similarity score given the number of total unique k-mers in the query-subject comparison. This feature will generally down-weight matches from short strings compared to long strings with equivalent percent k-mer identities. Lastly, Simrank can be extended to store and output the string coordinates where k-mers match, should that become desirable. The computationally intensive k-mer tally procedure was written in C for speed but the IO and formatting is written in PERL for easy adaptations and extensions by computational biologists. It is the authors' intentions that other bioinformaticians will be able to improve the open source code where necessary to meet the needs of their projects. Please contact us if you would like to have your changes reflected in the distributed version.