Simulations were performed to explore the power and performance of k-base encoding as well as the k-base encoding local alignment algorithm (see Methods). Reads were simulated using sequences both with a uniform error-rate, as well as using sequences with an error-rate modeled after real-world data. In this discussion,"1-base encoding" and "no encoding" are used interchangeably. The power to align a sequence with or without SNPs is defined as the fraction of reads where the read sequence is aligned to the reference with the same alignment score as if the sequence were aligned to the correct location (see Methods). The fraction of reads where the read is aligned to call a SNP, and where the original sequence had no SNP, defines the false positive SNP discovery rate. Similarly, the fraction of reads where the read is aligned not to call a SNP, and where the original sequence had a SNP, defines the false negative SNP discovery rate.
Plotted in Figure is the power to align encoded sequences of varying length (25, 50, and 75 bp) with 0-2 SNPs or base substitutions given a fixed uniform error rate for k = 1...5 (see Methods). The power decreases similarly for all read lengths as the error-rate increases given a fixed number of SNPs. Furthermore, the power decreases substantially when the number of SNPs in the sequence is increased at a fixed error-rate and read length. The power of 1-base encoding, observing each base directly, does not diminish as much as k-base encoded sequences (k > 1) when more SNPs are introduced. This is due to SNPs and observational errors being equivalent in the 1-base encoding case. In these simulations, k-base encoding is more powerful when k > 2 for 0 SNPs, when k > 3 for 1 SNP, and when k > 4 for 2 SNPs and longer read lengths. It is important to note that in many cases when the alignment score between the best alignment and correct alignment differ, the decoded base sequences match. Therefore, the false positive and false negative SNP discovery rates are examined later.
Power of k-base encoding. Power calculated as the fraction of reads that correctly align. 10, 000 simulated reads from the E. Coli genome were generated.
To assess the power of k-base encoding utilizing real-world error rates, the color error (encoding error) rates were estimated from a previous run of an ABI SOLiD v2 sequencer (see Methods). The power of aligning such sequences was assessed in the presence of 0-2 SNPs for k = 1...5 (Table ). For sequences with no SNPs, the power of k-base encoding increases as k increases. However in the context of an increasing number of SNPs and for k > 1 the power of k-base encoding is more ambiguous. For one SNP, k = 2 performs more poorly than no encoding, while k > 2 improves on the lower width encodings. For two SNPs, both k = 2 and k = 3 perform more poorly than no encoding, with only k > 3 having better power than no encoding. Thus, error correction is increased with greater k when no SNPs are present. However, if the goal is to find variants, a large enough k must chosen carefully to justify the penalty in performance.
Power of k-base encoding assuming a real-world per-base error-rate
The false positive SNP discovery rate is evaluated for 25, 50, and 75 base-pair reads (Figure ). With no encoding, SNPs and errors are not distinguishable, and therefore k = 1 is omitted from this discussion. As expected, the false positive SNP discovery rate decreases as k increases. Nevertheless, only above a five percent error rate does 2-base encoding begin to find false SNPs, and at approximately ten percent error rate do all encodings considered begin to falsely discover SNPs. Assessed in Figure is the false negative SNP discovery rate. Similarly to Figure , the false positive SNP discovery rate, the false negative SNP discovery rate decreases as k increases. For low error rates, both of the above metrics are either zero (for Figure ) or are less than 20% (for Figure ). Thus, the settings can be interpreted as being conservative, sacrificing power to find true SNPs for decreasing the false positive SNP discovery rate.
False SNP discovery rate for k-base encoding. False positive SNP discovery rate calculated as the fraction of reads that have a SNP call after alignment when no SNP call is expected. 10, 000 simulated reads from the E. Coli genome were generated.
Figure 3 False negative SNP discovery rate for k-base encoding. False negative SNP discovery rate calculated as the fraction of reads that do not call a SNP after alignment when a SNP call is expected. 10, 000 simulated reads from the E. Coli genome were generated. (more ...)
To illustrate the flexibility of k-base encoding to be tuned for specific scenarios, the power of 5 base encoding is examined when the score for a color substitution is varied, and for 25, 50, and 75 base-pair reads with 0 - 2 SNPs. Intuitively, the various color substitution scores correspond to preferring a given number of color errors over a SNP and possibly fewer color errors (see Methods). As the color error score decreases, the encoding begins to prefer decoding with SNPs rather than with color errors. The various scoring schemes allow for a clear trade off in power to detect color errors over the power to detect SNPs (Figure ). For example, the color substitution score of -25 allows the full correction of 50 bp reads with up to a 20% error rate. However, once a SNP is introduced it has zero power. Alter-natively, the color substitution score of -200 finds SNPs in the low error case, but with higher error data the power to detect only the given SNP(s) is confounded as more SNPs are falsely detected. With zero SNPS, color scores of -25, -50, and -75 have almost perfect power.
Flexibility of scoring systems for 5-base encoding. Power of scoring system evaluation for 5 base encoding. 1, 000 simulated reads from the E. Coli genome were generated.
The performance time of k
-base encoding scales exponentially with increasing k
(see Methods). This is confirmed by plotting the timing information from the previous uniform error-rate simulations (Table ). Since the optimizations found in Homer et al. (2009) [10
] are not pursued, increased variability was observed when the number of SNPs increased. However, an exponential increase in running time occurs as the encoding width (k
) increases. This has practical implications, whereby producing empirical data that is k
-base encoded is possible, but is computationally infeasible to decode as the number of short-reads is in the millions, if not billions for typical experiments.
Performance of k-base encoding
Nevertheless, this exponential increase in running time could be significantly reduced at the cost of completeness by using methods initially adopted for protein similarity search and sequence comparison [14
]. A global search strategy is employed to put constraints on the possible search paths thereby significantly reducing the search space. Typically these constraints force the alignment path to proceed along diagonals in the dynamic programming matrix, reducing the dimensionality of the problem. These constraints are specifically used in the implementation of BFAST for both 1-base and 2-base encoded data [10
]. The positions where the read and reference match during the BFAST's global search strategy are annotated and aggregated such that in the local alignment the read and reference must match (i.e. constraining the solution path to pass through a diagonal). In BFAST's case, the global search strategy searches over an 2-base encoded reference for k
= 2 constraining the color transitions. The constraints employed by BFAST could be applied for k
greater than two, yielding significant but unknown speed improvement.