Our grouping criteria (see Methods) generated the following subsets. The substitution criterion reduced the number of 9-tuples from 262,144 to 29,868 (an approximately 8.8-fold reduction). The subsequent block criterion produced 4206 tuples, and the refining criterion finally identified the set of 1878 9-tuples that defines LPS9. In , each line represents 10,000 9-tuples. The number of LPS9 tuples per line ranges from 64 to 84. On average, one LPS9 tuple is selected for every 144 tuples of the complete set of 262,144 9-tuples. This homogeneous distribution of LPS9 within the complete set of 9-tuples allowed us to consider it a representative sample.
compiles the average number of tuples sharing identity and/or similarity under four different conditions (see Methods). We consider condition C (allowing up to two differences) to be our optimal scheme, given that under it the proportion of 9-tuples with identity and/or similarity with the genomes is between 20% and 80% of the LPS9. This scheme avoids both underutilization and saturation of LPS9: in conditions A and B the LPS9 is underutilized, whereas in condition D it reaches saturation. All subsequent similarity searches (with the exception of those in the Accuracy of different tree construction methods section) were carried out under condition C.
Number of LifePrint sets of identical and/or similar 9-tuples (LPS9) under four different similarity search schemes
The results of the similarity search included the sequences of the tuples that shared identity (no differences) or similarity (one or two differences) with subsequences in the genomes.
Each of the 1878 tuples of the LPS9 scrutinized 352 different sequences (one identical, 27 with one difference, and 324 with two differences), which means that 661,056 (1878 × 352) 9-mer sequences are searched. Given that there are 262,144 possible 9-mer sequences, every sequence is expected to be searched, on average, 2.5 times. In fact, a genome 300 nucleotides in length, ie, containing 292 9-mer subsequences, is covered by 605 tuples of the LPS9, which corresponds to an average of 2.05 tuples per sequence.
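The counts quoted above can be checked with a short calculation. The following Python sketch is only an illustration and is not part of the LifePrint software; the function names `neighborhood_size` and `neighbors` are hypothetical. It assumes that a "difference" is a single-nucleotide substitution, so that the sequences within d differences of a 9-tuple number C(9, d) × 3^d.

```python
from itertools import combinations, product
from math import comb

ALPHABET = "ACGT"
K = 9  # tuple length used by LPS9


def neighborhood_size(k: int, max_diff: int, alphabet_size: int = 4) -> int:
    """Closed-form count of sequences within max_diff substitutions of a k-tuple."""
    return sum(comb(k, d) * (alphabet_size - 1) ** d for d in range(max_diff + 1))


def neighbors(tuple_seq: str, max_diff: int) -> set[str]:
    """Explicitly enumerate every sequence within max_diff substitutions."""
    found = set()
    for d in range(max_diff + 1):
        for positions in combinations(range(len(tuple_seq)), d):
            # At each chosen position, try the three other nucleotides.
            choices = [[c for c in ALPHABET if c != tuple_seq[p]] for p in positions]
            for replacement in product(*choices):
                seq = list(tuple_seq)
                for p, c in zip(positions, replacement):
                    seq[p] = c
                found.add("".join(seq))
    return found


size = neighborhood_size(K, 2)   # 1 + 27 + 324
searched = 1878 * size           # sequences searched by the whole LPS9
total = len(ALPHABET) ** K       # all possible 9-mers
print(size, searched, total, round(searched / total, 2))
# prints: 352 661056 262144 2.52
```

The explicit enumeration in `neighbors` is included only as a cross-check of the closed-form count; for any 9-tuple it returns 352 sequences.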
LifePrint has an advantage in comparison with the original k-tuple distance methods, given that, under the optimal similarity search scheme, the tuple length of nine nucleotides provides complete coverage of the analyzed sequences.
shows that in the first 80 nucleotides of the Hop stunt viroid, six 9-mer subsequences were not detected directly by LPS9. The region with the lowest coverage (nucleotides 34–46, marked in blue) has a high adenine (A) content. In addition, this region contains three of the six subsequences that were not detected directly by LPS9 (beginning at nucleotides 32, 39, and 42). Only two 9-tuples (AATAAAAGA and GAAAAAAAG) shared similarity with this region. Even though LPS9 did not detect all possible 9-mer subsequences directly, our 9-tuple set was capable of fully covering the genomic sequences in this study. Every position within the genomes was recognized by several tuples, which increases the sensitivity to detect simple changes. This property is particularly relevant in the case of single nucleotide substitutions.
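The distinction between subsequences that are not detected directly and positions that are nevertheless covered can be illustrated with a minimal Python sketch. It rests on two assumptions that are not stated in this section: that a 9-mer window counts as detected directly when at least one tuple lies within two differences of it, and that a nucleotide counts as covered when any detected window spans it. The function names and the 10-nucleotide toy sequence are hypothetical; only the two tuples are taken from the text.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def window_hits(genome, tuples, k=9, max_diff=2):
    """For each k-mer window, list the tuples within max_diff differences."""
    return [
        [t for t in tuples if hamming(t, genome[start:start + k]) <= max_diff]
        for start in range(len(genome) - k + 1)
    ]


def position_coverage(genome, tuples, k=9, max_diff=2):
    """Count, per nucleotide, the tuple matches whose window spans it."""
    cover = [0] * len(genome)
    for start, matched in enumerate(window_hits(genome, tuples, k, max_diff)):
        for pos in range(start, start + k):
            cover[pos] += len(matched)
    return cover


# Hypothetical toy sequence (not a viroid genome); tuples quoted in the text.
toy_genome = "AATAAAAGAC"
toy_tuples = ["AATAAAAGA", "GAAAAAAAG"]

hits = window_hits(toy_genome, toy_tuples)
undetected = [start for start, matched in enumerate(hits) if not matched]
coverage = position_coverage(toy_genome, toy_tuples)
print(undetected)  # prints: [1]
print(coverage)    # prints: [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
```

In this toy case the window starting at index 1 has no matching tuple, yet eight of its nine nucleotides are still covered by the match of the neighboring window, which is the behavior described above on a much smaller scale.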
Detection of single base repeats
Given that the design of LifePrint implies the selection of sequences with a minimum of entropy, LPS9 is not able to identify regions of low complexity (ie, sequence repeats) directly. The results obtained with the repeat model sequences (see Methods) show that 9-tuples containing seven identical nucleotides (two with seven A, one with seven T, two with seven C, and one with seven G; , all in bold type) were capable of detecting subsequences of nine consecutive identical nucleotides.
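The reason a tuple with seven identical nucleotides can detect a run of nine follows from the two-difference allowance of condition C: such a tuple differs from the nine-nucleotide repeat at exactly two positions. A minimal Python sketch of this check is given below; the function name `detectable_repeats` is hypothetical, and the example tuples are the two quoted earlier in this section, each of which contains seven A.

```python
def detectable_repeats(tuple_seq: str, max_diff: int = 2) -> list[str]:
    """Bases X such that the tuple is within max_diff differences of X repeated.

    The number of differences from a single-base repeat equals the number of
    positions in the tuple that do not carry that base.
    """
    k = len(tuple_seq)
    return [base for base in "ACGT" if k - tuple_seq.count(base) <= max_diff]


print(detectable_repeats("AATAAAAGA"))  # prints: ['A']
print(detectable_repeats("GAAAAAAAG"))  # prints: ['A']
print(detectable_repeats("ACGTACGTA"))  # prints: []
```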
How important are low complexity regions to an accurate phylogeny reconstruction? What happens if these regions are not included in the analysis? To answer these questions, it is important to distinguish whether or not the region of interest encodes protein products. Coding and noncoding regions tend to evolve by different mechanisms. When genomic regions encode proteins, even those comprising repeated regions, changes are, in principle, constrained to synonymous substitutions or mutations producing conserved amino acids. In these cases, it is important to consider such regions in the estimation of evolutionary distances, although their impact also depends on the proportion of the genome that they represent and on whether they are present in the other genome sequences. However, noncoding regions are not exposed to the same evolutionary pressures as coding regions. Most mutations are neutral in the noncoding zones, but some substitutions may follow complex evolutionary mechanisms (eg, covariation), as in the case of noncoding sequences important for other functions (eg, regulation of gene expression).
Here we selected dLog to identify shared tuples between genomes independently of the frequency with which tuples occur in the viroid genomes and of the length of the particular genome. shows the results obtained from the independent and successive approaches (see Methods section). Under the independent approach we obtained a value of 0 for a variant with a substitution at the 3′ end.
k-tuple distance values for single-substitution variants
Analysis of variants with substitutions or eliminations located at the 5′ or 3′ ends revealed that only in a few cases were point changes at the sequence ends not detected directly by the LPS9. This result indicates that the ability to discern between variants was not affected. shows k-tuple distance values between the variants mentioned above and the Citrus II viroid genome.
k-tuple distance values for variants with single substitutions or eliminations located at the ends of sequences
In , the average k-tuple distance for a single substitution ranges from 0.00378 to 0.00390. We examined the list of tuples involved in detecting a single substitution that yields a k-tuple distance within this interval. We selected the simulated variant 144 A→G, which presents a k-tuple distance of 0.00390 in relation to the Citrus II viroid genome. In , we list the tuples that detected the substitution of A by G at position 144. Note that 20 tuples are distinctive at this position, 15 for A and five for G. explains graphically how LPS9 efficiently detects single substitutions.
Differential detection of variants with a single substitution that implies an average k-tuple distance
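One way to picture how tuples become distinctive for one nucleotide or the other at a substituted position is sketched below in Python. This is an illustrative simplification and not the procedure used to build the list above: it assumes that a tuple detects a window when it lies within `max_diff` differences of it, and it only inspects the windows that overlap the substituted position. The function names, the toy sequences, and the toy tuple set are all hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def hits_at_position(seq, tuples, pos, k=9, max_diff=2):
    """(window start, tuple) pairs for windows overlapping pos (0-based)."""
    first = max(0, pos - k + 1)
    last = min(pos, len(seq) - k)
    return {
        (start, t)
        for start in range(first, last + 1)
        for t in tuples
        if hamming(t, seq[start:start + k]) <= max_diff
    }


def distinctive_hits(reference, variant, tuples, pos, k=9, max_diff=2):
    """Hits found only in the reference and hits found only in the variant."""
    in_ref = hits_at_position(reference, tuples, pos, k, max_diff)
    in_var = hits_at_position(variant, tuples, pos, k, max_diff)
    return in_ref - in_var, in_var - in_ref


# Hypothetical toy data: an A-to-G substitution at 0-based position 8.
toy_reference = "CCCCAAAAAAAAACCCC"
toy_variant = "CCCCAAAAGAAAACCCC"
toy_tuples = ["AAAAAAAAA", "AAAAGAAAA", "CCCCCCCCC"]

# Exact matching (max_diff=0) keeps the toy example easy to follow by hand.
only_ref, only_var = distinctive_hits(
    toy_reference, toy_variant, toy_tuples, pos=8, max_diff=0
)
print(only_ref)  # prints: {(4, 'AAAAAAAAA')}
print(only_var)  # prints: {(4, 'AAAAGAAAA')}
```

With the default `max_diff=2` the same functions report every hit that is gained or lost through the substitution, which is the situation summarized by the counts of tuples distinctive for A and for G.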
To estimate the limit on the degree of relatedness at which LifePrint can still distinguish between two sequences, we examined the results depicted in , which indicate that LifePrint reaches saturation at approximately 40% of substitutions. It is expected that when a critical number of variants is included in the phylogenetic study, a given variant that is considerably distant from one sequence will be closer (ie, more similar) to some other variant. Therefore, the k-distance saturation should not be a limitation for the construction of trees when many strains are included in the analysis.
Ability of LifePrint to distinguish between sequences with different degrees of relatedness
Evaluation of accuracy
In , we summarize the results of symmetric difference comparisons between 12 different NJ trees and the true tree (see Construction of trees in Methods section). These results indicate that neither the method based on 9-tuples nor the method based on 5-tuples recovers the true tree. These findings also illustrate that the metric used for the k-tuple distances affects the accuracy of tree reconstruction. See for a graphic representation of topologic comparisons between the true tree and the LPS9 and 5-tuple NJ trees constructed with dPear.
Symmetric difference values between the true tree and neighbor-joining trees constructed from k-tuple distances based on three different distance metrics
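For readers unfamiliar with the measure, the symmetric difference between two tree topologies (often called the Robinson-Foulds distance) counts the bipartitions of the taxa that appear in one tree but not in the other. The Python sketch below illustrates this definition on trees written as nested tuples. It is an assumption-laden illustration, not the software used to obtain the values in the table: the function names and the five-taxon toy trees are hypothetical, and trees are treated as unrooted.

```python
def leaf_set(tree) -> frozenset:
    """All leaf names of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree) -> set:
    """Nontrivial bipartitions, each stored as the side lacking a fixed taxon."""
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)  # fixed taxon used to pick one side consistently
    found = set()

    def walk(node) -> frozenset:
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(walk(child) for child in node))
        side = all_leaves - below if anchor in below else below
        # Keep only splits with at least two taxa on each side.
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        return below

    walk(tree)
    return found


def symmetric_difference(tree_a, tree_b) -> int:
    """Number of bipartitions present in exactly one of the two trees."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must contain the same taxa")
    return len(splits(tree_a) ^ splits(tree_b))


# Hypothetical five-taxon trees.
true_tree = (("A", "B"), ("C", ("D", "E")))
test_tree = (("A", "C"), ("B", ("D", "E")))
print(symmetric_difference(true_tree, true_tree))  # prints: 0
print(symmetric_difference(true_tree, test_tree))  # prints: 2
```

A value of 0 means that the two topologies are identical, so any nonzero value in the table indicates that the corresponding NJ tree departs from the true tree in at least one bipartition.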
Visual inspection of the NJ trees from the 36 real viroid genomes indicates that the bootstrap support values are higher for trees reconstructed with the LPS9 method than for trees derived from the 5-tuple method. For both 5-tuple trees and LPS9-based trees, low bootstrap proportion values (less than 30%) were observed in those clades inconsistent with the true tree. Therefore, it seems that the bootstrap test can be used confidently as a proxy to evaluate the accuracy of the reconstruction.
and indicate that although both reconstructions are consistent with each other, bootstrap proportion values are higher for the tree based on 9-tuples. The major viroid families were identified by our 9-tuple method, although some clusters are organized differently, as in the case of the Avsunviroidae members. However, such conflicts are associated with relatively low bootstrap confidence values.
In our view, branch length comparisons between trees are critical when topologies have been estimated using the same optimization criteria. However, in this particular case, we are evaluating only different k-tuple distance methods (LifePrint versus 5-tuple). Including trees from character-based methods, such as maximum likelihood or maximum parsimony, would in our assessment produce uncertain comparisons, given that each optimization criterion reflects a different measure of change in the branch lengths. For the purposes of this work, we consider it adequate to restrict the topologic comparison to trees obtained with k-tuple distance methods.