Providing bootstrap support scores is standard practice in phylogenetic reconstruction from sequence data. However, the classic bootstrap cannot be applied directly to rearrangement data because the collection of permutations forms a single character—a single rearrangement or duplication can affect an entire permutation. In the world of sequence data this is equivalent to an alignment with a single column, albeit one where each character can take any of a huge number of states. Two approaches have been suggested in the past, but do not match the accuracy of the classic bootstrap. One, proposed by Shi et al.
], is a jackknifing technique; the second, proposed by us, uses perturbations in the permutations with known effects on the distance
We design different methods for rearrangement data and devise analogous methods for sequence data (if they do not exist) and vice versa. We study their behavior with both kinds of data with the aim of developing a method for rearrangement data that is as successful as the classic bootstrap is for sequence data. For a method M
that operates on sequence data, we denote by M
the corresponding method for rearrangement data; we use regular font to denote existing methods, bold font to denote the new methods described in this paper.
The methods we present here for rearrangement data rely on our distance estimator
] and so must be used with distance-based reconstruction methods. Our distance estimator computes the estimated true distance between two multichromosomal genomes, replying on statistical inference of only the number of shared adjacencies and the number of linear chromosomes. This limited view of the input data is crucial, as many of the sampling approaches we describe below do not produce valid genome permutations (e.g., because of additional copies of adjacencies), yet still allow us to tally the number of linear chromosomes and of shared adjacencies.
Our robustness estimator based on distance perturbation
], hereafter denoted BP*, permutes each leaf genome through a (randomly chosen) number of random rearrangements, estimates the new pairwise distances, then subtracts from each pairwise estimate the number of rearrangement operations applied to each of the two genomes. The number of operations applied to each genome is chosen from a Gaussian distribution, and so, for each genome, is potentially different. If x
operations are applied to leaf i
to yield leaf i′
operations are applied to leaf j
to yield leaf j′
(where leaves i
are in the inferred tree and leaves i′
in the bootstrap), the expected distance between i′
is increased by (x
) compared to the distance between i
. To keep the expected pairwise distance after perturbation close to the distance between the corresponding pair of leaves before perturbation, we set the final (perturbed) distance B
] for more details. Thus BP* relies on additivity, a property likely to be respected with rearrangement data due to its huge state space. We can design an equivalent for sequence data: for each sequence, apply some random number of possible mutations under a chosen model, then estimate all pairwise distances under the same model, and finally subtract from that estimate the number of mutations applied in the perturbation step to each of the two sequences—a method we denote BP
is less reliable than BP*, as it is much more likely that some of the mutations used in the perturbations cancel each other or cancel some of the mutations on the edit path between the two sequences.
We can view the classical bootstrap for sequence data (hereafter denoted BC) in terms of noise generation. The original multiple sequence alignment gives rise to a distance matrix D. Each replicate dataset created by sampling columns with replacement from the alignment also gives its corresponding matrix B of perturbed pairwise distances. An entry of the replicate matrix corresponding to leaves i and j can thus be written as B(i,j)=D(i,j) + N(i,j) where N(i,j) denotes the perturbation in the distance introduced by the resampling. This noise parameter is hard to characterize exactly, but it leads us to define bootstrapping approaches based on producing increasingly refined estimates of the noise. (In that sense, BP* and BP attempt to shape the noise by returning to the underlying evolutionary process of rearrangement or mutation.)
Bootstrapping by adding Gaussian Noise (hereafter denoted BGN), adds Gaussian noise of mean 0 to each entry in the distance matrix. The standard deviation is empirically determined to match as well as possible the noise added by BC. Since the noise added during the sampling process in BC is not random, this is a very rough estimate, but a useful comparison point. In the replicate matrices produced by BC, the noise N(i,j) depends on the pairwise distance D(i,j), so the next step is to design a bootstrap method based on pairwise comparisons, hereafter denoted BPC. The bootstrap matrix B(i,j) for BPC is constructed by calculating the perturbed pairwise distance for each pair: for each pair of sequences i,j, we construct a new pair of sequences i′,j′by sampling columns with replacement, where each column has only two characters and set B(i,j)=D(i′,j′).
An equivalent method BPC* can be designed for rearrangement data, albeit with some complications. Since our distance estimator relies on the number of shared adjacencies, a natural choice is to sample adjacencies in the genome. While the evolution of a specific adjacency depends directly on several others, independence can be assumed if we assume that once an adjacency is broken during evolution it is not formed again—an analog of Dollo parsimony, but one that is very likely in rearrangement data due to the enormous state space. For each pair of genomes i,j, we construct two new pairs of genomes. We sample adjacencies from genome i with replacement and use only these adjacencies to compute the distance D1(i,j) of leaf i to leaf j. (Note that some adjacencies may be overcounted and some omitted.) Then we sample adjacencies from genome j with replacement and use only these adjacencies to compute the distance D2(i,j) of leaf j to leaf i. Finally, we set B(i,j)=(D1(i,j) + D2(i,j))/2.
The noise N(i,j) may depend not just on the pairwise distance D(i,j), but also on other distances in the tree, since BC samples columns with replacement for all leaf sequences at once. The next step in modeling N(i,j) is thus to sample from all adjacencies (including telomeres). The total number of possible adjacencies (including telomeres) for n syntenic blocks is roughly 2n2, but in a given genome there are at most 2n adjacencies and each adjacency conflicts with at most 4nother adjacencies. Thus, for large genomes, we may assume that adjacencies are independent (if rearrangements happen randomly), just as columns of an alignment are assumed to be independent in BC. We can now mimic closely the sampling procedure of BC in a rearrangement context, producing procedure BC*. From the list of all possible adjacencies, BC* samples with replacement to form a collection of adjacencies; only adjacencies in this collection are then considered in counting the number of shared adjacencies and then estimating the true evolutionary distances between genomes. (Note that some shared adjacencies are counted more than once due to the sampling with replacement).
We know that classical bootstrapping (BC) is comparable in performance to parsimony jackknifing (which we denote PJ) in the sequence world. PJ is (asymptotically) equivalent to sampling with replacement (as in BC), but without overcounting, that is, when sampling gives a column that has been previously selected, it is not added to the replicate. Thus we can obtain the equivalent of PJ for rearrangement data, call it PJ*
: selected adjacencies are not counted more than once for computing the number of shared adjacencies between leaves. Other versions of jackknifing are similarly easy to design. For instance, a d%
JK) omits d%
of the columns to create a replicate, so, from the set of all adjacencies (in all the leaf genomes) a d%
) deletes d%
of the adjacencies at random and only the remaining adjacencies are used in estimating the true pairwise distances. In contrast, the previous jackknifing approach for rearrangement data, developed by Shi et al.
], produces replicates by deleting syntenic blocks from the genome: a d%
-jackknife, in their method, produces a dataset where d%
of the markers are randomly deleted from all leaf genomes. The authors recommend setting d
=40; we call the resulting method JG*. Note that our approach to jackknifing deletes adjacencies instead of markers.
In summary, we have designed a bootstrapping procedure, BC*
, that closely mimics the classic bootstrap for phylogenetic reconstruction, BC, and jackknifing procedures, dJK*
(including, as a special case, PJ*
), that closely mimic the d%
−jackknife (and parsimony jackknife PJ). Along the way, we have also designed less refined versions of bootstrapping and their equivalents for sequence data. In our experiments, we use all of these, plus JG*, the marker-based jackknifing approach of Shi et al.
, plus BP*, our earlier approach based on introducing perturbations, in the permutations, of known effect on the distances
]. A summary of all the methods can be found in Table
A summary of all the methods