The simulation program that is most similar to INDELible is DAWG (Cartwright 2005
). Although DAWG does not have some of the advanced features of INDELible, it is possible to simulate data under the same nucleotide-substitution models to make a fair comparison. Thus, we conducted a computer simulation to examine the computational efficiency of the two simulation programs. Sequence data were simulated under the HKY model, with κ
= 2 and base frequencies 0.4 (T), 0.3 (C), 0.2 (A), and 0.1 (G). In the basic model, we set the insertion and deletion rates to λI
= 0.1 per substitution, with the indel length following a negative binomial distribution with r
= 1 and q
= 0.25 (the geometric distribution). The phylogenetic tree was symmetric with 32 taxa, with all branch lengths set to 0.1 substitutions per site. Substitution rates over sites were either constant or follow the gamma distribution with shape parameter α
= 1. The number of replicate data sets is 100. We then explored several variations of the basic simulation scheme to examine the impact of various factors on the simulation efficiency, such as the number of taxa, the insertion–deletion rate ratio λI
, the amount of evolution measured by the branch length, the average indel length, and the sequence length at the root. INDELible (methods 1 and 2) and DAWG were used to generate the data. The results are shown in .
FIG. 2.— Speed comparison between DAWG and INDELible, with and without continuous gamma rate heterogeneity. The basic simulation model is specified by the settings in , whereas one factor is varied to see its impact in each plot. INDELible1 and INDELible2 (more ...)
DAWG is faster than INDELible in simple circumstances, such as simulating short sequences with low insertion rate and small insertions on small trees with few taxa and short branches. However, with the increase in the complexity of the simulation, the time taken by DAWG increases much faster than by INDELible. The exception to this pattern is simulation using INDELible method 2, which is sensitive to the average branch length as longer branches mean simulation of more rounds of exponential-waiting times in the algorithm. However, method 2 has a speed advantage over method 1 and DAWG for simulation under the continuous gamma model of variable rates among sites. Under this model, every site has a distinct rate, so that the transition probability matrix P(t) needs to be calculated for every site on every branch. In contrast, the transition matrix of the jump chain (M in method 2) is the same for all sites and does not need to be calculated for every site, leading to an increase in computational efficiency.
Speed differences between INDELible and DAWG are largely a matter of programing design. Both programs are written in C++, and both programs store sequence information in the vector container from the standard template library. INDELible implements insertions via a modified lookup table whose execution time is mostly independent of the complexity of the simulation but can be slow in very simple simulations. In contrast, DAWG implements insertions via the C++ function vector::insert, the speed of which is proportional to the number of elements inserted (copying) plus the number of elements between the insertion position and the end of the vector (moving).