We undertook this study to investigate the relationship between the number of gapped sites in a sequence alignment and the accuracy of phylogenetic inference, and furthermore, to understand the impact of different gap treatment methods, phylogenetic inference methods, the ratio of insertions to deletion events in the evolutionary history of the sequences, and other sequence parameters such as sequence length and the transition-transversion rate ratio, on this relationship. Using the computer program, Dawg version 1.2 [

30] we simulated DNA evolution along a 16-taxon model tree (Figure ), incorporating both nucleotide substitution events and insertion and deletion (indel) events (the latter as a function of the substitution rate.). The resulting DNA alignments were then subjected to three gap treatment methods, namely, MD, BC, and ML

*ε*, and the phylogenetic analysis was done using popular phylogenetic inference methods – distance (NJ), parsimony (MP), likelihood (PhyML) and Bayesian analysis.

A remarkable result in this study is the strong, almost deterministic, dependence of the accuracy of phylogenetic inference on the percentage of gapped sites in the alignment, irrespective of the inference method, gap treatments, or insertion-deletion rate ratio, when the percentage of gapped sites was high (Figure ). This made the assignment of gap thresholds for specific levels of phylogenetic accuracy fairly straightforward, without being necessarily concerned with other determinants of phylogenetic accuracy. It was only at lower gap levels that the relationship was not as straightforward, and other factors (e.g., substitution rate) began to play a part in directly influencing the accuracy of the inferred trees (as evidenced by the contour lines of accuracy crossing gap percentage thresholds in Figure ).

Earlier studies that have compared gap treatment methods have been confined to comparing their relative performances within a given inference method, particularly MP [

5,

6]. Therefore, this study was undertaken to provide users with a comparison of other commonly used inference methods as well. We find that the probabilistic methods are clearly superior to MP and NJ, irrespective of whether gaps are treated as missing data or binary characters. Treating gaps as binary characters implies the assignment of unambiguous phylogenetic signal to them in the evolutionary history of the sequences. Therefore, the number of gaps has little bearing on the distortion of the phylogenetic signal under the BC method. On the other hand, the MD method requires the inference of the missing state at each gapped site (or the summation of the likelihood for all four nucleotides at the gapped sites), a process that is bound to be strained with increasing number of gaps in the alignment. Therefore, it is easy to understand the relative superiority of the BC gap treatment method. It must be noted, of course, that this method can only contribute to phylogenetic accuracy as long as the alignment gaps are known without error (as in this study). Thus, the importance of the accuracy of sequence alignment cannot be underestimated.

The ML

*ε *method performed well in our study, although the Bayesian method was better, especially when the insertion-deletion ratio was 1:3 (Figure ). When compared to MP analysis (the other inference method that incorporated the BC), ML

*ε *was much better when the number of gaps was high, irrespective of the insertion-deletion ratio. Such methods hold the potential for more accurate reconstruction of phylogenies in the presence of large alignment gaps (also see [

15,

31]).

In addition to the MD treatment and the gap-coding treatments such as BC, other treatment methods exist, although not widely used anymore. One of these is pairwise deletion, a gap treatment method that is meaningful only when sequences are compared in a pairwise fashion, as in distance methods of inference, such as NJ. Moreover, it is an extremely rapid method that is suited to the speed of NJ. The other is complete deletion of entire columns of gapped sites from the alignment, which is a gap-treatment method that is applicable to any phylogenetic inference method. We did these analyses as well, because there is sometimes an uncertainty about which of these two methods is better [

2,

32]. The complete deletion of gaps posed a problem in our study as the number of sites that needed to be removed from the alignment, especially at higher substitution rates, caused the remaining sequence length to become so small that often at least one of the four nucleotides failed to be represented in the alignment. Therefore, we used this method only when the substitution rate was very low (

*r *≤ 0.2), and when the alignment length (remaining after complete deletion) for each replicate of a given sequence combination was at least 100 nts.

Since the complete deletion treatment could be used only for low substitution rates, the comparison between the two treatments is also made only across this range. Furthermore, since the pairwise deletion method can only be used in conjunction with the NJ method in this study, we compared the two methods only for NJ. Both methods are comparable at low to moderate gap percentages, but diverge thereafter in the accuracy of phylogenetic inference (not shown). It must also be noted that the gap percentage does not reach very high levels in the pairwise deletion as it does in the complete deletion method. Thus, while for a given gap percentage, the two treatment methods may be comparable in terms of phylogenetic accuracy, the pairwise removal of gaps appears to be better since the gap percentage is much lower with this method.

A comprehensive list and analysis of gap treatment methods may be found in Ogden and Rosenberg [

6] and Simmons Muller and Norton [

5]. However, they did not compare among phylogenetic inference methods, even for those gap-treatment methods that were common to multiple inference methods. In this study, while we do compare among gap-treatment methods, our emphasis is also on comparing among inference methods, insertion-deletion ratios, and the effect of the amount of gap on phylogenetic accuracy under varying parameters.

In order to better understand the influence of the alignment gaps on phylogenetic accuracy, we performed the same simulations, but with only base substitutions and no indels. As there were no gaps in the alignments, the data were subjected to phylogenetic analysis without any processing by means of gap treatment methods. The results of this analysis showed that, as expected, Maximum Likelihood and Bayesian analysis produced the most accurate trees, particularly at the highest substitution rates (not shown).

Another notable finding in this study is the differential influence of insertions and deletions on phylogenetic accuracy. Most of the commonly used gap treatment methods do not distinguish between insertions and deletions. Our results show that phylogenetic accuracy was lower when the insertion-deletion ratio was 1:3. Even the probabilistic methods (PhyML, ML*ε *and Bayesian), which produced the most accurate trees when insertions and deletions were introduced in equal numbers, performed somewhat poorly when the ratio was 1:3 (Figure , , , ). It therefore, appears important to develop methods that first distinguish between insertion and deletion events in the evolutionary history of the sequences in an alignment, and then treat them separately to add distinct signals to the phylogenetic analysis.

In this study, the metric we have used to measure the accuracy against is the percentage of gaps in the alignment, and this in turn has been measured mainly as G/S. Some studies have found that it is not the amount of data missing but rather the amount of data remaining that matters in determining the accuracy of the phylogeny being inferred [

10,

12,

33]. In order to compare our results with the results from these studies, we show the accuracy,

, the remaining number of nucleotides after the gaps are removed from the alignment, and the total length of the alignment resulting from the introduction of indels during the evolution of the sequences, all plotted as a function of G/S (Figure ). The layout of Figure is the same as that of Figure , with the left and right columns referring to insertion-deletion ratios of 1:1 and 1:3, respectively, and the inference methods arranged one below the other, in the same order, namely, NJ, PhyML, ML

*ε*, Bayesian analysis and MP.

One of the first things that stand out in Figure is the general accuracy of the MD method when the G/S is low and poor accuracy when G/S is high, irrespective of the inference method. Interestingly, when the accuracy curve in each graph is compared to the curve of the remaining number of nucleotides, there seems to be little relationship between the two in the left panels (1:1), again, irrespective of the inference method. Thus, even as the number of remaining nucleotides (red triangles) continues to be high for large G/S values, the

value (for MD treatment; dark red circles) plummets down to close to zero. This is because, although the remaining number of nucleotides is high, this is largely a consequence of insertion events having added nucleotides to the sequences. Thus, although there is data, there is little phylogenetic information in it, since homology across sequences at these levels becomes nebulous, at best, leading to low accuracy. On the other hand, the curve of remaining nucleotides itself drops with increase in the G/S value in the right column panels (1:3) – a reflection of the greater proportion of deletion events. Therefore, while the

values drop with increase in G/S in spite of an abundance of data in the left column panels, they do so in the right column panels evidently because of the loss of data as G/S increases (note the scale on the secondary Y axis). Thus, while the remaining amount of data may be an important determinant of accuracy (as in the right column in Figure , and as mentioned in [

10,

12,

33]), this is true only when homology among the sequences in the alignment can be established in the remaining character data. If, however, the remaining character data is largely a result of insertion events, the relationship is unlikely to hold, as seen in the left panels.

On the other hand, if the gaps are coded separately (e.g., as BC), then the phylogenetic signal present in the gaps (if the alignment is accurate) increasingly becomes the only information for the inference method to rely on, as G/S increases. The loss of signal from the character data is reflected in the decreased phylogenetic accuracy at high G/S values (left column). The greater loss of phylogenetic accuracy at medium G/S values in the right column panels of Figure can be attributed to fewer deletion events that are distinct and non-overlapping when compared to insertion events that are more likely to be distinct and non-overlapping, as the increase in the total length of the alignment with indel introduction will be much higher when the insertion-deletion rate ratio is 1:1.

In this study, we also found that the alignments from the random-branching tree yielded essentially the same results as those from the balanced tree, while those from the pectinate tree were different (not shown). The analyses from the pectinate tree data in general showed lower accuracy than the corresponding analyses from the balanced tree datasets. Furthermore, the relative performances of the different inference methods were not the same between the two model topologies. In particular, the relative performance of the PhyML method was worse when the topology contained pectinate branching.

This is a simulation-based study and is confined to certain specific simulation parameters and methods of gap treatment and phylogenetic inference used in this study. However, the choices of the parameter values have been made based on empirical studies in the literature. This included the size distribution of indels as well [

18], which may not be a critical feature as far as the BC treatment is concerned, but may be important when the state is inferred at the gaps or coded. Therefore, we believe that the results obtained in this study are sufficiently general to be useful to the community of molecular phylogeneticists. However, we must add a note of caution that while it is likely that the general results of this study will hold, the particulars may be dependent on the specific choices of simulation and other parameter values. Finally, the relationships between phylogenetic accuracy and gap percentage in this study were derived based on two unlikely events in empirical studies – knowledge of the true tree and a perfect alignment. These certainly are sources of uncertainty and/or error in real data analysis, and must be accounted for, in empirical studies. However, the utility of simulation-based studies such as this is that they serve to provide an assessment and quantification of relationships in the absence of confounding factors.