Multiple sequence alignment (MSA) is a fundamental task in molecular biology. An MSA is a prerequisite for virtually all comparative sequence analyses, including phylogeny reconstruction, functional motif or domain characterization, sequence-based structural alignment, inference of positive selection, and profile-based homology searches. All such analyses take the MSA input for granted, regardless of uncertainties in the alignment. Because errors in upstream methodology tend to cascade downstream, alignment errors are an important concern in molecular data analysis.
In the last decade, considerable efforts have been made to improve alignment accuracy (e.g., Notredame et al. 2000
; Edgar 2004
; Katoh et al. 2005
; Loytynoja and Goldman 2008
). Nevertheless, benchmark studies show that obtaining accurate alignments remains a challenging task. In such studies, a reference MSA is assumed to be the “true” alignment, the sequences are realigned using the MSA algorithm of interest, and the reconstructed MSA is compared with the reference MSA. In the Benchmark Alignment dataBASE (BAliBASE) benchmark database, for example, the reference alignment is based on superimposition of protein structures (Thompson et al. 2005
). Alternatively, simulations of sequence evolution can provide a set of sequences with a known history of insertions and deletions along a known evolutionary tree (e.g., Nuin et al. 2006
). The most widely used measures for the agreement of a reconstructed MSA with the reference are the column score (CS), which is the percentage of alignment columns in the reference alignment that were accurately reconstructed, and the sum-of-pairs score (SP), which is the percentage of pairs of aligned residues in the reference MSA that are similarly aligned in the reconstructed MSA (Carrillo and Lipman 1988
; Thompson et al. 1999
). A recent evaluation of SPs across the BAliBASE benchmark concluded that the best alignment programs to date achieve only 76% average accuracy, that is, a quarter of all residue pairs are incorrectly aligned (Nuin et al. 2006
There are several possible sources for errors in sequence alignment. To begin with, all MSA programs use heuristic methods. In contrast to pairwise sequence alignment that can be optimally solved under a given scoring scheme, finding the optimal MSA is computationally prohibitive. Thus, MSA programs usually produce a suboptimal alignment. Furthermore, even with optimal algorithms for pairwise sequence alignment, there are often several co-optimal solutions, that is, different alignments with the same maximal score. This issue affects all state-of-the-art MSA algorithms that are based on the “progressive alignment approach” (Feng and Doolittle 1987
) because they use an optimal pairwise alignment algorithm for iteratively adding sequences to the MSA. Notably, although progressive alignment approaches differ in the manner according to which post-alignment corrections and refinements are made, the progressive alignment step is a critical component in all of them. Landan and Graur (2007
) investigated this source of error and concluded that 80–90% of the columns and 40–50% of aligned residue pairs in a typical MSA are affected by uncertainty due to co-optimal solutions.
An additional point of concern is that the objective functions, which alignment algorithms attempt to maximize, are based on simplified models of the process of molecular sequence evolution. Such approximations may yield high scores for unrealistic alignments. Therefore, even if we had unlimited computational power to find the set of MSAs with the optimal score, we cannot be confident that it includes the true alignment because the true alignment may actually be suboptimal. Additionally, the stochastic nature of sequence evolution introduces noise on top of the signal, and thus, the true evolutionary history will often score less than the highest scoring alignment even if a perfect scoring function were available.
Finally, the alignment may be sensitive to errors in the guide tree, which is used for choosing the order in which the sequences are added to the growing MSA in the progressive alignment approach. Indeed, estimates of guide tree accuracy show that, on average, more than 10% of tree branches are topologically incorrect for data sets of 25 taxa, and this proportion increases with the number of taxa (Nelesen et al. 2008
). Several studies measured alignment accuracy in terms of the percentage of correctly aligned residues by comparing a reconstructed MSA with a reference benchmark MSA (e.g., Nelesen et al. 2008
; Landan and Graur 2009
). These studies concluded that the accuracy of the guide tree has a negligible effect on the accuracy score of the alignment. However, as we will show here, perturbations in the tree affect significant portions of the alignment, shifting residues one way or the other, even though the overall accuracy score does not change significantly. Therefore, we argue that guide tree uncertainty is an important source of alignment uncertainty.
All the above factors contribute to substantial errors in alignments produced by state-of-the-art MSA algorithms. Equally troubling is the fact that, with the notable exception of T-COFFEE (Notredame et al. 2000
), most of the widely used MSA programs do not provide information regarding the reliability of different regions in the alignment, for example, ClustalW, MUSCLE, MAFFT, and PRANK (Thompson et al. 1994
; Edgar 2004
; Katoh et al. 2005
; Loytynoja and Goldman 2008
). Distinguishing between accurate and noisy alignment regions is important for MSA-dependent analyses, which should try to avoid alignment regions of low quality. Only a few confidence measures for alignments have been published. In phylogeny reconstruction, it is common practice to remove alignment blocks suspect of low quality using the Gblocks program, which defines various cutoffs on the number of gapped sequences in an alignment column (Talavera and Castresana 2007
). However, these criteria may excessively filter out regions with insertion/deletion events that can be aligned reliably. A few alignment algorithms output site-specific scores that allow the selection of high-confidence regions. Such a service was first offered by the SOAP program (Loytynoja and Milinkovitch 2001
), which tests the robustness of each column to perturbation in the parameters of the popular alignment program ClustalW. The T-COFFEE Web server (Poirot et al. 2003
) uses a library of alignments in the construction of the final MSA, and its output MSA is colored according to confidence scores that reflect the agreement between different alignments in the library regarding each aligned residue. Another alignment program that can output an MSA with confidence scores is FSA (Bradley et al. 2009
), which uses a statistical model that allows calculation of the uncertainty in the alignment. Similarly, the Heads-or-Tails (HoT) score can be used as a measure of site-specific alignment uncertainty due to the co-optimal solutions problem mentioned above (Landan and Graur 2007
). However, none of these confidence measures account for uncertainties in the guide tree.
Perhaps the most statistically justified approach to assess alignment uncertainty is the use of probabilistic evolutionary models accounting jointly for phylogeny and alignment, as in the programs BEAST and BAli-Phy (Lunter et al. 2005
; Redelings and Suchard 2005
). These methods use a Bayesian approach that allows calculation of posterior probabilities of the estimated phylogeny and alignment, which is a measure of the confidence in these estimates across the whole solution space. In comparison, in the approach presented here and the previously published HoT score, perturbations are made to the input of deterministic alignment algorithms, such as ClustalW, which were not designed to consider suboptimal solutions. Therefore, in theory, we should prefer the Bayesian approach. However, in practice, the Bayesian approach is infeasible for all but the smallest data sets. For example, the README page of the BAli-Phy Web site recommends “using 12 or fewer taxa in order to limit the time required … .” Even for data sets of few taxa, when genome-wide analyses are concerned, the computational burden of Bayesian algorithms may not be affordable. At least in the near future, it is unlikely that the Bayesian approach will be used in more than a small fraction of comparative genetic research.
In this paper, we show that uncertainties in the guide tree have a considerable effect on the robustness of the MSA. Subsequently, we develop a measure quantifying this effect as a confidence score for each column and for each residue in the alignment based on the robustness of their alignment with respect to perturbations in the guide tree. Our measure is based on the bootstrap (BP) method, which is widely used for assigning confidence scores to branches in reconstructed phylogenetic trees. Benchmark studies with BAliBASE as well as with simulated sequences show that our alignment confidence scores are a good predictor of alignment accuracy, significantly improving on the HoT scores. Therefore, we conclude that guide tree uncertainty is an important source of error in sequence alignment and that MSA-based analyses should take into account site-specific confidence scores in order to avoid artifacts.