In the Introduction we highlighted four design criteria which we emphasized in the development of FSA. The first goal was to find alignments with high expected accuracy, where an accurate alignment has minimal distance to the truth. This objective function is markedly different from both the maximum-likelihood approaches used by programs such as ClustalW and MUSCLE and the maximum expected sensitivity approaches used by programs such as ProbCons and Pecan. Note that while the objective function used in ProbCons is called “maximum expected accuracy” in the paper 
, it is actually a maximum expected sensitivity objective function, where there is no penalty for over-aligning sequence (c.f., the results shown in ). Their objective function can be recovered as a special case of our approach by ignoring the gap probabilities in FSA's objective function (Text S1
, “The mathematics of distance-based alignment”). FSA's explicit search for the most accurate, rather than most likely or most sensitive, alignment is what allows FSA to so dramatically outperform most other programs on tests on unrelated sequence ().
We believe that this accuracy criterion, which gives equal weight to the correctness of all sequence positions, is a natural measure of alignment quality. Downstream analyses, such as phylogenetic reconstruction and evolutionary constraint analysis, are increasingly using indels in addition to homologous characters for more accurate estimation (e.g., 
). Thus, it is important that alignments be as “evolutionarily correct” as possible 
, which is the purpose of the accuracy criterion.
FSA's strong performance under the accuracy criterion is due to techniques such as its iterative refinement as well as its explicit attempt to maximize the expected accuracy; programs which explicitly seek to maximize an objective function of the posterior probabilities of character alignment, such as ProbCons or Probalign, could instead seek to maximize the expected accuracy described here and, as a likely result, increase their robustness to non-homologous sequence. However, while we believe that the expected accuracy is a biologically-sensible objective function, it may not be appropriate if the user desires the most sensitive alignment. While FSA can produce the most-sensitive RNA alignments, other programs can produce more sensitive alignments of proteins and genomic sequence, albeit generally at the cost of a tendency to align non-homologous sequence ().
The second goal was to create alignments which are robust to evolutionary distances and different functional constraints on patterns of molecular evolution. FSA's unsupervised query-specific learning for parameter selection frees the user from unknown systematic biases implicitly introduced by using an alignment program whose parameters were trained on a dataset whose statistics may not reflect those of the sequences to be aligned. For example, it is traditionally challenging to align sequences with unusual base composition, but FSA can easily handle such alignments by automatically learning appropriate parameters. As indicated by our ablation analysis, query-specific learning further increases FSA's robustness to non-homologous sequences beyond that offered by the minimum-distance objective function alone.
We believe that FSA's unsupervised query-specific learning is the first time a multiple alignment program has been capable of dynamically learning a complete parameterization, wherein parameters can vary for each pair of sequences to be compared, on the fly. This learning method is related to the “pre-training” option in ProbCons, which permits users to learn different models for families of homologous sequences, but does not permit parameterizations to vary between sequence pairs. We also note that the MORPH program for pairwise alignment of sequences with cis
-regulatory modules learns model transition parameters from data 
. While supervised training on curated data can give superior performance on test sets which are statistically-similar to the training data, the practical alignment problems encountered everyday by biologists do not fit into this rigid problem setup. Query-specific learning consistently gives reasonable performance.
The third and fourth goals, to develop a single, modular program which can address all typical alignment problems encountered by biologists, are naturally achieved within FSA's architecture. While almost all alignment programs are designed to either align many short sequences or a few long sequences, we have demonstrated that it is feasible to create a single program which can address both situations. This is made practical by FSA's modular nature, where the statistical model for pairwise comparisons, the anchoring scheme for finding homology between long sequences, and the sequence annealing procedure are entirely separate and can be individually modified and improved. For example, the parallelization of FSA was designed and developed with minimal changes to the rest of FSA's code base. Similarly, while FSA's basic anchoring relies only on exact matches from MUMmer, the anchoring scheme was easily extended to incorporate inexact matches from exonerate 
and alignment constraints from Mercator 
. In fact, this flexibility permits FSA to incorporate almost any sources of potential homology information, from predicted transcription factor binding sites to entire gene models. One natural extension of FSA's approach is to models of RNA alignment which take structure into account. The program Stemloc-AMA 
uses a model of the pairwise evolution of RNA secondary structure in conjunction with the sequence annealing algorithm to create accurate multiple alignments of structured RNAs. By using Stemloc-AMA's probabilistic model rather than a Pair HMM and taking advantage of techniques such as query-specific learning, FSA could sum over possible pairwise structural alignments in order to get better estimates of posterior probabilities of character alignment.
FSA is a statistical alignment program insofar as it uses an explicit statistical model of alignments and a probabilistic objective function for optimization, but as discussed in “Theoretical justification of distance-based alignment” (Text S1
), it also is a distance-based approximation to the “phylogenetic alignment” models of alignments on trees 
. While traditional phylogenetic alignment algorithms are currently too computationally-expensive to use on datasets of more than a few sequences, FSA's distance-based method allows biologists to use the sophisticated tools of statistical alignment algorithms on practical problems. Furthermore, while we have not addressed the phylogenetic aspects of FSA in detail in this paper, our methods can be adapted to incorporate a complete phylogenetic model (Text S1
, “The mathematics of distance-based alignment”). However, we believe that FSA's current approach, which is agnostic to phylogeny, offers many practical advantages for common genomics analyses. For example, because FSA uses a sum-of-pairs objective function, there is no guide tree to potentially bias downstream phylogenetic reconstructions based on the alignment. Similarly, while most genomic alignment programs require a species tree to perform the alignment, our phylogeny-free approach will be avoid potential biases introduced by using a single species tree to align regions which may have undergone recombination.