|Home | About | Journals | Submit | Contact Us | Français|
Wrote the paper: RKB CD LP. Led the development of FSA, wrote most of the code base, and developed the query-specific learning method: RKB. Redesigned the sequence annealing algorithm, constituted the core development team, and managed the project: RKB CD LP. Developed the GUI: AR. Developed a preliminary version of the GUI: MS. Developed the iterative refinement technique: SJ. Developed the parellelization and database modes: JD CD. Provided advice on the dart library, including its algorithms, programming and software components: IH. Created the FSA webserver: LP.
We describe a new program for the alignment of multiple biological sequences that is both statistically motivated and fast enough for problem sizes that arise in practice. Our Fast Statistical Alignment program is based on pair hidden Markov models which approximate an insertion/deletion process on a tree and uses a sequence annealing algorithm to combine the posterior probabilities estimated from these models into a multiple alignment. FSA uses its explicit statistical model to produce multiple alignments which are accompanied by estimates of the alignment accuracy and uncertainty for every column and character of the alignment—previously available only with alignment programs which use computationally-expensive Markov Chain Monte Carlo approaches—yet can align thousands of long sequences. Moreover, FSA utilizes an unsupervised query-specific learning procedure for parameter estimation which leads to improved accuracy on benchmark reference alignments in comparison to existing programs. The centroid alignment approach taken by FSA, in combination with its learning procedure, drastically reduces the amount of false-positive alignment on biological data in comparison to that given by other methods. The FSA program and a companion visualization tool for exploring uncertainty in alignments can be used via a web interface at http://orangutan.math.berkeley.edu/fsa/, and the source code is available at http://fsa.sourceforge.net/.
Biological sequence alignment is one of the fundamental problems in comparative genomics, yet it remains unsolved. Over sixty sequence alignment programs are listed on Wikipedia, and many new programs are published every year. However, many popular programs suffer from pathologies such as aligning unrelated sequences and producing discordant alignments in protein (amino acid) and codon (nucleotide) space, casting doubt on the accuracy of the inferred alignments. Inaccurate alignments can introduce large and unknown systematic biases into downstream analyses such as phylogenetic tree reconstruction and substitution rate estimation. We describe a new program for multiple sequence alignment which can align protein, RNA and DNA sequence and improves on the accuracy of existing approaches on benchmarks of protein and RNA structural alignments and simulated mammalian and fly genomic alignments. Our approach, which seeks to find the alignment which is closest to the truth under our statistical model, leaves unrelated sequences largely unaligned and produces concordant alignments in protein and codon space. It is fast enough for difficult problems such as aligning orthologous genomic regions or aligning hundreds or thousands of proteins. It furthermore has a companion GUI for visualizing the estimated alignment reliability.
The field of biological sequence alignment is very active, with numerous new alignment programs developed every year in response to increasing demand driven by rapidly-dropping sequencing costs. The list of approximately 60 sequence alignment programs on the wikipedia compilation provides a lower bound on the number of available tools and illustrates the confusing choice facing biologists who seek to select the “best” program for their studies. Nevertheless, the ClustalW program ,, published in 1994, remains the most widely-used multiple sequence alignment program. Indeed, in a recent review of multiple sequence alignment , the authors remark that “to the best of our knowledge, no significant improvements have been made to the [ClustalW] algorithm since 1994 and several modern methods achieve better performance in accuracy, speed, or both.” Therefore, it is natural to ask, “Why do alignment programs continue to be developed, and why are new tools not more widely adopted by biologists?”.
A key issue in understanding the popularity of ClustalW is to recognize that it is difficult to benchmark alignment programs. Alignments represent homology relationships among the nucleotides, or amino acids, of the genomes of extant species, and it is impossible to infer the evolutionary history of genomes with absolute certainty. Comparisons of alignment programs therefore rely on databases of structural alignments for proteins and RNA or on gene loci or simulated data for DNA. Each type of benchmark is vulnerable to manipulation and furthermore may not represent the problem setups which are most relevant to biologists. The result is that biologists are confronted with many programs and publications, but it is frequently unclear which approach will give the best results for the everyday problems which they seek to address.
Adding to the difficulty of selecting software tools is the blur between programs and ideas. Developers of alignment programs make choices about the objective functions to optimize, the statistical models to use, and the parameters for these models, but the relative impact of individual choices is rarely tested . Discordance among programs is frequently noted , but the different architectures of individual programs, and in some cases the lack of open software, makes it difficult to explore novel combinations of existing ideas for improving alignments.
In lieu of these issues, biologists have favored the conservative approach of using the tried and trusted ClustalW program, although they frequently use it in conjunction with software which allows for manual editing of alignments . The rationale behind alignment-editing software is that trained experts should be able to correct alignments by visual inspection and that effort is better expended on manually correcting alignments than searching for software that is unlikely to find the “correct” alignment anyway. Although manual editing approaches may be cumbersome, they have been used for large alignments (e.g., ).
We therefore approached the alignment problem with the following goals in mind:
The “distance-based” approach to sequence alignment, proposed in  and implemented in the protein alignment program AMAP , offers a useful framework for these goals. Much as distance-based phylogenetic reconstruction methods like Neighbor-Joining build a phylogeny using only pairwise divergence estimates, a distance-based approach to alignment builds a multiple alignment using only pairwise estimations of homology. This is made possible by the sequence annealing technique  for constructing multiple alignments from pairwise comparisons.
We have implemented our approach in FSA, a new alignment program described below. We give an overview of the structure of FSA and explain the details of its components below. Text S1 contains detailed instructions for using the FSA program and webserver as well as FSA's companion programs for comparing alignments and working with whole-genome alignments.
Figure 1 shows an overview of the components of the FSA alignment algorithm, described in detail below.
The input to FSA is a set of protein, RNA or DNA sequences. These sequences are assumed to be homologous, although FSA is robust to non-homologous sequence. The output of FSA is a global alignment of the input sequences which is a (local) optima of the expected accuracy under FSA's statistical model.
FSA first performs pairwise comparisons of the input sequences to determine the posterior probabilities that individual characters are aligned (note, however, that it does not yet actually align any sequences). While the number of possible pairwise comparisons is quadratic in the number of sequences being aligned, giving unfavorable runtimes for datasets of many sequences, FSA overcomes this problem by reducing the number of pairs of sequences that are compared. This is accomplished using a randomized approach inspired by the Erdös-Rényi theory of random graphs, thereby drastically reducing the computational cost of multiple alignment.
After obtaining pairwise estimates of homology at the single-character level, FSA uses the sequence annealing technique  to construct a multiple alignment. This approach to alignment seeks to maximize the expected accuracy of the alignment using a steepest-ascent (greedy) algorithm.
The architecture of FSA reflects the inherent modularity of the distance-based approach to alignment. FSA's inference engine uses the flexible HMMoC code-generation tool  and has been parallelized with a separate module, alignments of long sequences are anchored with candidate matches found by the MUMmer suffix trie matching tool  or the exonerate homology-search program , and FSA's sequence annealing algorithm improves on the original algorithm and implementation in AMAP . The stand-alone visualization tool uses statistical information produced by FSA, but otherwise is completely independent.
Each of these components can be improved independently of the others, allowing for rapid future improvements in distance-based alignment. For example, FSA's entire statistical model could easily be altered to incorporate position-specific features or completely replaced with a discriminative or hybrid generative-discriminative model.
The components described here correspond roughly to the simplest mode of operation for FSA, outlined in bold in Figure 1.
FSA accepts FASTA-format input files and produces alignments in multi-FASTA or Stockholm format. The server also outputs PHYLIP and ClustalW formatted files.
Distance-based alignment, relying on pairwise estimations of homology, operates on the posterior probability distributions that characters in two sequences are aligned. FSA uses the standard three or five-state pair hidden Markov model (Pair HMM) shown in Figure 2 to infer these posterior probabilities of alignment, as well as posterior probabilities that characters are unaligned or gapped. The structure of the Pair HMM and its parameters can be controlled through the command-line options (see Text S1 for details).
The standard Forward-Backward algorithm on a Pair HMM has time complexity for two sequences of length .
After calculating the posterior probabilities of alignment for characters in all sequence pairs, that individual characters and are aligned and that a character is gapped to sequence , FSA sorts these probabilities according to a weighting function which gives a hill-climbing procedure which is a steepest-ascent algorithm in the weighting function (Text S1, “The mathematics of distance-based alignment”).
After estimating these posterior probabilities and sorting them, FSA creates a multiple alignment with the sequence annealing technique . Sequence annealing attempts to find the alignment with the minimum expected distance to the truth (), computed for two sequences and as
The distance between two alignments is defined as the number of positions for which they make different homology statements, where the homology statement for is either of the form ( is homologous to ) or ( is not homologous to any position in ) . As a simple count of differing statements of homology (and non-homology), this distance has an intuitive biological interpretation. When one of the alignments is the true alignment, this distance can be thought of as the “evolutionary correctness” of the other, where the correctness of the alignment for each sequence position is equally weighted.
The alignment with the minimum expected distance to the truth is equivalent to the alignment with the maximum expected accuracy,
where we define the accuracy of an alignment with respect to a reference, “true” alignment as the fraction of positions for which they make identical homology statements. In contrast with traditional measures of sensitivity and specificity, accuracy takes into account all positions, rather than just those that are predicted to have a homolog. (Note that it linearly penalizes incorrectly-placed gaps.)
The posterior probabilities over alignments used in the optimization are given by FSA's statistical model (a Pair HMM). FSA extends this definition of an optimal pairwise alignment to an optimal multiple alignment by taking sum-of-pairs over all sequences.
Using this expected accuracy as an objective function for a greedy maximization, sequence annealing begins with the null alignment (all sequences unaligned) and merges single columns (aligns characters) according to the corresponding expected increase in , the similarity to the truth under FSA's statistical model. Whereas progressive alignment methods take large steps in alignment space by aligning entire sequences at each step, the distance-based approach takes the smallest-possible steps of aligning single characters.
“The mathematics of distance-based alignment” in Text S1 gives an in-depth discussion of the objective function and how it arises naturally from FSA's representation of an alignment as a partially ordered set (POSET) or directed acyclic graph (DAG).
In FSA's definition of an alignment, an alignment consists solely of a specification of homology. This definition differs from the standard definition of a multiple alignment, which implies an evolutionary history of substitution and indel events. For example, FSA (internally) considers the two alignments shown in Figure 3 to be equivalent.
This is problematic for comparative genomics analyses which use an alignment to infer evolutionary parameters such as indel frequencies across a clade. In order to output a single global alignment from which such evolutionary parameters can be inferred, we choose a topological ordering of the alignment POSET which corresponds to a maximum-parsimony interpretation of indel events. FSA outputs the global alignment with the minimum number of “gap openings” across the individual sequences (the right-hand alignment in Figure 3). As proved in Text S1, FSA can accomplish this in time linear in the number of sequences and sequence length.
FSA can efficiently align hundreds or even thousands of sequences. By default it performs exhaustive distance-based alignment, which requires an all-pairs comparison of the sequences, costing . However, this prohibitive computational cost can be sharply reduced by only performing pairwise comparisons on a subset of all possible sequence pairs.
In order to ensure a complete alignment, where no sequence is left unaligned, each sequence must be connected to every other sequence by a series of pairwise comparisons. For input sequences, a minimum of pairwise comparisons are necessary to give a complete alignment; this corresponds to building a spanning tree on the sequences. While this is sufficient to give a complete alignment on the input sequences, the results will depend heavily on which pairwise comparisons are used to construct the alignment and many choices may give poor alignments. Developing a good theory of which pairs to use to construct the best alignment with the fewest comparisons—how to select a randomized subset of pairs for comparison which falls between the extremes of and pairs—remains an open problem.
We therefore chose to use a completely-randomized approach inspired by results from the theory of Erdös-Rényi random graphs. By the Erdös-Rényi theory, a random graph will almost surely be connected if the edge-creation probability satisfies , which is very low for large ( is a small positive number). Guided by this result, FSA performs fast alignments by first constructing a spanning tree on the input sequences and then performing each possible pairwise comparison with some probability proportional to the connectedness threshold. The savings are dramatic—for sequences, the Erdös-Rényi threshold probability is 0.007, corresponding to an algorithm which is over 100 times as fast as an all-pairs comparison—and we have observed little loss of accuracy from considering only a subset of pairs.
This speedup reduces the time complexity of both inference and sequence annealing. The cost of inference is reduced to and the “worst average case” runtime of sequence annealing to , where we align sequences of length by making pairwise comparisons (Text S1, “The mathematics of distance-based alignment”).
FSA can align megabase-long sequences using an “anchor annealing” strategy. Analogously to other genome alignment programs such as MAVID , MLAGAN , CHAOS/DIALIGN  and Pecan , FSA uses long matches to anchor regions of the alignment and performs inference with dynamic programming in between anchors. FSA's basic anchoring mode uses the fast suffix trie matching program MUMmer  to find candidate anchors and can find anchors in either nucleotide or protein space (by translating the sequence in all frames). FSA requires that anchors be maximal unique matches in both sequences (“MUMs”). The restriction to unique matches helps to prevent false-positive anchors due to, e.g., repetitive sequence; for example, a microsatellite can appear as a candidate anchor only if it appears exactly once, with identical copy number, in each sequence.
FSA utilizes its distance-based approach to find a consistent set of anchors across all sequences simultaneously, thereby making maximal use of additional constraints from other sequences. This “anchor annealing” strategy is conceptually similar to the procedures used in programs for aligning long sequences such as CHAOS/DIALIGN, MAVID, Pecan and TBA, which return partially-ordered sets of anchors, thereby permitting constraints to be projected across multiple sequences.
As with sequence annealing, this “anchor annealing” can be accomplished efficiently with a greedy algorithm based on the Pearce-Kelly algorithm. FSA uses the same code for both sequence and anchor annealing, although the objective function is different: Anchor “scores” correspond to p-values under a null model rather than probabilities of homology, and so there are no “gap” probabilities or which contribute to the anchor-annealing analog of the expected accuracy .
Rather than aligning entire anchors across the multiple alignment in order to find a consistent set of anchors, FSA finds a set of anchor centroids which are consistent across all sequences and then prunes the resulting anchors at a pairwise level. The result is a set of anchors between pairs of sequences whose centroids are consistent across all sequences and which are non-overlapping between pairs of sequences. This heuristic approach permits FSA to quickly find consistent anchors across many sequences.
After finding a consistent set of anchors across the multiple alignment, FSA assumes that these anchors are correctly aligned with probability ~1 and then aligns the regions between anchors using dynamic programming. When anchored alignment is performed, the posterior probabilities that individual characters are aligned, which FSA uses to inform construction of the multiple alignment, are conditioned on the set of anchors chosen. Therefore, if all anchors correspond to true homology then these probabilities will be correctly estimated despite the anchoring heuristic; however, if incorrect anchors are chosen, then individual probabilities of alignment can be similarly incorrect.
While FSA's restriction of candidate anchors from MUMmer to MUMs produces a very specific set of anchors, the restriction can be too stringent to obtain sensitive alignments of diverged or very long sequences, for which there are few unique exact matches. To address this potential problem, FSA can use the fast homology-search program exonerate  to find inexact matches to use as anchors as well. Furthermore, when performing whole-genome alignment, homologous genomic regions are typically first identified with a program such as Mercator  and then each region is aligned with a nucleotide-level alignment program. FSA can use the seed matches, frequently coding genes, which define the homologous genomic regions to inform its anchor annealing.
Because FSA uses the fast tool MUMmer to find anchors, it can rapidly align closely-related sequences, such as mitochondrial DNA, for which anchors span most of the alignment, making costly dynamic programming largely unnecessary.
The distinct functional constraints acting on biological sequences give rise to very different patterns of molecular evolution, each implying distinct parameterizations of an appropriate model for alignment. For example, if substitutions or indels are more frequent in one lineage than in the others, then using a single model for all sequences (which does not reflect these differing constraints) can give misleading results. Nonetheless, sequence alignment algorithms traditionally use a single model for all sequences.
In order to overcome these difficulties, FSA uses “query-specific learning,” wherein a different model is learned for each pairwise comparison (the “query”). This is done in a completely unsupervised framework: FSA uses an unsupervised Expectation Maximization (EM) algorithm to estimate transition (gap) and emission (substitution) probabilities of the Pair HMM on the fly.
Despite its unsupervised nature, FSA's query-specific learning needs remarkably little sequence data to effectively learn parameters. Standard alignment algorithms estimate parameters from thousands or tens of thousands of pairs of aligned sequences; in contrast, we empirically observe good results with as little input data as two unaligned DNA or RNA sequences of length 60 nucleotides or four unaligned protein sequences of length 266 amino acids. These figures correspond to observing each of the independent parameters of a substitution matrix four times.
While FSA learns distinct transition parameters for every pair of query sequences regardless of the sequence composition, it uses different learning strategies for nucleotide and amino acid emission matrices. Because a pair emission matrix over aligned nucleotides has only (42−1)=15 free parameters, FSA can learn a different emission distribution for every pairwise comparison of all but the shortest RNAs or DNAs (this allows FSA to learn a different model for each pair of unanchored subsequences when performing anchored aligment). In contrast, emission matrices over aligned amino acids have (202−1)=3,999 free parameters, thereby preventing FSA from learning independent models for each pair of proteins. FSA therefore learns a single emission matrix using an all-pairs comparison for protein sequences.
Because FSA uses unsupervised learning on very sparse data, overfitting is a serious concern. FSA attempts to prevent overfitting by (1) using a weak Dirichlet regularizer (prior) when estimating both transition and emission probabilities, and (2) terminating parameter learning before the EM algorithm converges. By default the Dirichlet emission priors are scaled such that total number of pseudocounts for aligned characters is equal to the number of free parameters in a symmetric pair emission matrix. As is the case for other machine-learning algorithms, it can be shown that termination before convergence of query-specific learning is equivalent to a form of regularization (likelihood penalty).
If there is insufficient sequence data for effective learning, then FSA does not estimate parameters and instead uses default parameters to construct an alignment. The default parameters values, as well as seeds used for the learning algorithm, are discussed in Text S1.
While FSA can align hundreds or thousands of sequences on a single computer, it can handle truly large-scale problems by running in a parallelized mode on a computer cluster. FSA's distance-based approach to alignment gives the multiple alignment a natural independence structure: each pairwise alignment is independent of all other pairs, allowing dramatic runtime reduction by distributing the individual pairwise computations to different processors.
Two factors were considered for the parallelization of FSA : (1) communication overhead between nodes, and (2) workload distribution over the different processors. For example, distributing jobs in very small batches may reduce processor idle time but lead to high overhead; in contrast, using large batches may increase idle time but minimize overhead. FSA's parallelization mode uses the “Fixed-Size Chunking” strategy described in , whereby each of the processors runs on chunks of pairwise comparisons.
While the pairwise comparisons can be naturally parallelized, sequence annealing does not have the same obvious independencies. Therefore, even when running in parallelized mode, FSA performs sequence annealing on a single node. The parallelization of the annealing step is a future aim for this project. A schematic of the current parallelization strategy is given in Figure 4.
As a greedy algorithm, sequence annealing is only guaranteed to find a local optima of the expected accuracy. FSA therefore uses an iterative refinement strategy after sequence annealing terminates to locally improve the alignment. For each round of iterative refinement, FSA looks at every character's position in the multiple alignment and sees whether the objective function can be improved by moving it to another position (without violating the consistency constraints of the multiple alignment). FSA assembles a list of such candidate character shifts, orders the list by the expected change in the objective function, and then performs the shifts. Iterative refinement terminates when there are no candidate shifts which improve the objective function.
FSA's included GUI permits the user to visually assess alignment quality under FSA's statistical model according to estimates of different measures, including expected accuracy, sensitivity, specificity, consistency and certainty. This permits biologists and bioinformaticians to incorporate reliability measures into downstream analyses, such as evolutionary rate measurements and phylogenetic reconstruction. Incorporating such information can produce distinctly different results. For example, over-aligned non-conserved sequence can cause a systematic bias towards long branch lengths; this can be ameliorated by incorporating the expected accuracy statistics produced by FSA into reconstruction algorithms. Figure 5 shows a sample protein alignment colored by the expected alignment accuracy under FSA's statistical model as well as the true accuracy (based on a reference structural alignment).
FSA's GUI can color alignments according to five different measures of alignment quality, which are approximated under its statistical model. Characters in a multiple alignment can be colored according to:
See Text S1 for detailed descriptions of how these measures are defined and calculated using FSA's statistical model.
The GUI also provides a visual and statistical guide when manually editing alignments.
We benchmarked FSA against databases of multiple alignments compiled from reference structural alignments, including protein databases (BAliBASE 3  and SABmark 1.65 ), small RNA databases (BRAliBase 2.1 ), large RNA databases (Consan mix80 ), and both mammalian  and fly , simulated DNA alignments.
Alignment programs are commonly used to detect homology among input sequences. We conducted a series of false-positive experiments to test whether commonly-used alignment programs can reliably identify homologous and non-homologous sequence. Surprisingly, we found that for most alignment programs, aligned sequences are not necessarily homologous, indicating that biologists should use caution when interpreting the multiple alignments produced by many commonly-used tools.
We additionally performed a simple test of the consistency of common programs when aligning coding sequence: We aligned 1,502 genes orthologous across seven species of yeast in both nucleotide and protein space and compared the resulting alignments. Many programs gave surprisingly discordant results, indicating that at least one of these two alignments produced by commonly-used programs is incorrect.
Table 1 shows benchmarks of FSA and other alignment programs, including AMAP , ClustalW ,, DIALIGN ,, MAFFT , MUMMALS , MUSCLE , Probalign , ProbCons , T-Coffee , and SeqAn::T-Coffee , against the BAliBASE 3  and SABmark 1.65 databases . FSA in maximum-sensitivity mode had accuracy similar to those of the better-performing programs on BAliBASE 3 and had accuracy superior to that of any other program on SABmark 1.65 when run in default mode. FSA had higher positive predictive values than any other tested program on all datasets. Remarkably, FSA was the only tested program which achieved a mis-alignment rate <50% on the standard SABmark 1.65 datasets; all other programs produced more incorrect than correct homology statements.
In order to test the robustness of alignment programs to incomplete homology, we modified the BAliBASE 3 database such that every alignment included a single false-positive, an unrelated (random) sequence. This is a realistic setup for biologists who might want to decide whether a sequence is orthologous to a particular protein family. With the exception of FSA, the tested alignment programs suffered a false-positive rate increased by over 25% on this BAliBASE 3+fp dataset, indicating that they aligned the random sequence with the homologous set. In contrast, FSA left the random sequence unaligned and had an essentially-unchanged false-positive rate.
BRAliBase 2.1 was assembled from the RFAM  RNA database and consists of small RNAs (average length of ~150 nt). FSA gave improved performance even on this high-identity dataset where most programs did relatively well.
The Consan mix80 dataset of alignments of Small and Large Subunit ribosomal RNAs, assembled from the European Ribosomal RNA database , was created for training RNA structural alignment programs and provides a test of alignment programs on difficult, large-scale alignments. The four alignments contain from 107 to 254 sequences, each 1–4 kilobases in length, with average percentage identity less than <50%. Two tested alignment programs, ProbConsRNA  and SeqAn::T-Coffee, were unable to align these large datasets. This dataset demonstrates that FSA's alignment speedup options, including performing inference only on a subset of all possible pairs (–fast) and anchoring alignments instead of using the full dynamic programming matrix (–anchored), are effective heuristics for large datasets.
Table 3 shows benchmarks of FSA and other genomic alignment programs, including CHAOS/DIALIGN , DIALIGN-TX ,, MAVID , MLAGAN , Pecan  and TBA , on simulated alignments of both mammalian and Drosophila DNA sequences. FSA produced higher-accuracy alignments than the other programs on the two Drosophila datasets and only Pecan gave better alignments of the mammalian sequences.
The simulated alignments of nonfunctional DNA sequences from nine mammals (human, chimp, baboon, mouse, rat, cat, dog, cow, and pig) were created by Blanchette et al. . The simulated alignments of DNA from the twelve species of Drosophila described in  were created with two simulation programs, DAWG  and simgenome . As described in , the simulated Drosophila genomic alignments were created by parameterizing the DAWG and simgenome programs using whole-genome alignments produced by Pecan for . Although two authors (RKB and IH) of this manuscript contributed to the simgenome program, simgenome was developed prior to FSA and did not influence or contribute to the methodology described here for FSA.
FSA's strong performance on all three datasets of simulated long DNA sequences indicate that it is a useful and accurate tool for genomic alignment.
In order to further test the appropriateness of using popular alignment programs to detect homology between sequences, we conducted a large-scale random-sequence experiment. We generated datasets of random sequences to simulate unrelated protein, short DNA, and genomic (long) DNA sequences. The results, shown in Table 4 and Table 5, clearly demonstrate that while for most alignment programs, aligned sequences are not necessarily homologous, FSA leaves random sequences largely unaligned.
Biologists commonly align coding regions in both amino acid and nucleotide space, but there have been few studies of the effectiveness of alignment programs across the two regimes. We tested the consistency of alignment programs on coding sequence by aligning all 1,502 genes in Saccharomyces cerevisiae identified as having orthologs in the six related yeast species S. paradoxus, S. mikatae, S. kudriavzevii, S. bayanus, S. castellii, and S. kluyveri (; this dataset was also analyzed in ). As shown in Table 6, alignments produced by FSA had higher concordance (0.943) than those produced by any other program. If a program produces alignments with low concordance between amino acid and nucleotide alignments, then at least one of the alignments must be incorrect (and quite possibly both are).
This simple test additionally indicates the effectiveness and robustness of FSA's query-specific learning. While very different learning procedures are used for amino acid and nucleotide sequence, the concordant alignments inferred by FSA indicate that our results are robust with respect to the details of the learning procedure.
We conducted an ablation analysis of FSA's components to test the effectiveness of each component and ensure that they all contributed to the accuracy of the program. As indicated by the results in Tables 7– 10,10, each optional component of FSA contributes to its accuracy.
The importance of each component depends strongly upon the alignment problem. The –fast heuristic for reducing the number of sequence pairs used to construct an alignment results in little loss of accuracy, at least on the benchmarks used in this paper (Tables 7 and and8).8). As indicated by the small and long RNA benchmarks (Table 8), iterative refinement is important for aligning many sequences and less so for small alignment problems. The anchor annealing procedure appears to be an effective heuristic for aligning long sequences. Anchoring with unique matches (MUMs) causes only a negligible loss of accuracy on the long RNA dataset (Table 8). However, results on simulated long DNA sequences (Table 9) demonstrate that inexact matches, such as those returned by exonerate, must be used during anchor annealing to obtain high sensitivity on very long or distant nonfunctional DNA sequences. Nonfunctional DNA lacks the local constraints which preserve exact matches across distant species in functional (e.g., coding) sequence. Query-specific learning is important for maintaining FSA's robustness to non-homologous sequence. While FSA aligned only 4% of random protein sequences in default mode, when run without learning it aligned 13% (Table 10), similar to the 14% aligned by AMAP (Table 4).
Biologists commonly perform alignments of hundreds or thousands of 16S ribosomal DNA sequences in order to elucidate evolutionary relationships and build phylogenetic trees. We performed alignments of prokaryotic 16S sequences to compare the speed of commonly-used programs (Table 11). MAFFT was the fastest method by an order of magnitude; MUSCLE and FSA were the next-fastest methods. Many alignment programs were unable to align these large datasets.
The results in Table 12 and Table 13 demonstrate the effectiveness of FSA's parallelization mode. Parallelizing the pairwise comparisons dramatically reduces runtime: When running in –fast mode on a small cluster with 10 processors, FSA can align 500 16S sequences in 20% of the time required without parallelization.
In the Introduction we highlighted four design criteria which we emphasized in the development of FSA. The first goal was to find alignments with high expected accuracy, where an accurate alignment has minimal distance to the truth. This objective function is markedly different from both the maximum-likelihood approaches used by programs such as ClustalW and MUSCLE and the maximum expected sensitivity approaches used by programs such as ProbCons and Pecan. Note that while the objective function used in ProbCons is called “maximum expected accuracy” in the paper , it is actually a maximum expected sensitivity objective function, where there is no penalty for over-aligning sequence (c.f., the results shown in Table 4). Their objective function can be recovered as a special case of our approach by ignoring the gap probabilities in FSA's objective function (Text S1, “The mathematics of distance-based alignment”). FSA's explicit search for the most accurate, rather than most likely or most sensitive, alignment is what allows FSA to so dramatically outperform most other programs on tests on unrelated sequence (Table 4).
We believe that this accuracy criterion, which gives equal weight to the correctness of all sequence positions, is a natural measure of alignment quality. Downstream analyses, such as phylogenetic reconstruction and evolutionary constraint analysis, are increasingly using indels in addition to homologous characters for more accurate estimation (e.g., ,). Thus, it is important that alignments be as “evolutionarily correct” as possible , which is the purpose of the accuracy criterion.
FSA's strong performance under the accuracy criterion is due to techniques such as its iterative refinement as well as its explicit attempt to maximize the expected accuracy; programs which explicitly seek to maximize an objective function of the posterior probabilities of character alignment, such as ProbCons or Probalign, could instead seek to maximize the expected accuracy described here and, as a likely result, increase their robustness to non-homologous sequence. However, while we believe that the expected accuracy is a biologically-sensible objective function, it may not be appropriate if the user desires the most sensitive alignment. While FSA can produce the most-sensitive RNA alignments, other programs can produce more sensitive alignments of proteins and genomic sequence, albeit generally at the cost of a tendency to align non-homologous sequence (Table 4).
The second goal was to create alignments which are robust to evolutionary distances and different functional constraints on patterns of molecular evolution. FSA's unsupervised query-specific learning for parameter selection frees the user from unknown systematic biases implicitly introduced by using an alignment program whose parameters were trained on a dataset whose statistics may not reflect those of the sequences to be aligned. For example, it is traditionally challenging to align sequences with unusual base composition, but FSA can easily handle such alignments by automatically learning appropriate parameters. As indicated by our ablation analysis, query-specific learning further increases FSA's robustness to non-homologous sequences beyond that offered by the minimum-distance objective function alone.
We believe that FSA's unsupervised query-specific learning is the first time a multiple alignment program has been capable of dynamically learning a complete parameterization, wherein parameters can vary for each pair of sequences to be compared, on the fly. This learning method is related to the “pre-training” option in ProbCons, which permits users to learn different models for families of homologous sequences, but does not permit parameterizations to vary between sequence pairs. We also note that the MORPH program for pairwise alignment of sequences with cis-regulatory modules learns model transition parameters from data . While supervised training on curated data can give superior performance on test sets which are statistically-similar to the training data, the practical alignment problems encountered everyday by biologists do not fit into this rigid problem setup. Query-specific learning consistently gives reasonable performance.
The third and fourth goals, to develop a single, modular program which can address all typical alignment problems encountered by biologists, are naturally achieved within FSA's architecture. While almost all alignment programs are designed to either align many short sequences or a few long sequences, we have demonstrated that it is feasible to create a single program which can address both situations. This is made practical by FSA's modular nature, where the statistical model for pairwise comparisons, the anchoring scheme for finding homology between long sequences, and the sequence annealing procedure are entirely separate and can be individually modified and improved. For example, the parallelization of FSA was designed and developed with minimal changes to the rest of FSA's code base. Similarly, while FSA's basic anchoring relies only on exact matches from MUMmer, the anchoring scheme was easily extended to incorporate inexact matches from exonerate  and alignment constraints from Mercator . In fact, this flexibility permits FSA to incorporate almost any sources of potential homology information, from predicted transcription factor binding sites to entire gene models. One natural extension of FSA's approach is to models of RNA alignment which take structure into account. The program Stemloc-AMA  uses a model of the pairwise evolution of RNA secondary structure in conjunction with the sequence annealing algorithm to create accurate multiple alignments of structured RNAs. By using Stemloc-AMA's probabilistic model rather than a Pair HMM and taking advantage of techniques such as query-specific learning, FSA could sum over possible pairwise structural alignments in order to get better estimates of posterior probabilities of character alignment.
FSA is a statistical alignment program insofar as it uses an explicit statistical model of alignments and a probabilistic objective function for optimization, but as discussed in “Theoretical justification of distance-based alignment” (Text S1), it also is a distance-based approximation to the “phylogenetic alignment” models of alignments on trees –, –. While traditional phylogenetic alignment algorithms are currently too computationally-expensive to use on datasets of more than a few sequences, FSA's distance-based method allows biologists to use the sophisticated tools of statistical alignment algorithms on practical problems. Furthermore, while we have not addressed the phylogenetic aspects of FSA in detail in this paper, our methods can be adapted to incorporate a complete phylogenetic model (Text S1, “The mathematics of distance-based alignment”). However, we believe that FSA's current approach, which is agnostic to phylogeny, offers many practical advantages for common genomics analyses. For example, because FSA uses a sum-of-pairs objective function, there is no guide tree to potentially bias downstream phylogenetic reconstructions based on the alignment. Similarly, while most genomic alignment programs require a species tree to perform the alignment, our phylogeny-free approach will be avoid potential biases introduced by using a single species tree to align regions which may have undergone recombination.
(0.23 MB PDF)
FSA borrows heavily from previous work, both in its code base and its intellectual foundations.
Ideas. The distance-based approach to multiple alignment was proposed in ,. This included the idea of modifying the accuracy criterion suggested  and  to include gaps and the demonstration that the resulting modified expected accuracy could be used to control the expected sensitivity and specificity. Furthermore, , introduced the sequence annealing approach to building multiple alignments, via the description of alignments using partially ordered sets ,,. The graph-based approach to alignment was formalized by  and these results were used in the DIALIGN program .
The query-specific learning method for re-estimating alignment parameters on the fly was inspired by  and conversations with Joseph Bradley about query-specific structure learning of graphical models.
The iterative refinement technique is based on ideas in .
The FSA algorithm was parallelized using a modification of the approach in MW .
The coloring in the GUI according to posterior probabilities of alignment is inspired by the AU viewer of BAli-Phy .
FSA's query-specific learning uses code created with Gerton Lunter's HMMoC compiler for HMMs . The “aligner” example distributed with HMMoC, which implements a learning procedure for gap parameters, was an inspiration for FSA's learning strategies. FSA's banding code is taken directly from the “aligner” example.
FSA's sequence and alignment representation code was inspired by similar code in the dart library . Several Perl packages distributed with FSA are derived from packages of the same name in dart.
The authors have declared that no competing interests exist.
Ian Holmes was partially supported by NIH/NHGRI grant 1R01GM076705. Lior Pachter was partially supported by NSF CAREER award CCF 03-47992. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.