The webPRANK server (Figure ) supports the alignment of DNA, protein and codon sequences, input in FASTA format [8
], using evolutionary substitution models [9
]. It can translate, align as protein and back-translate protein-coding DNA sequences. In addition, webPRANK includes built-in support for two structure models [6
], designed for aligning genomic DNA sequences in which sites evolve under different substitution dynamics and show different patterns of alignment gaps. webPRANK accepts a user-defined phylogeny (Newick format) to guide its progressive alignment procedure, or can compute one from the unaligned input sequences. For each alignment task, the full combination of parameters and, if used, the structure model are provided in the output so that the analysis can easily be repeated or recreated with the stand-alone PRANK program.
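The translate, align and back-translate route for protein-coding DNA can be illustrated with a minimal sketch of the final step, in which the original codons are threaded onto a gapped protein row. The function name `back_translate` and the toy sequences are assumptions made for illustration; this is not webPRANK's own code.

```python
def back_translate(aligned_protein: str, dna: str, gap: str = "-") -> str:
    """Thread the codons of an ungapped coding sequence onto a gapped protein row."""
    n_residues = sum(1 for c in aligned_protein if c != gap)
    if len(dna) != 3 * n_residues:
        raise ValueError("DNA length must be three times the number of residues")
    # One codon is consumed per residue; each protein gap becomes three gap characters.
    codons = (dna[i:i + 3] for i in range(0, len(dna), 3))
    return "".join(gap * 3 if c == gap else next(codons) for c in aligned_protein)


# Hypothetical toy input: the protein row "M-KL" aligned elsewhere, and its coding DNA.
aligned = back_translate("M-KL", "ATGAAACTG")
print(aligned)
```

Because gaps are only ever inserted in multiples of three, the reading frame of the back-translated DNA is preserved.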
The size of alignment tasks is limited to 4 GB of memory and 24 hours of run time. The size and type of data as well as the parameter settings affect the computation time. The PRANK algorithm has time complexity $O(a^2 n l^2)$ where $a$, $n$ and $l$ are the size of the character alphabet (four for DNA; 20 for amino acids; 61 for codons) and the number and length of sequences, respectively. (More precisely, $l$ is the length of the sub-alignments to be aligned and, for large $n$, can be much longer than any of the extant or inferred ancestral sequences.) The alignment of 30 DNA sequences of ~1000 nucleotides typically takes 1-2.5 minutes depending on the options chosen; that of 100 DNA sequences of similar length 3.5-20 minutes. The translation of DNA sequences to amino acids or codons decreases sequence lengths but increases alphabet size, requiring computation times similar to (for amino acids) or significantly longer than (for codons) those for untranslated DNA sequences. By default, webPRANK uses alignment anchoring to accelerate analyses of long DNA sequences.
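The trade-off between shorter sequences and a larger alphabet can be made concrete with a rough calculation. The sketch below reads the stated complexity as a cost proportional to $a^2 n l^2$ and compares the three data types for the same coding region. The helper `relative_cost` and the 999-nucleotide length are assumptions chosen for illustration; constant factors, anchoring and guide-tree computation are ignored, so the ratios indicate only the direction of the effect, not actual run times.

```python
def relative_cost(alphabet_size: int, length: float, n_sequences: int = 1) -> float:
    """Leading-order cost a^2 * n * l^2, up to an unknown constant factor."""
    return alphabet_size ** 2 * n_sequences * length ** 2


coding_length = 999  # nucleotides; hypothetical example value

dna = relative_cost(4, coding_length)
protein = relative_cost(20, coding_length / 3)  # one residue per three nucleotides
codon = relative_cost(61, coding_length / 3)    # one codon per three nucleotides

print(f"protein vs DNA: {protein / dna:.1f}x")
print(f"codon vs DNA:   {codon / dna:.1f}x")
```

Under these assumptions the codon alphabet costs roughly an order of magnitude more than untranslated DNA, while the amino-acid alphabet stays within a small factor of it.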
A significant proportion of the longer run times is spent computing the guide tree; if a user-defined phylogeny is provided, even larger data sets can be aligned in a reasonable time. With a pre-defined guide tree, the alignment of 1000 simulated DNA sequences of ~1000 nucleotides could be performed in 35 minutes; however, the alignment matrix was 7247 columns wide (the correct width was 7235 columns) and so sparse that it was largely unreadable (see Additional file 1
). In practice webPRANK is able to align and display (see below) almost any set of sequences for which subsequent alignment browsing is feasible, and many realistic sets for which it is not.
The webPRANK-generated alignments can be downloaded in several alignment formats widely used in evolutionary analyses. The webPRANK server supports its own HSAML format, as well as FASTA [8
], PHYLIP (interleaved and sequential) [12
], PAML [13
] and NEXUS [14
] formats. The XML-based HSAML format is the only one we know of that can hold the full information from the alignment process; it allows advanced analysis and post-processing of the results with the integrated webPRANK browser or with the stand-alone PRANKSTER alignment browser. The format can also be easily parsed using external software, for example the XML library for the R statistics package [15
] or the libXML module for the Perl programming language, allowing for complex downstream analyses of the alignment data. Of the classical alignment formats, the NEXUS format also allows incorporation of some additional information in the alignment files: webPRANK extends alignments exported in NEXUS format to include the alignment guide tree and the column-wise minimum posterior reliability scores or the excluded alignment sites (see below) using the appropriate commands in the 'Trees', 'Assumptions' and 'Paup' blocks, respectively.
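Any general-purpose XML parser can serve the same role as the R and Perl libraries mentioned above. As a minimal sketch, the Python standard-library module `xml.etree.ElementTree` is used below to list which elements a file contains, which is a reasonable first step before writing format-specific code. The inline document is a made-up stand-in: its element and attribute names are assumptions and do not reproduce the real HSAML schema.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up stand-in document; tag and attribute names are NOT the real HSAML schema.
example = """<alignment>
  <tree>(A,B);</tree>
  <record id="A"><sequence>AC-T</sequence></record>
  <record id="B"><sequence>ACGT</sequence></record>
</alignment>"""


def summarise(xml_text: str) -> Counter:
    """Count how often each element tag occurs anywhere in the document."""
    root = ET.fromstring(xml_text)
    return Counter(element.tag for element in root.iter())


counts = summarise(example)
for tag, n in sorted(counts.items()):
    print(tag, n)
```

Running the same summary on a downloaded file (read with `open(path).read()`) shows the actual element names, which can then be targeted with `root.iter("name")`.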
Before downloading the results, the sequence alignments can be visualised and post-processed using a powerful, integrated alignment browser (Figure ). A distinctive feature of the webPRANK browser is the display of an interactive cladogram, representing the alignment guide tree, next to the sequences. The tree has two purposes. First, we believe that evolutionary sequence alignment should always be studied in the context of the tree relating the sequences. The fact that the guide tree used for the alignment may not be fully correct does not change this, as the tree has nevertheless been used for the alignment and the solution depends on it. Rather than hiding the tree, showing it alongside the alignment helps to identify possible errors and suggest actions to correct them. Second, the PRANK alignments contain additional information associated with the tree nodes and the easiest way to represent and allow browsing this information is in the context of the tree.
The PRANK algorithm can compute column-wise reliability scores for the alignment and, when a structure model is used, provide posterior probabilities for the alignment sites evolving under different evolutionary processes [6
]. The reliability and probability values are generated by the pair-wise alignments at the different levels of the progressive alignment and are thus associated with the internal nodes of the tree. The information is displayed below the alignment as probability tracks (Figure ). The tracks for different stages of the alignment can be selected by clicking the corresponding nodes in the tree or using the drop-down menu.
The PRANK alignment reliability scores provide an objective criterion for removing less reliably aligned columns from the data, and the webPRANK browser includes advanced functionality to select sets of alignment sites using these scores. The webPRANK filtering is based on the track currently displayed; repeated steps of filtering are accepted and, for convenience, an additional track showing the minimum reliability score across all pair-wise alignments is provided. The current selection of alignment sites is indicated in the browser window using different colouring (Figure ) and the subset of sites currently selected can be exported in various alignment formats for downstream analyses. Unlike the other export formats, which permanently remove unreliable columns from the data, files saved in NEXUS format keep the full alignment data and include additional commands that exclude a set of sites from the downstream analysis.
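The filtering logic described above can be sketched in a few lines. The example assumes that each track holds one score per alignment column, that higher scores mean more reliable columns, and that a column is kept when its minimum score across tracks reaches a user-chosen threshold. The function names, the toy scores and the 0.8 threshold are all hypothetical; this is not how webPRANK itself is implemented.

```python
def minimum_track(tracks):
    """Column-wise minimum across the per-node reliability tracks."""
    return [min(scores) for scores in zip(*tracks)]


def reliable_columns(track, threshold):
    """Zero-based indices of columns whose score reaches the threshold."""
    return [i for i, score in enumerate(track) if score >= threshold]


def filter_alignment(rows, keep):
    """Keep only the selected columns in every aligned sequence."""
    return {name: "".join(seq[i] for i in keep) for name, seq in rows.items()}


# Hypothetical data: two internal-node tracks for a four-column alignment.
tracks = [[0.99, 0.40, 0.95, 0.90],
          [0.97, 0.85, 0.30, 0.92]]
rows = {"seqA": "AC-T", "seqB": "ACGT", "seqC": "A-GT"}

keep = reliable_columns(minimum_track(tracks), threshold=0.8)
filtered = filter_alignment(rows, keep)
print(keep, filtered)
```

Filtering on the minimum track is the most conservative choice, since a column is dropped as soon as any one pair-wise alignment scores it below the threshold.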
As a part of the alignment process, the PRANK algorithm reconstructs the sequence history with inferred ancestral nodes. The inferred ancestral sequences, with phylogenetically realistic patterns of character presence vs. absence, can be displayed in the alignment browser or downloaded for further analyses. Ancestral sequences can also be inferred from existing alignments. One should note, however, that non-phylogeny-aware alignment algorithms tend to infer excess deletions [2
] and inference from systematically incorrect alignments typically produces unrealistically long ancestral sequences. In addition to ancestral sequences, structure predictions and alignment reliability scores can also be computed for existing alignments (Figure ). This allows application of some of the advanced features of the PRANK alignment package to other alignments, e.g. for objectively removing noise from the alignment data.
The webPRANK alignment browser is not limited to the display of de novo alignments: it can be used for visualisation and browsing of any FASTA- or HSAML-formatted alignment, although the full functionality of the browser requires the richer HSAML format. By storing webPRANK-generated alignments in this format, the user can later re-load the results to the webPRANK browser for visualisation and post-processing, and thus perform all alignment-related activity for small sequence analysis projects using a standard web browser only.