An increasing body of research is based on the classification and evaluation of the rate of DNA evolution in coding regions. In many cases nucleotide substitutions are classified according to their impact on the encoded protein, and the resulting classification can be used for a variety of analyses. Classification of synonymous (KS) and non-synonymous (KA) substitutions can be used to detect the presence or absence of selection (1), and the classification of substitutions according to codon position can be used with sophisticated evolutionary models to better reconstruct phylogenies. In both of these cases the programs that perform the analysis, such as PAML (2), require a codon alignment as input.
Owing to the degeneracy of the genetic code, where a single amino acid can be encoded by multiple codons, it is often preferable to align protein sequences rather than the underlying coding DNA, as this increases sensitivity at longer evolutionary distances and prevents the introduction of frame shifts into the alignment. Thus, constructing a protein alignment first and then reverse translating it into a codon-based DNA alignment is generally the preferred approach and provides reliable alignments for evolutionary analyses. In the ideal case where the protein and the corresponding DNA match perfectly, the conversion from a protein alignment into the corresponding codon alignment can be achieved by replacing each amino acid residue with the three nucleotides that encode it and each gap with three gap characters, a procedure which is implemented in several tools, e.g. aa_to_dna_aln in the Bioperl toolkit (3), RevTrans (4), transAlign (5) and aa2dna (http://www.bio.psu.edu/People/Faculty/Nei/Lab/aa2dna.zip). However, it often happens that the corresponding DNA sequence has mismatches when compared with its theoretical protein sequence, or contains untranslated regions (UTRs) and polyA tails. In such instances, the conversion process is much more complicated. Moreover, the analysis of pseudogenes, which are an interesting subject of molecular evolution studies, requires dealing with frame shifts and in-frame stop codons. These situations, which are rather frequent in large-scale analyses of sequenced genomes, cannot be handled by the programs mentioned above and therefore require additional solutions.
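As an illustration of the ideal case described above, the following minimal Python sketch threads an ungapped coding sequence onto one row of a gapped protein alignment. The function name `protein_to_codon_alignment` and the gap character `-` are assumptions made for this example; the sketch is not the implementation used by any of the tools cited above, and it deliberately refuses input in which the DNA length does not equal three times the number of residues (e.g. sequences with UTRs, polyA tails or frame shifts).

```python
def protein_to_codon_alignment(aligned_protein: str, cds: str, gap: str = "-") -> str:
    """Convert one gapped protein row into the corresponding gapped codon row.

    Assumes the ideal case only: `cds` is an ungapped coding sequence whose
    length is exactly three times the number of residues in `aligned_protein`.
    The codons are NOT checked against the residues they are paired with.
    """
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(cds) != 3 * n_residues:
        raise ValueError(
            f"expected {3 * n_residues} nucleotides for {n_residues} residues, "
            f"got {len(cds)}"
        )

    codons = []
    pos = 0  # index of the next unused nucleotide in cds
    for aa in aligned_protein:
        if aa == gap:
            # A gap in the protein alignment becomes a three-column gap.
            codons.append(gap * 3)
        else:
            # A residue is replaced by the next codon of the coding sequence.
            codons.append(cds[pos:pos + 3])
            pos += 3
    return "".join(codons)


if __name__ == "__main__":
    print(protein_to_codon_alignment("MK-L", "ATGAAACTG"))  # ATGAAA---CTG
```

Because the sketch only counts residues, any mismatch, extra flanking sequence or frame shift either raises an error or silently shifts the codons, which is exactly the class of input that requires the additional handling discussed above.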
Here we describe a web server, PAL2NAL, which converts a protein sequence alignment into the corresponding codon alignment, despite the presence of mismatches between the protein and the DNA sequences, UTRs and polyA tails in the input DNA sequences, and frame shifts and in-frame stop codons in the input alignment. Another useful feature of this server is that it is possible to obtain codon alignments for specific regions of interest, such as functional domains or particular exons, by selecting the positions in the input protein sequence alignment.
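The region selection mentioned above relies on the fact that column i of a protein alignment corresponds to columns 3i-2 to 3i of the codon alignment. The short Python sketch below illustrates that mapping only; the function name `codon_columns_for_region` and the 1-based, inclusive coordinate convention are assumptions for this example and do not describe how the server itself accepts or processes the selected positions.

```python
from typing import List


def codon_columns_for_region(codon_alignment: List[str], start: int, end: int) -> List[str]:
    """Extract the codon columns for protein alignment columns start..end.

    `start` and `end` are 1-based and inclusive (an assumed convention).
    Every row of `codon_alignment` must have the same length, a multiple of 3.
    """
    if start < 1 or end < start:
        raise ValueError("start must be >= 1 and end must be >= start")
    width = len(codon_alignment[0]) if codon_alignment else 0
    if any(len(row) != width for row in codon_alignment) or width % 3 != 0:
        raise ValueError("rows must have equal length that is a multiple of 3")
    if 3 * end > width:
        raise ValueError("region extends past the end of the alignment")

    # Protein column i (1-based) spans nucleotide indices 3*(i-1) .. 3*i - 1.
    return [row[3 * (start - 1):3 * end] for row in codon_alignment]


if __name__ == "__main__":
    alignment = ["ATGAAA---CTG", "ATG---GGGCTG"]
    print(codon_columns_for_region(alignment, 2, 3))  # ['AAA---', '---GGG']
```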