|Home | About | Journals | Submit | Contact Us | Français|
The Clustal series of programs are widely used in molecular biology for the multiple alignment of both nucleic acid and protein sequences and for preparing phylogenetic trees. The popularity of the programs depends on a number of factors, including not only the accuracy of the results, but also the robustness, portability and user-friendliness of the programs. New features include NEXUS and FASTA format output, printing range numbers and faster tree calculation. Although, Clustal was originally developed to run on a local computer, numerous Web servers have been set up, notably at the EBI (European Bioinformatics Institute) (http://www.ebi.ac.uk/clustalw/).
One of the cornerstones of modern bioinformatics is the comparison or alignment of protein sequences. With the aid of multiple sequence alignments, biologists are able to study the sequence patterns conserved through evolution and the ancestral relationships between different organisms. Sequences can be aligned across their entire length (global alignment) or only in certain regions (local alignment). The most widely used programs for global multiple sequence alignment are from the Clustal series of programs. The first Clustal program was written by Des Higgins in 1988 (1) and was designed specifically to work efficiently on personal computers, which at that time, had feeble computing power by today's standards. It combined a memory-efficient dynamic programming algorithm (2) with the progressive alignment strategy developed by Feng and Doolittle (3) and Willie Taylor (4). The multiple alignment is built up progressively by a series of pairwise alignments, following the branching order in a guide tree. The initial pre-comparison used a rapid word-based alignment algorithm (5) and the guide tree was constructed using the UPGMA method (6). In 1992, a new release was made, called ClustalV (7,8), which incorporated profile alignments (alignments of existing alignments) and the facility to generate trees from the multiple alignment using the Neighbour-Joining (NJ) method (9). The third generation of the series, ClustalW (10), released in 1994, incorporated a number of improvements to the alignment algorithm, including sequence weighting, position-specific gap penalties and the automatic choice of a suitable residue comparison matrix at each stage in the multiple alignment. In addition, the approximate word search used for the pre-comparison step was replaced by a more sensitive dynamic programming algorithm, and the dendogram construction by UPGMA was replaced by NJ. The ClustalW program looked very similar to ClustalV, with simple text menus for interactive use and the possibility of running the program in batch mode by specifying the input file and the parameter options on the command line.
The rationale behind the development of the Clustal series has been to provide robust, portable programs that are capable of providing good, biologically accurate alignments within a reasonable time limit. A close collaboration between biologists and computer scientists is probably one of the main reasons for the success and continued widespread use of the Clustal programs. ClustalW has given rise to a number of developments, including the latest member of the family, ClustalX (11). Although the alignments produced are the same as those produced by the current release of ClustalW, the user can better evaluate alignments in ClustalX. The program displays the multiple alignment in a scrollable window and all parameters are available using pull-down menus. Within alignments, conserved columns are highlighted using a customizable colour scheme and quality analysis tools are available to highlight potentially misaligned regions. ClustalX is easy to install, is user-friendly and maintains the portability of the previous generations through the NCBI Vibrant toolkit (ftp://ncbi.nlm.nih.gov/toolbox/ncbitools/). Numerous options are provided, such as the realignment of selected sequences or selected blocks of the alignment and the possibility of building up difficult alignments piecemeal, making ClustalX an ideal tool for working interactively on alignments.
Parallel versions of ClustalW and ClustalX have been developed by SGI (http://www.sgi.com/industries/sciences/chembio/resources/clustalw/parallel_clustalw.html), which show increased speeds of up to 10× when running ClustalW/X on 16 CPUs and significantly reduce the time required for data analysis. A number of other significant developments have been based on the ClustalW program. For example, ClustalNet (12) is a Clustal alignment CORBA server and DbClustal (13) is a program for aligning sequences detected by database searches, which uses local alignment information to anchor the global multiple alignment. DbClustal is available on the Web at http://www-igbmc.u-strasbg.fr/BioInfo/DbClustal and forms part of the WU-Blast2 (Washington University BLAST version 2.0) server at the EBI (http://www.ebi.ac.uk/blast2/).
Numerous Web servers have exploited the command line interface of ClustalW, notably the EBI's ClustalWWW Web server, which currently runs between 2000–10000 jobs/day and the SRS server at the same site (http://srs.ebi.ac.uk/), which has ClustalW built in. The EBI ClustalWWW interface provides extensive help, ranging from an introduction to multiple alignments for new users to detailed descriptions of each alignment option. An important factor in obtaining a high-quality alignment is the ability to change the numerous alignment parameters available in ClustalW. While the default values of the parameters have been optimised to work in the majority of cases, they are not necessarily optimal for any given alignment problem. In the ClustalWWW interface, all the options are easily accessible on the top page.
Sequences can be entered by either pasting them or by uploading a file from the user's local computer. In both cases, the sequences should be in one of seven different formats (GCG, FASTA, EMBL, GenBank, PIR, NBRF, Phylip or SWISS-PROT). Although users are encouraged to submit large numbers of sequences, there is no guarantee that the alignment will be completed within the job run limits. Therefore, users who experience problems when attempting to make very large alignments are advised to download the software and run it locally. In addition to the input format, the user can also specify the preferred output format for the multiple sequence alignment. The options are currently ALN, GCG, PHYLIP, PIR and GDE. It is also possible to configure the browser to automatically load the results files from ClustalW into a suitable external application. A list of some example URLs for obtaining such applications for MS-Windows, Macintosh and UNIX systems is provided in Table Table1.1. Many commercial packages, e.g. the GCG package (Wisconsin Package, Genetics Computer Group, Madison, WI) and its X Window graphical user interface, SeqLab, can also accept ClustalW alignments.
The resulting multiple alignments can be displayed as either black and white or colour coded text. An example of the colour coded display is shown in Figure Figure1.1. The alignment consists of four oxidoreductase NAD binding domains. The colouring of residues takes place according to physicochemical criteria highlighting conserved positions in the sequences. A consensus line is also displayed below the alignment with the following symbols denoting the degree of conservation observed in each column: ‘*’ (identical residues in all sequences), ‘:’ (highly conserved column), ‘.’ (weakly conserved column).
A recent enhancement to the ClustalWWW interface has been the addition of an option that allows the user to upload the results of ClustalW into an alignment editor, using a Java Applet called JalView (http://www.compbio.dundee.ac.uk/). JalView is a fully featured multiple sequence alignment editor which allows the user to perform further alignment analysis. Special features include the definition of sequence sub-groups, links to the SRS server at the EBI and an option to output the alignment as a colour postscript file for printing purposes.
As well as constructing multiple alignments, ClustalWWW can also calculate trees from a multiple alignment using the NJ method, a widely used and relatively fast algorithm that clusters sequences by minimising the sum of branch lengths. The resulting evolutionary relationships can be viewed either as cladograms or phylograms, with the option to display branch lengths (or ‘tree graph distances’).
Both ClustalW and ClustalX are being actively maintained and updated. Recent enhancements have included the possibility of saving both alignments and phylogenetic trees in the NEXUS format (14) for compatibility with a number of phylogeny programs. Some work has also been done to optimise the alignment parameters, for example the Gonnet series of residue comparison matrices (15) is now used by default for protein sequence alignments. The latest version of the programs (version 1.83), which was released early this year, contained four main enhancements. The first modification is the facility to save the multiple alignment result as a FASTA format file, for compatibility with a number of other software packages. Another is to provide a percent identity matrix, which some users have asked for. A third new option is the possibility of saving the residue range in the output file when saving a user-specified range of the alignment. This is particularly useful when extracting a single domain from the alignment of multi-domain proteins. For example, in Figure Figure11 the NAD binding domain was extracted from a multiple alignment of the full-length oxidoreductase protein sequences and the residue range was automatically appended to the sequence names. Perhaps the most important enhancement in the latest version, however, is the incorporation of a faster implementation of the NJ algorithm used to construct guide trees during the multiple alignment process and also to construct phylogenetic trees based on the final alignment. Table Table22 contains examples of the time required by the NJ algorithm for the construction of a phylogenetic tree from alignments containing different numbers of sequences. The increased speeds obtained mean that it is now possible to construct phylogenetic trees for very large sets of sequences, which were previously only feasible on very large computer systems. As an example, Figure Figure22 shows a phylogenetic tree constructed from an alignment of more than 1100 ring finger domain sequences taken from the PFAM database (16) entry PF00097. The new NJ implementation was written by T. Koike. An independent acceleration of the NJ algorithm has been published and is freely available as the QuickTree program (17). Though coding details differ, both implementations addressed the major slow points of the original code and so will not produce combinatorial improvement.
We thank the many Clustal users who have provided feedback, bug reports and feature requests. T.K. would like to express his thanks to the Life Science Systems Division at Fujitsu Ltd for allowing him to participate in the bioinformatics research and development at the National Institute of Genetics. J.D.T. was supported by institute funds from the Institut National de la Santé et de la Recherche Médicale, the Centre National de la Recherche Scientifique, the Hôpital Universitaire de Strasbourg and the Fond National de la Science (GENOPOLE).