|Home | About | Journals | Submit | Contact Us | Français|
Summary: We present a suite of Unix shell programs for processing any number of phylogenetic trees of any size. They perform frequently-used tree operations without requiring user interaction. They also allow tree drawing as scalable vector graphics (SVG), suitable for high-quality presentations and further editing, and as ASCII graphics for command-line inspection. As an example we include an implementation of bootscanning, a procedure for finding recombination breakpoints in viral genomes.
Availability: C source code, Python bindings and executables for various platforms are available from http://cegg.unige.ch/newick_utils. The distribution includes a manual and example data. The package is distributed under the BSD License.
Phylogenetic trees are a fundamental component of evolutionary biology, and methods for computing them are an active area of research. Once computed, a tree may be further processed in various ways (Table 1). Small datasets consisting of a few trees of moderate size can be processed with interactive GUI programs. As datasets grow, however, interactivity becomes a burden and a source of errors, and it becomes impractical to process large datasets of hundreds of trees and/or very large trees without automation.
Automation is facilitated if the programs that constitute an analysis pipeline can easily communicate data with each other. One way of doing this in the Unix shell environment is to make them capable of reading from standard input and writing to standard output—such programs are called filters.
Although there are many automatable programs for computing trees [e.g. PhyML (Guindon and Gascuel, 2003), PHYLIP (Felsenstein, 1989)], programs for processing trees [e.g. TreeView (Page, 2002), iTOL (Letunic and Bork, 2007)] are typically interactive. Here, we present the Newick utilities, a set of automatable filters that implement the most frequent tree-processing operations.
The Newick utilities have the following features:
Bootscanning (Salminen, 1995) locates recombination breakpoints by identifying (locally) closest relatives of a reference sequence. An example implementation is as follows:
The distribution includes a Bash script, bootscan.sh, that performs the procedure with Muscle (Edgar, 2004) (Step 1), EMBOSS (Rice et al., 2000) (Step 2), PhyML (Step 3), GNUPlot (Step 6) and Newick utilities for Steps 4 and 5. This method was used to detect breakpoints in human enterovirus (Tapparel et al., 2007).
The Newick utilities add tree-processing capabilities to a shell user's toolkit. Since they have no hard-coded limits, they can handle large amounts of data; since they are non-interactive, they are easy to automate into pipelines, and since they are filters, they can easily work with other shell tools.
Tree processing may also be programmed using a specialized package [e.g. BioPerl (Stajich et al., 2002), APE (Paradis et al., 2004) or ETE (Huerta-Cepas et al., 2010)], but this implies knowledge of the package, and such programs tend to be slower and use more resources than their C equivalents. The difference is particularly apparent for large trees (Fig. 2).
To combine the advantages of a high-level, object-oriented language for the application logic with a C library for fast data manipulation, one can use the Newick utilities through Python's ctypes module. This allows one to code a rerooting program in 25 lines of Python while retaining good performance (Fig. 2). A detailed example is included in the documentation.
Some users will feel more at ease working in the shell or with shell scripts, using existing bioinformatics tools; others will prefer to code their own tools in a scripting language. The Newick utilities are designed to meet the requirements of both.
We wish to thank the members of the E.Z. group for feedback and beta testing.
Funding: The Infectigen Foundation; Swiss National Science Foundation (grant 3100A0-112588 to E.Z.).
Conflict of Interest: none declared.