Recent years have seen the emergence of the field of phyloinformatics
]. At a practical level this is research where much of the organizational challenge lies in managing data
, including character state matrices or multiple sequence alignments, phylogenetic trees and the relationships between these, and metadata
, including cross references to molecular sequence databases, taxonomies, character state descriptions, biodiversity data, and literature references. At the nexus of the relationships between character state data and phylogenies lies the operational taxonomic unit (OTU), i.e. the biological entity on which observations are made (e.g. by measuring morphological traits or by sequencing DNA) and which is placed as a terminal node in a phylogeny.
In the course of a phyloinformatic analysis, data and metadata are collected or generated, transformed, filtered, analyzed and summarized before they can be interpreted to answer meaningful biological questions. Based on first principles of good science such steps should be reproducible; and, in practice, analysis steps often need to be redone by the researcher multiple times [2
] and are too error-prone, tedious and time-consuming to perform manually. Hence, environments that allow such analyses to be scripted programmatically can greatly improve the efficiency and reproducibility of phyloinformatics.
Some of these facilities are provided by DendroPy [3
], ETE [4
] and BioPython [5
] for the Python programming language, by the Ape package [6
] for the R environment, and by BioPerl [7
] for the Perl programming language. However, some of these (DendroPy, Ape), while strong on tree shape simulation and analysis, do not integrate easily in workflows that include external software for sequence alignment and phylogenetic inference or database or web service access, while others (BioPython, BioPerl, ETE) are strong in that respect but are lacking in tree shape simulation and analysis. In addition, none of these toolkits have a facility for managing the syntax and semantics of metadata. This can cause confusion when integrating and sharing metadata from disparate sources. For example, if an OTU is annotated with a taxonomic identifier, where (e.g. which database) does the identifier come from? What is the relation between the OTU and the database record (e.g. is the relationship established by a simple string match or something else)?
The Bio::Phylo toolkit addresses these issues, allowing researchers to read and write previously unsupported data formats, generate and transform data in a variety of ways, compute heretofore unimplemented topological indices, apply heretofore unavailable sampling and resampling algorithms and visualize the results in publication-ready graphics, while allowing phylogenetic knowledge to be managed, represented and shared in ways that preserve its meaning and its relation to metadata, regardless of its origin or context.
Due to its implementation in the Perl programming language and its compatibility with the BioPerl [7
] toolkit, the operations supplied by Bio::Phylo are easily integrated in larger analysis workflows that take advantage of the operations supplied by BioPerl and that interface with command-line executables and web services, e.g. for database access or for computationally intensive analysis steps. However, BioPerl is very large and for many users difficult to install, whereas Bio::Phylo has no required dependencies. This makes deployment much easier for users who only require Bio::Phylo's functionality-orientated towards phyloinformatics per se-but not BioPerl's.