The growing availability of complete genomic sequences from diverse species has brought about the need to scale up phylogenomic analyses, including the reconstruction of large collections of phylogenetic trees. Here, we present the third version of PhylomeDB (http://phylomeDB.org), a public database for genome-wide collections of gene phylogenies (phylomes). Currently, PhylomeDB is the largest phylogenetic repository and hosts 17 phylomes, comprising 416093 trees and 165840 alignments. It is also a major source for phylogeny-based orthology and paralogy predictions, covering about 5 million proteins in 717 fully sequenced genomes. For each protein-coding gene in a seed genome, the database provides original and processed alignments, phylogenetic trees derived from various methods and phylogeny-based predictions of orthology and paralogy relationships. The new version of PhylomeDB has been extended with novel data access and visualization features, including the possibility of programmatic access. Available seed species include model organisms such as human, yeast, Escherichia coli and Arabidopsis thaliana, as well as alternative model species such as the human pathogen Candida albicans and the pea aphid Acyrtosiphon pisum. Finally, PhylomeDB is currently being used by several genome sequencing projects that couple the genome annotation process with the reconstruction of the corresponding phylome, a strategy that provides relevant evolutionary insights.
Recent developments in sequencing technologies have radically changed the way in which many biologists perform their research. Although genome sequencing used to be a costly and challenging analysis reserved for a handful of model species, it is now a technique that can be applied at a reasonable price and effort to significantly larger sets of organisms. As a result, the rate at which new genome sequences are deposited in public databases is growing. The comparison of genomes from diverse species within a common evolutionary framework, i.e. phylogenomics (1), constitutes a powerful tool to address many relevant questions including, among many others, the reconstruction of evolutionary relationships across different species (2), the prediction of function of uncharacterized proteins (3) and the establishment of orthology and paralogy relationships among homologous genes (4). Such comparative studies are often accompanied by large-scale phylogenetic analyses that comprise the reconstruction of evolutionary relationships of a large number of gene families, an approach that has been enabled by recent developments in hardware and phylogenetic algorithms (5). Currently, a number of public repositories for automatically generated collections of phylogenetic trees exist (6–10). Such collections vary in size and phylogenetic scope depending on their specific focus. Most existing approaches rely on a clustering phase to define families of homologous sequences to which methods for phylogenetic reconstruction are applied. Alternatively, a phylogenetic reconstruction pipeline can be applied iteratively to every gene in a genome so that each sequence is used as a seed in the process of phylogenetic reconstruction. The resulting collection of trees, representing the full complement of evolutionary histories of all genes encoded in a given genome, has been dubbed a ‘phylome’ (11). 
This strategy, which resembles more closely the gene-centered approach in classical phylogenetics, is computationally more costly than a family-based approach but it is mostly free of the difficulties of properly establishing family boundaries in clustering approaches. Moreover, a gene-centric approach ensures maximum coverage of the seed genome and, therefore, is more appropriate when the focus is set on a particular organism. The analysis of phylomes has proven to be very powerful in addressing a variety of questions, including the reconstruction of ancestral metabolisms (12), the evaluation of alternative species phylogenies (13,14), the identification of horizontal gene transfers (15), the inference of functional implications of massive waves of gene duplications (6,16) and the detection of orthology and paralogy relationships (17). To facilitate the exploitation of such genome-wide collections of phylogenetic trees, alignments and phylogeny-based orthology and paralogy predictions, we developed phylomeDB (7). Here we describe the new features of the current version of PhylomeDB, including recent improvements in the phylogenetic pipeline used to reconstruct new phylomes. In addition, we highlight the potential use of PhylomeDB resources for the annotation of newly sequenced genomes.
Phylomes stored in PhylomeDB are reconstructed using slight variations of an automated phylogenetic pipeline. This pipeline, first described in (6), is under constant improvement. In addition, each phylome is reconstructed under a specific set of parameters, which mainly depend on the phylogenetic scope of the sequences included. The specific details of how each phylome was generated, as well as the source of the raw data, are provided in the corresponding phylome description page. In brief, the pipeline proceeds as follows. For each protein-coding gene in the seed genome, the pipeline starts with a homology search in a database containing proteins encoded in the seed genome and in a set of selected fully sequenced genomes. This set defines the taxonomic scope of the phylome and is adjusted to optimally address specific evolutionary questions. Homology searches are performed using the Smith–Waterman algorithm (18) and the hits are filtered according to specific e-value and overlap cut-offs. For genes encoding multiple splice variants, the longest isoform is used. Subsequently, sets of homologous sequences are aligned. Several changes have been introduced in this step of the pipeline. First, instead of a single multiple sequence alignment method (e.g. MUSCLE), homologous sequences are aligned using three different programs: MUSCLE v3.7 (19), MAFFT v6.712b (20) and DIALIGN-TX (21). Moreover, alignments are performed in forward and reverse direction, i.e. using the Heads or Tails approach (22). The six resulting alignments are then combined with M-Coffee (23). This allows alignments to be trimmed not only based on their gap content but also on the pairing consistency across different alignments, using the program trimAl v1.2 (24). The resulting processed alignment is used to reconstruct phylogenetic trees using Neighbor Joining (NJ) and Maximum Likelihood (ML) methods. In the case of ML reconstruction, an additional improvement has been introduced in the model selection step. 
Instead of evaluating the likelihood of a tree under each evaluated model by means of a full ML reconstruction, the improved pipeline evaluates the likelihood on the topology obtained by NJ, allowing branch-length optimization. This allows exploring more models, from which only a selection [the best ranking ones according to the AIC criterion (25)] are used for a full ML approach. Confirming earlier observations (6,26), this procedure was found to provide the same result as a full ML approach in >98% of the cases (results not shown).
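The model selection step described above can be illustrated with a minimal Python sketch. It assumes that the log-likelihood of each candidate model has already been computed on the NJ topology with branch-length optimization; the model names, likelihood values, parameter counts and the number of models retained are hypothetical and serve only to show how an AIC ranking would pick the models passed on to the full ML step.

```python
def aic(log_likelihood, n_free_params):
    """Akaike Information Criterion: AIC = 2k - 2 lnL (lower is better)."""
    return 2 * n_free_params - 2 * log_likelihood


def rank_models(fits, keep=2):
    """Return the names of the `keep` best models according to AIC.

    `fits` maps a model name to a tuple (lnL, number of free parameters),
    where lnL was evaluated on a fixed (e.g. NJ) topology.
    """
    ranked = sorted(fits.items(), key=lambda item: aic(*item[1]))
    return [name for name, _ in ranked[:keep]]


# Hypothetical likelihoods obtained on the NJ topology (illustrative values).
nj_topology_fits = {
    "JTT": (-10250.4, 45),
    "WAG": (-10231.9, 45),
    "LG": (-10198.3, 45),
    "LG+G": (-10102.7, 46),
}

# Only these models would then be used for a full ML reconstruction.
best_models = rank_models(nj_topology_fits, keep=2)
print(best_models)
```

Ranking on a fixed topology is cheap because no tree search is involved, which is what makes it feasible to screen a larger set of models before the costly full ML runs.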
Although phylomes were initially reconstructed only for publicly available genomes, we soon realized their potential for ongoing genome annotation projects. In particular, the availability of a complete collection of phylogenies for the predicted gene set can be used to reliably assign orthology and paralogy relationships in related species, predict putative functions and detect large family expansions. Moreover, a phylome readily provides a detailed overview of the evolution of the targeted genome, and can be used to address particular questions of interest, such as the phylogenetic position of the seed species, the subset of genes under recent positive selection or the incidence of horizontal gene transfer. Such a strategy was applied to the annotation of the pea aphid Acyrtosiphon pisum genome (27), which, to our knowledge, constitutes the first example of a genome whose annotation is based on extensive phylogenetic analyses. The analysis of the A. pisum phylome (16) allowed the computation of the full catalogue of phylogeny-based orthology and paralogy relationships with other arthropod genomes and the assignment of putative functions based on annotated Drosophila one-to-one orthologs. Moreover, it enabled the discovery of a set of interesting gene expansions related to the particular diet and developmental biology of this insect. In another project, a phylome was reconstructed for two strains of the recently sequenced halophilic bacterium Salinibacter ruber (15). The aim here was to understand the process of sympatric speciation by detecting differences in patterns of horizontal gene transfer, duplication and positive selection between two strains that were isolated simultaneously from the same environmental sample. Other ongoing sequencing projects are currently taking advantage of the PhylomeDB resources to address a variety of questions in different organisms. 
In this context, a password-protected pre-release section has been created to allow restricted access prior to publication (see below). We encourage interested groups and consortia to contact us for further details.
The current version of PhylomeDB uses ETE v2.0 (28) to visualize phylogenetic trees, and Jalview Lite v2.4 (29) and trimAl v1.2 (24) to visualize raw and processed alignments. The visualization of trees and alignments is interactive, and users can choose among various display options, such as collapsing parts of a tree or enabling/disabling the display of alternative IDs. Moreover, text strings (e.g. a species or a protein name) can be searched within phylogenetic trees. In addition, further information for each sequence, including hyperlinks to other databases, can be accessed by clicking on the corresponding node.
A new unique ID system has been developed for this version of PhylomeDB. This solves previous issues related to the inclusion of newer versions of existing proteomes and makes the information on the sequence species source more intuitive. In addition, it ensures that the same gene will receive the same ID in subsequent genome versions, unless the sequence has been updated. All PhylomeDB IDs (e.g. Phy0008C1X_HUMAN) start with the code ‘Phy’, followed by an alphanumerical string of length 7, an underscore symbol ‘_’, and an alphanumeric species code. This species code corresponds to that assigned by UniProtKB in the ‘controlled vocabulary of species’ (30) or, when no code is present in UniProt, to the NCBI taxonomic ID (http://www.ncbi.nlm.nih.gov/Taxonomy/). For consistency, older PhylomeDB IDs are still associated to their corresponding sequences and are still searchable. Finally, PhylomeDB IDs are regularly mapped to IDs from other databases such as Ensembl (31) and UniProt (30) and corresponding conversion tables are provided in the download section.
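The ID layout described above lends itself to a simple format check. The following Python sketch is not part of PhylomeDB; the function name and the regular expression are illustrative and encode only the rules stated in the text (the prefix ‘Phy’, seven alphanumeric characters, an underscore and an alphanumeric species code, which may be a UniProtKB mnemonic or an NCBI taxonomic ID).

```python
import re

# 'Phy' + 7 alphanumeric characters + '_' + alphanumeric species code.
PHYLOMEDB_ID = re.compile(r"Phy([0-9A-Za-z]{7})_([0-9A-Za-z]+)")


def parse_phylomedb_id(identifier):
    """Split a PhylomeDB ID into (sequence code, species code).

    Returns None if the string does not follow the described layout.
    """
    match = PHYLOMEDB_ID.fullmatch(identifier)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(parse_phylomedb_id("Phy0008C1X_HUMAN"))  # UniProtKB species code
print(parse_phylomedb_id("Phy0008C1X_9606"))   # NCBI taxonomic ID as species code
print(parse_phylomedb_id("P12345"))            # not a PhylomeDB ID
```

Such a check only validates the shape of an identifier; whether the ID actually exists in the database still has to be confirmed against PhylomeDB itself or the conversion tables.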
External linkage to PhylomeDB has been improved, and PhylomeDB is now linked from many sequence, process and organism reference databases, including UniProt (30), EnsemblCompara (8), Saccharomyces Genome Database (SGD) (32), AphidBase (33), DeathBase (34) and PeroxisomeDB (35). Links to specific entries in PhylomeDB are easily customizable and detailed instructions are provided in the help section of PhylomeDB. Thus, PhylomeDB can also be regarded as a complementary resource providing evolutionary information for sequences maintained in other databases, and we encourage administrators of other databases to consider this possibility.
A sequence entry is now associated not only with the trees in which that sequence is used as a seed but also to all other trees that contain that sequence. These so-called ‘collateral’ trees may include phylogenies from the same phylome (e.g. trees in which a paralogous protein was used as a seed), but also trees from other phylomes that contain that sequence. This provides users with additional information on the evolution of the sequence of interest and may serve to evaluate whether a given scenario (e.g. an orthology relationship) is supported also by alternative trees. Indeed, partially overlapping phylogenetic trees from PhylomeDB and other phylogenetic databases are explored by the MetaPhOrs database (http://orthology.phylomedb.org) to provide consistency-based confidence scores to phylogeny-based orthology and paralogy predictions (36).
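The idea of checking a prediction against several overlapping trees can be sketched as follows. The exact scoring scheme used by MetaPhOrs is not described here, so this Python example simply assumes that the confidence of an orthology call is the fraction of trees containing both sequences that support it; the function name and the input data are hypothetical.

```python
def consistency_score(tree_calls):
    """Fraction of trees that support an orthology relationship.

    `tree_calls` holds one boolean per tree that contains both sequences
    (seed tree plus collateral trees): True if that tree places the pair
    as orthologs, False if it places them as paralogs.
    Returns None when no tree contains both sequences.
    """
    if not tree_calls:
        return None
    return sum(tree_calls) / len(tree_calls)


# Hypothetical calls for one protein pair across four overlapping trees.
calls = [True, True, False, True]
print(consistency_score(calls))
```

A pair supported by most of its trees would receive a score close to 1, whereas conflicting trees would pull the score down and flag the prediction for closer inspection.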
An FTP-based download section has been developed to provide easy access to files containing all alignments, trees and orthology and paralogy predictions associated with every public phylome. ID conversion tables to UniProt, Ensembl and other major sequence repositories are also provided. Alternatively, for each PhylomeDB entry, a compressed folder containing all information associated with the corresponding sequence can be downloaded. Finally, an Application Programming Interface (API) for accessing PhylomeDB is available through the ETE software (28), a Python programming toolkit that assists in the automated manipulation, analysis and visualization of hierarchical trees. The PhylomeDB interface uses ETE to handle tree manipulation, to provide the interactive visualization and to operate with the main MySQL database. By using the ETE libraries implemented in the API, users can connect to PhylomeDB and search for pre-computed gene phylogenies, download complete phylomes or obtain the orthology and paralogy predictions provided by the database. This allows programmatic access to PhylomeDB, as well as the automation of any downstream analysis. An added advantage of using the PhylomeDB API is that phylogenetic trees can be directly downloaded as ETE tree objects, thus enabling visualization or complex tree exploration within the same scripts. Interactive web tree visualization is also scheduled to be released soon as part of the ETE package.
A log-in protected private section has been created to store pre-release versions of phylomes, so that they can be used online before publication. This option is currently used mainly by genome sequencing consortia that generate phylomes within their annotation pipeline. PhylomeDB currently holds 13 private phylomes that will be released during the following months.
With 17 public phylomes comprising 416093 phylogenetic trees, the new version of PhylomeDB constitutes one of the major and most comprehensive public repositories of phylogenetic trees (compared to 122002 trees in Ensembl, including Ensembl Compara v59 and Ensembl Genomes release 5). Regular updates of model-species based phylomes are planned when new genome releases include significant improvements in terms of sequence quality and coverage. Contrary to most other databases, PhylomeDB follows a gene-centric approach and stores phylomes focused on selected seed genomes. This ensures maximum coverage of the targeted genomes and allows the taxonomic scope to be specifically designed to address different questions. A particular application of PhylomeDB is to provide support for large-scale phylogenetic analyses used in genome annotation projects. This has provided added value in terms of evolutionary insights and allows going beyond standard BLAST-based automatic annotation of genes. Finally, one of the main aims of PhylomeDB is to provide phylogeny-based orthology and paralogy predictions, covering about 5 million proteins in 717 fully sequenced genomes.
Spanish Ministry of Science (GEN2006-27784-E/PAT, BFU2009-09168). Funding for open access charge: CRG, Spanish Ministry of Science and Innovation.
Conflict of interest statement. None declared.