Representative members of the Popeye domain family, the NOX family NADPH oxidases and the eukaryotic V-type ATP synthase beta subunit subfamily were obtained from the UniProtKB/Swiss-Prot release 57.13 and UniProtKB releases 15.14 and 2010_06 according to their annotation. Further homologs were predicted by similarity searches with BlastP on the Expasy Proteomics Server [16
]. The preliminary datasets were complemented by data from UniProtKB/TrEMBL, Ensembl release 57, BeeBase (http://genomes.arc.georgetown.edu/drupal/beebase/
) and WormBase (http://www.wormbase.org/
). A list of sequence identifiers, species names and database identifiers is given in Supplementary Table S1
; sequences are available in the Supplementary Datasets S1–S3
The methods used are summarized below, and a detailed description of each analysis is provided along with the individual phylogenetic trees in Supplementary Figures S4–S6
. The sequence analysis was performed on local computers of the Swiss Institute of Bioinformatics, at phylogeny.fr [17
] as well as at the high performance computing center Vital-IT (http://www.vital-it.ch/
). The sequences of the three data sets were aligned using MUSCLE [19
]. Sequences with gaps within conserved regions were removed and short isoforms were replaced by appropriate ones if available. Gene models were corrected if possible, or otherwise excluded. A multiple sequence alignment (MSA) was constructed with ProbCons [19
], and data models were built through gap removal, Gblocks (stringent and less stringent parameter settings), or manual selection of conserved regions. Phylogenies were inferred using maximum likelihood (ML), Bayesian Markov-Chain Monte Carlo (MCMC) and neighbor joining (NJ). ML-trees were calculated with PhyML [20
] using the amino acid replacement models Jones, Taylor and Thornton (JTT) [21
] or Whelan and Goldman (WAG) [22
], accounting for rate heterogeneity across sites using an eight-category discrete gamma distribution and estimating the shape parameter, and in some analyses the number of invariant sites from the data. Branch support values were calculated with the approximate Likelihood Ratio Test (aLRT) based on a Shimodaira–Hasegawa-like or Chi2
-based procedure [23
]. Bayesian analyses were performed using MrBayes 3.1.2 [24
]. Two independent runs of four chains and one million generations were run using fixed models that performed best when applying PhyML. To test the consistency and robustness of tree topologies, consensus trees were generated from 1000 bootstrap replicates using the BioNJ algorithm [25
] and the JTT model of amino acid substitution. Finally, a consensus tree was constructed considering all the analysis results and species trees as used by TreeBeST (http://treesoft.sourceforge.net
) for chordates, and reconstructed by OrthoDB for arthropods and fungi. The user-defined tree was tested against the ML tree and alternative models using TREE-PUZZLE [26
]. It is noted that even though sequence names used in the reference trees are in the format of UniProtKB/Swiss-Prot entry names, the identifiers are mostly not valid UniProtKB entry names.
Mapping of data from ortholog databases to the reference gene trees
Data was mapped to ortholog groups of the following databases: Compara (Ensembl 56), eggNOG (2.0), HOGENOM (05), InParanoid (7.0), OMA (October 2009), OrthoDB (3); Panther (7.0 beta). Whenever possible, Swiss-Prot cross-references to the phylogenomic databases were used to identify the relevant ortholog groups. Sequence mapping can be hindered for two reasons: (i) UniProtKB frequently uses special taxonomic identifiers for bacterial strains, if the complete genome has been sequenced; and (ii) over 4% of the sequence data is updated during the annotation process, which can prevent the mapping of sequences. We verified such cases manually in order to maximize data matching, taking into account available identifiers for genes, transcripts, and proteins.
For the examples shown here, the comparison between the reference tree and the databases was performed using the browser or the data sets. Here, the purpose was to obtain all relevant gene identifiers and to identify false positive hits for the selected species, which in some cases could not be obtained automatically based simply on the gene identifiers. Sequences of possible false positive hits were analyzed using MSA and tree reconstruction approaches. Consequently, a few more orthologs were identified and added to the reference dataset and tree. Blast services from phylogenomic databases were employed to search for genes that could not be mapped according to gene identifiers. For InParanoid, we obtained the relevant gene identifiers from the InParanoid browser, and extracted the corresponding ortholog information from the database.
For the quantitative analysis, we determined the number of predictions for three types of pairwise gene relationships: orthology, orthology/paralogy and ‘extended’ gene relationships. The latter take into account the number of gene duplications since the last common node of a gene pair. In this manner, a higher resolution for the topological correctness at internal nodes was obtained. We define the ‘extended’ relationships (x, y)-orthology and (x, y)-paralogy, where x and y specify how many duplications took place on the evolutionary path from the point where the two genes in question began diverging. For instance, a pair of orthologs with a single lineage-specific duplication resulting in genes A1, A2 and B are (1,0)-orthologs. Note that this concept is slightly different from the commonly used n:m orthology concept (where n, m is typically ‘1’ or ‘many’): n and m refer to the number of respective co-orthologs, while x and y in the extended gene relationships refer to the number of duplications on the respective paths of the relevant gene pairs since their common ancestry.
Metric and quantitative analysis
Terms used in the context of scoring are defined as follows: ‘True positives’ are predicted gene relationships that coincide with those of the reference model. ‘False negatives’ are gene relationships failed to be predicted according to the reference model and the species list of a given phylogenomic database. The lack of predicted gene relationships can arise from a number of causes. For example, the gene may not be part of the input dataset, the gene model may be incorrect, the gene product may be an isoform, or an ortholog being wrongly predicted as paralog. As the content of databases is benchmarked here rather than the orthology prediction methods, we do not differentiate between these causes. The selection of up-to-date and complete input data is seen as one of the important tasks of phylogenomic databases. False positives are predicted gene relationships which do not correspond to those inferred from the reference tree and which are either outparalogs or not homologs at all. ‘True negatives’ are gene relationships, which are correctly predicted not to be the type of gene relationship in question. ‘Expected OTUs’ (Operational Taxonomic Units) are all relevant genes according to the reference tree and the species list of a database. ‘Mapped OTUs’ are all relevant genes according to the reference tree and the species list of a database. ‘Supplementary gene list’ specifies genes that have not been used in the analysis of the reference tree, e.g. due to incomplete or erroneous gene models. This list is thought to be helpful when automating the benchmarking procedure. Currently, these genes are not considered when calculating scores. However, it is conceivable to annotate some gene relationships based on gene synteny or analysis of small datasets of closely related genes.
A list of all possible gene relationships with annotated orthology/paralogy was created for each reference tree, which was then used as a template to construct database-specific lists by removing genes from non-relevant species. The expected number of orthologous and paralogous relationships was obtained from these lists. The number of true positives, false positives and false negatives was determined from the database results. For pure orthologous groups, only the number of true positive and false positive ortholog predictions could be determined, as no paralogs are specified in this concept. For pairwise groups, the status of each orthologous and paralogous prediction was determined from the groups. For hierarchical groups, we calculated precision and sensitivity for the most specific groups and for trees that were reconstructed according to the hierarchy. Reconciled trees were benchmarked to the hierarchical reference groups.
Extended orthology/paralogy relationships were obtained directly from the reconciled trees. For hierarchical groups, the unresolved trees were reconstructed according to . Specified gene relationships were evaluated on the assumption that branches with non-overlapping taxonomic ranges (i.e. all different species) are orthologs and branches with overlapping taxonomic ranges (i.e. some common species) are paralogs. Unclear gene relationships at multifurcating nodes were set to ‘undefined’. For pairwise groups, the information was considered specified if there was no more than one gene duplication since the last common ancestor for each lineage; in all other cases, the gene relationships were considered ‘undefined’. In case of OMA pairwise groups, the branch of the reference gene was not considered. Extended gene relationships are not calculated from pure orthologous groups that conceptually contain no information on gene duplications. For plain trees, Robinson–Foulds distances were calculated.
Figure 1: Concepts of selected phylogenomic databases. Rows (from top to bottom) indicate the different database concepts, the structure of ortholog groups, the completeness of predicted gene relationships and the implied tree structures. Latter visualizes the (more ...)
Precision and sensitivity were calculated for the three types of gene relationships. The Positive Predictive Value (PPV) is calculated as:
. This score reflects the correctness of the predicted gene relationships, regardless of the size of an ortholog group, the number of family members or the existence of hierarchical levels. The True Positive Rate (TPR) complements the PPV by taking into account the number of false negative hits. TPR is calculated as
All scores were normalized between 0 and 1, with higher values indicating a better fit to the reference tree. The distinction between the quality descriptors PPV and TPR is relevant in systems with a sensitivity-specificity trade-off, as it was observed in earlier benchmarking studies. Consequently, these ratios have not been combined into a single quality score.