Molecular phylogenies of many taxonomic groups are based on analyses of single loci. While this approach has led to important insights into the evolution of many groups of interest (consider, as an extreme example, Källersjö et al. [1]), it is also hampered by a number of potential difficulties. For instance, due to effects such as horizontal gene transfer, hybridisation, lineage sorting, paralogous genes, and pseudogenes, gene trees and species trees do not always agree [2]. Furthermore, the length and, hence, the information content of individual genes are limited, sometimes causing a lack of resolution in the inferred trees. Saturation is an important problem, in particular if the aim is to resolve relationships between major groups of organisms ("deep phylogeny") [3]. Nowadays, an increasing number of completely sequenced genomes are available, and a growing field of phylogenetic research deals with the question of how to infer reliable phylogenies from this large amount of data to overcome the limitations of single-gene phylogenies.
A relatively obvious approach to phylogenetic analysis of whole genomes is to extract as many genes as possible from the genome sequences, create a multiple sequence alignment for each gene, and concatenate all alignments. Datasets on the order of 100,000 base pairs have been compiled in this way (e.g., [2]). Such datasets can be analysed using the same phylogenetic inference tools as single-locus datasets.
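The concatenation step can be illustrated with a minimal sketch. It assumes that each gene alignment is available as a mapping from taxon name to aligned sequence, and that a taxon lacking a gene is padded with gap characters; the function name `concatenate_alignments` and the padding convention are illustrative assumptions, not part of any published pipeline.

```python
def concatenate_alignments(alignments, gap="-"):
    """Concatenate per-gene alignments into one combined alignment.

    alignments: list of dicts mapping taxon name -> aligned sequence.
    Taxa missing from a gene are padded with gap characters (an assumption
    of this sketch; other missing-data codings are equally possible).
    """
    taxa = sorted({taxon for aln in alignments for taxon in aln})
    parts = {taxon: [] for taxon in taxa}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one alignment must have equal length")
        width = lengths.pop()
        for taxon in taxa:
            parts[taxon].append(aln.get(taxon, gap * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


# Toy example with two genes; taxon2 lacks the second gene.
gene_a = {"taxon1": "ACGT", "taxon2": "ACGA", "taxon3": "AC-T"}
gene_b = {"taxon1": "GGC", "taxon3": "GGA"}
combined = concatenate_alignments([gene_a, gene_b])
```

In this toy example every taxon ends up with a sequence of length seven, and the missing gene of `taxon2` is represented by three gap characters.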
Difficulties with this approach may arise if orthologous genes cannot be identified with certainty or if the combined sequence length is still too small to give well-resolved trees. Furthermore, the use of concatenated multiple sequence alignments discards information that can be utilised by other methods of phylogenetic inference. Examples are methods that infer trees based on gene content [5], gene order [8], or content of protein orthologs and folds [11]. When applied to prokaryote phylogeny, these different methodological approaches lead to quite different results [12]. A further loss of information in the concatenated multiple sequence alignment approach may be caused by regions which have to be discarded since they cannot be aligned with certainty [13].
In contrast, a third group of methods does not require genes or orthologs to be specified in advance, multiple sequence alignments to be created, or unalignable regions to be discarded, but is able to generate a distance matrix directly from complete genome sequences. Trees can then be inferred using any of the standard distance-based phylogenetic methods (e.g., [14]), even though phylogenetic networks [16] may be a more powerful way to explore such distance data. Some of these approaches use differences in word-count frequencies [18], complexity-based measures [19] or breakpoint analysis [20] to derive pairwise distance functions.
The methods of particular interest to us in this paper [21] rely on the identification of local regions of high sequence similarity between two genomes; this is usually done with the popular tool BLAST [24]. Henz et al. [23] recently described the "Genome BLAST Distance Phylogeny" (GBDP) approach and applied it to deep prokaryote phylogeny. In brief, GBDP works by finding a set of high-scoring segment pairs (HSPs) between each pair of genomes, deriving a distance function from these sets, and building a tree or a network using algorithms like UPGMA [25], NJ [26], BIONJ [28] or Neighbor-net [16].
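To make the step from HSP sets to a distance more concrete, the following sketch computes one possible coverage-based distance. It is not the formula of Henz et al. [23]; it merely assumes that HSPs are given as half-open coordinate intervals on each genome, that similarity is the fraction of both genomes covered by at least one HSP, and that the distance is one minus this similarity. The function names `covered_length` and `coverage_distance` are hypothetical.

```python
def covered_length(intervals):
    """Total length of the union of half-open intervals (start, end)."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # The previous merged block is finished; count it and start a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent HSP: extend the current block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def coverage_distance(hsps_on_a, len_a, hsps_on_b, len_b):
    """Illustrative distance: 1 - covered fraction of both genomes combined.

    Pooling both directions is one hypothetical way of making the
    asymmetric BLAST result symmetric.
    """
    covered = covered_length(hsps_on_a) + covered_length(hsps_on_b)
    return 1.0 - covered / (len_a + len_b)


# Toy example: two genomes of length 100 with overlapping HSPs on genome A.
dist = coverage_distance([(0, 50), (40, 80)], 100, [(10, 70)], 100)
```

Overlapping HSPs are merged before counting, so that a region hit several times does not inflate the similarity. In the toy example, 80 positions of genome A and 60 positions of genome B are covered.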
Statistical support of individual branches within trees inferred from multiple sequence alignments is usually assessed by bootstrapping [29], which assumes a number of statistically independent individual characters. Similar to some other less commonly used but valuable (and, hence, perhaps underused) phylogenetic methods such as elision [13], direct optimisation [31], fixed-states and search-based optimisation [32], or pairwise distances between unaligned sequences from single loci [34], the above-mentioned genome distance methods cannot readily be combined with the bootstrap since the whole genome is treated as a single character.
In our view, this potential disadvantage is outweighed by the fact that distance methods may be combined with phylogenetic network techniques, which have some distinct advantages over bootstrapping (e.g., [16]). For instance, bootstrapping cannot distinguish between conflicting signal and a low amount of signal, and it cannot identify "rogue taxa" (e.g., [39]). Furthermore, many evolutionary processes are better represented by networks than by trees [17]. Network techniques are better suited than bootstrapping to detect systematic error in phylogenetic analyses, particularly in very large datasets such as genome-scale data [17]. Neighbor-net is also much faster than even Neighbor-joining bootstrapping [16]. Since distance methods such as GBDP may also directly use complete genome sequences, their combination with network techniques may be more efficient than bootstrapping of concatenated multiple sequence alignments.
The present article builds on the work of Henz et al. [23] and extends it in several ways. Here, we apply GBDP to completely sequenced plastid and mitochondrion genomes to infer relationships of major eukaryotic groups. Plastid and mitochondrion genomes are highly, sometimes extremely, reduced, and are subject to evolutionary conditions quite different from those of prokaryote genomes. We were thus interested in whether GBDP would perform as well as with genomes of prokaryotes [23], and if so, under which conditions. Completely sequenced plastid genomes have been used in a number of articles (e.g., [43]) to infer phylogenetic relationships based on sequence alignments of many concatenated genes, enabling us to directly compare the GBDP results with respect to, e.g., recovery and placement of major eukaryotic groups and location of primary and secondary endosymbiosis events.
We also examine additional modifications of GBDP. A new distance function based on sequence identity within HSPs is introduced. Different formulae for creating symmetric similarity scores from the asymmetric results of BLAST comparisons are examined, as well as two different formulae to derive distances from similarity values. We also investigate the use of protein-protein BLAST (WUTBLASTX [24]) instead of nucleotide-nucleotide BLAST (NCBI-BLASTN [48]) and two ways of combining the two methods of HSP search. The accuracy of trees inferred from GBDP distances by three well-known (UPGMA, NJ, and BIONJ) and two recently described reconstruction methods (STC [49] and FastME [50]) is measured by comparison with the current NCBI taxonomy based on c-scores [23]. The c-score is defined as the number of non-trivial splits in the phylogenetic tree under study that are compatible [51] with the reference topology, divided by the total number of non-trivial splits in the test tree. Compatible splits are either already included in the reference topology or refine it, but do not conflict with it. The c-score's denominator corrects for, e.g., a different number of taxa or a different amount of resolution in the test trees. The main factors increasing or decreasing GBDP accuracy were determined by multiple regression analysis with the c-score as the dependent variable.
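The c-score definition can be sketched in a few lines. The sketch assumes that each tree is represented by its set of non-trivial splits, with each split stored as the set of taxa on one side, and it uses the standard criterion that two splits of the same taxon set are compatible if at least one of the four pairwise intersections of their sides is empty. The function names are hypothetical, and the handling of a test tree without non-trivial splits (an error) is an assumption of this sketch.

```python
def compatible(split_a, split_b, taxa):
    """Two splits are compatible if one of the four side intersections is empty."""
    a, b = set(split_a), set(split_b)
    a_rest, b_rest = taxa - a, taxa - b
    return any(not (x & y) for x in (a, a_rest) for y in (b, b_rest))


def c_score(test_splits, reference_splits, taxa):
    """Fraction of non-trivial test-tree splits compatible with the reference."""
    taxa = set(taxa)
    test_splits = list(test_splits)
    if not test_splits:
        raise ValueError("test tree has no non-trivial splits; c-score undefined")
    n_compatible = sum(
        all(compatible(s, r, taxa) for r in reference_splits)
        for s in test_splits
    )
    return n_compatible / len(test_splits)


taxa = {"A", "B", "C", "D", "E"}
# Poorly resolved reference ((A,B),C,D,E): a single non-trivial split.
reference = [{"A", "B"}]
# Test tree ((A,B),C,(D,E)): contains AB and refines the reference with DE.
score_refining = c_score([{"A", "B"}, {"D", "E"}], reference, taxa)
# Test tree ((A,C),B,(D,E)): the split AC conflicts with AB in the reference.
score_conflicting = c_score([{"A", "C"}, {"D", "E"}], reference, taxa)
```

In the first test tree, the split separating D and E from the remaining taxa is absent from the reference but does not contradict it, so it counts as compatible. In the second, the split grouping A with C conflicts with the reference split grouping A with B, and only one of the two test splits is compatible.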
Holland et al. [52] described a statistical geometry approach to estimate the departure of a distance matrix from the additivity condition [53], i.e., the degree to which it is not treelike, by computing so-called δ values for all quartets of taxa. A similar approach is the Q criterion of Guindon and Gascuel [54], which is also computed from taxon quartets and can be used to assess the treelikeness of a distance matrix. As most distance methods are guaranteed to infer the correct tree from completely additive distances, distance matrices with the least departure from additivity should be preferable [14]. An additional advantage of δ values is that they are, in contrast to, e.g., c-scores, independent of any preconceived hypothesis about what the true phylogeny looks like. We thus examined the quality of each GBDP distance matrix in phylogeny reconstruction directly by measuring its mean δ value. As an empirical investigation of the approach described by Holland et al. [52], the suitability of δ values for predicting phylogenetic accuracy could then be assessed by regression analyses.
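As an illustration of how a mean δ value can be obtained, the following sketch follows the usual quartet-based formulation: for taxa x, y, u, v the three pairwise sums d(x,y) + d(u,v), d(x,u) + d(y,v) and d(x,v) + d(y,u) are sorted so that m1 ≥ m2 ≥ m3, and δ = (m1 - m2) / (m1 - m3). Under the additivity condition the two largest sums are equal, giving δ = 0. Setting δ to 0 when all three sums are equal, the input format (a symmetric list-of-lists matrix) and the function names are assumptions of this sketch.

```python
from itertools import combinations


def quartet_delta(d, x, y, u, v):
    """Delta value of one quartet from a symmetric distance matrix d."""
    m1, m2, m3 = sorted(
        (d[x][y] + d[u][v], d[x][u] + d[y][v], d[x][v] + d[y][u]),
        reverse=True,
    )
    if m1 == m3:
        # All three sums equal: no tree-like or conflicting signal; taken as 0 here.
        return 0.0
    return (m1 - m2) / (m1 - m3)


def mean_delta(d):
    """Mean delta over all quartets of taxa (indices 0..n-1)."""
    values = [quartet_delta(d, *q) for q in combinations(range(len(d)), 4)]
    if not values:
        raise ValueError("at least four taxa are required")
    return sum(values) / len(values)


# Additive distances from the tree ((0,1),(2,3)) with all branch lengths 1.
additive = [
    [0, 2, 3, 3],
    [2, 0, 3, 3],
    [3, 3, 0, 2],
    [3, 3, 2, 0],
]
# Non-additive distances: the two largest pairwise sums differ (8 versus 6).
non_additive = [
    [0, 2, 3, 4],
    [2, 0, 4, 3],
    [3, 4, 0, 2],
    [4, 3, 2, 0],
]
delta_additive = mean_delta(additive)
delta_non_additive = mean_delta(non_additive)
```

The additive matrix yields a mean δ of 0, whereas the second matrix, with pairwise sums 4, 6 and 8, yields 0.5. The number of quartets grows with the fourth power of the number of taxa, so this direct enumeration is only practical for moderately sized datasets.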