Supertree estimation methods need to be both highly accurate and reasonably fast; otherwise they will not be useful for estimating large phylogenies. Our discussion thus addresses both running time and topological accuracy.
The results for the simulated datasets show clearly that all the methods produce trees with about the same accuracy on datasets with very dense scaffolds, but differ substantially in accuracy on the datasets with sparser scaffolds. Since sparser scaffolds are common for biological supertree inputs, the differences in accuracy on sparse scaffolds are important.
In general we found that all the SuperFine variants we studied (whether using MRL or MRP to refine polytomies in the SCM tree) produced very accurate trees, and that differences between them were largely in terms of running time, or with respect to MRL, MRP, or Sum-FN score. With respect to running time, SuperFine+MRP(TNT) was the fastest of all the methods we studied, finishing in at most 4 minutes on all the datasets (including the largest one with 2228 taxa and 39 source trees). Furthermore, SuperFine+MRP(TNT) produced very good MRP and MRL scores, outperforming TNT and PAUP* with respect to MRP score optimization. On the biological datasets, we also observed similar results, including that SuperFine+MRP(TNT) generally produced very good Sum-FN scores. Thus, although SuperFine+MRP(TNT) was not designed to be a heuristic for any of these criteria, it has excellent performance across the board.
It is worth discussing in greater depth the results we showed for Sum-FN scores. Our study shows that neither MRP nor Sum-FN scores have the best correlation with tree error, except when the scaffold is very dense. This result suggests that optimizing MRP or Sum-FN may not be the best strategy (except with dense scaffolds), and that evaluating supertree methods with respect to Sum-FN may not be the best way of distinguishing methods (except for dense scaffold datasets, perhaps). These observations were made earlier in [24], but are worth repeating here because of the increased interest in an approach to supertree estimation proposed by Steel and Rodrigo [15], called "maximum likelihood supertrees". This method is based upon an exponential error model, and can be based upon different ways of measuring distances between trees and weights on the input trees. However, in the simplest case, where the weights on trees are all the same and the distance between trees is the RF distance, finding the ML supertree is identical to optimizing Sum-RF (minimizing the total topological distance, using Robinson-Foulds scores, to the input trees), a criterion that is almost identical to Sum-FN. Indeed, when the input estimated trees are binary, these criteria are exactly the same. Since our simulation study estimated supertrees from binary source trees, our correlation analysis also shows that optimizing Sum-RF is not likely to be the best strategy, except for dense scaffold datasets, and thus suggests that the use of the RF distance metric within the ML supertree approach proposed by Steel and Rodrigo may not be appropriate. We note here a potential shortcoming of the ML supertree approach in general: it seems likely that the probability of a particular estimated tree will depend not only on its topological distance to the true tree, but also on the parameters of the true tree (especially the branch lengths), since very short branches are more likely than longer branches to go unrecovered in a phylogenetic estimation.
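The relationship between Sum-FN and Sum-RF can be illustrated with a minimal Python sketch. It is not the code used in this study: the nested-tuple tree encoding and all function names are assumptions made for illustration. Each source tree is compared with the supertree restricted to that source tree's taxa; FN counts source-tree bipartitions missing from the restricted supertree, and RF counts FN plus the bipartitions present only in the restricted supertree. When both trees are fully resolved, the two counts are equal for each source tree, so Sum-RF is exactly twice Sum-FN.

```python
# Illustrative sketch only; trees are nested tuples with string leaf labels (an assumed encoding).

def leaves(tree):
    """Return the set of leaf labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def restrict(tree, taxa):
    """Prune the tree to `taxa`, suppressing nodes left with a single child."""
    if isinstance(tree, str):
        return tree if tree in taxa else None
    kids = [k for k in (restrict(child, taxa) for child in tree) if k is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)

def bipartitions(tree):
    """Nontrivial bipartitions of the unrooted tree, stored as the side lacking a reference taxon."""
    all_taxa = leaves(tree)
    ref = min(all_taxa)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = all_taxa - below if ref in below else below
        if 2 <= len(side) <= len(all_taxa) - 2:
            splits.add(side)
        return below

    visit(tree)
    return splits

def sum_fn_and_rf(supertree, sources):
    """Return (Sum-FN, Sum-RF) of a supertree against a list of source trees."""
    total_fn = total_rf = 0
    for source in sources:
        src = bipartitions(source)
        sup = bipartitions(restrict(supertree, leaves(source)))
        fn, fp = len(src - sup), len(sup - src)
        total_fn += fn
        total_rf += fn + fp
    return total_fn, total_rf

supertree = ((("A", "B"), "C"), (("D", "E"), "F"))
sources = [(("A", "B"), ("C", "D")), (("A", "E"), ("D", "F"))]
print(sum_fn_and_rf(supertree, sources))  # (1, 2)
```

In this toy input the first source tree agrees with the supertree and the second conflicts with it on one bipartition, so Sum-FN is 1 and Sum-RF is 2. If the supertree were left partly unresolved, the restricted supertree could contribute fewer false positives than false negatives, and the factor of two would no longer hold for that tree.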
A fundamental observation in this study is that searching for supertrees that optimize the maximum likelihood score under the S2+CAT model improved tree accuracy, a trend that we found quite surprising. The MRP matrix is a collection of partial binary characters defined by the input source trees. When these trees are compatible, the MRP matrix will exhibit no homoplasy at all, a condition under which the MRP solution will yield the true tree. Therefore, when there is no homoplasy, the ML solution under a no-common-mechanism model [33] (in which every combination of edge and site has its own rate parameter) will also produce the true tree, since then ML and MP produce the same trees. However, standard ML models (including the model used in this study) assume i.i.d. rates across sites, which does not yield the same result. Thus, we do not have a theoretical explanation for why optimizing likelihood under S2+CAT should lead to good supertrees. All we can say is that the data suggest that there may be some value (even if only approximate, and perhaps only under some conditions, not yet understood) in using maximum likelihood under this model as an optimization criterion for estimating supertrees. Future work should investigate whether optimizing the MRL score continues to return good solutions when the source trees are estimated from sequences that evolve under more realistic conditions, including indels, heterotachy, and non-stationarity.
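To make the phrase "partial binary characters" concrete, the following minimal Python sketch builds an MRP-style matrix from a set of source trees. The input format (each source tree given as its taxon set plus one side of each internal bipartition) and the function name are assumptions made for illustration, not the encoding used by any particular software. Each bipartition becomes one character: taxa on the listed side are scored 1, the other taxa in that source tree are scored 0, and taxa absent from the source tree are scored as missing data ("?").

```python
# Illustrative sketch only; the input format is an assumption.

def mrp_matrix(source_trees):
    """Build an MRP-style matrix.

    source_trees: list of (taxa, splits) pairs, where `taxa` is the set of taxa
    in one source tree and each entry of `splits` is the set of taxa on one
    side of an internal edge of that tree.
    Returns a dict mapping each taxon to its row of '0', '1', and '?' states.
    """
    all_taxa = sorted(set().union(*(taxa for taxa, _ in source_trees)))
    rows = {taxon: [] for taxon in all_taxa}
    for taxa, splits in source_trees:
        for side in splits:
            for taxon in all_taxa:
                if taxon not in taxa:
                    rows[taxon].append("?")  # taxon absent from this source tree
                elif taxon in side:
                    rows[taxon].append("1")
                else:
                    rows[taxon].append("0")
    return {taxon: "".join(states) for taxon, states in rows.items()}

source_trees = [
    ({"A", "B", "C", "D"}, [{"A", "B"}]),
    ({"A", "D", "E", "F"}, [{"D", "F"}]),
]
matrix = mrp_matrix(source_trees)
for taxon, row in matrix.items():
    print(taxon, row)
```

A matrix of this form can be analyzed under maximum parsimony (MRP) or under a two-state likelihood model (MRL); the sketch covers only the encoding step, not the tree search.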
As has been noted in [34], supertree analyses are not always able to completely identify the true tree, because the conditions required for such identification include correct source trees and overlap properties that may not be true of any given set of source trees. However, alternatives, such as combined analyses, in which a phylogeny estimation method is applied to a concatenation of the gene sequence alignments, also have only limited guarantees. From a practical standpoint, the evidence suggests that while combined analyses can yield more accurate trees [20] than supertree methods, there are conditions in which combined analysis methods cannot be used (e.g., heterogeneous data, including morphology, gene orders, or different types of molecular data), or are simply too computationally intensive. In these cases, improved supertree methods can be important tools in the phylogenetics toolkit.
In summary, this study introduces a new set of supertree methods based upon combining the divide-and-conquer strategy within SuperFine with fast supertree methods. In particular, the combination of SuperFine with TNT is extremely fast and produces very accurate supertrees, even on the largest datasets we studied. Earlier work [24] showed that SuperFine (based upon MRP, and using PAUP*) came very close to the accuracy of combined analysis based upon maximum likelihood. Future work should investigate statistical approaches to supertree estimation (along the lines of maximum likelihood supertrees, but taking branch lengths or support into account). The combination of SuperFine with such statistically-based supertree methods might close the gap between combined analysis and supertree methods.