We have evaluated the potential of Z-closure and Q-imputation filtered supernetworks to identify splits belonging to the sets of principal trees associated with hybridization networks. We have found that this approach can recover these splits when there are few hybridization events. However, our results imply that (1) if gene trees have many missing taxa, then many gene trees are required; (2) if the gene trees are frequently incongruent with the principal trees of the hybridization network due to incomplete lineage sorting, then a large number of near-complete gene trees is required; and (3) if there are few gene trees available, they need to be both near-complete and highly congruent with the principal trees.
In our simulations the counting filter picked the n best-supported splits, where n was chosen to be the known number of true non-trivial splits. Of course with real data n will not be known, although in practice n could be chosen by, for example, greedily introducing splits with highest support as long as the corresponding network does not become too complex to easily interpret. Approaches to do this are described in [40] for consensus networks. Note that by increasing n the risk of introducing false positive splits is increased, although the risk of failing to identify true positive splits is reduced.
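The counting filter described above can be illustrated with a minimal sketch. It assumes that support is measured as the raw frequency with which a complete split is observed, and that a split A|B is stored as the side that excludes the smallest taxon label; the function names `canonical` and `counting_filter` and the toy data are hypothetical and are not taken from the software used in this study.

```python
from collections import Counter


def canonical(side, taxa):
    """Represent a split A|B by the side that excludes the smallest taxon label.

    This is an assumed convention so that A|B and B|A are counted as one split.
    """
    side = frozenset(side)
    return side if min(taxa) not in side else frozenset(taxa) - side


def counting_filter(splits, taxa, n):
    """Return the n most frequently observed complete splits.

    Ties are broken by the sorted taxon labels so the result is deterministic.
    """
    counts = Counter(canonical(s, taxa) for s in splits)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], sorted(kv[0])))
    return [split for split, _ in ranked[:n]]


# Toy example (hypothetical data): six observed splits on five taxa.
taxa = {"a", "b", "c", "d", "e"}
observed = [
    {"a", "b"},
    {"c", "d", "e"},   # same split as {"a", "b"}
    {"a", "b"},
    {"d", "e"},
    {"a", "b", "c"},   # same split as {"d", "e"}
    {"b", "c"},
]
kept = counting_filter(observed, taxa, 2)
```

With n = 2 the sketch keeps the two splits seen three times and twice, and drops the split seen once. Raising n to 3 would admit the singleton split, which illustrates the trade-off between false positives and false negatives noted above.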
Despite these limitations, with the potential now of obtaining large numbers of splits from independent gene loci using new generation sequencing technologies, our findings may nevertheless be applicable for tree-like phylogenies where some degree of hybridization is inferred [41]. In such cases, filtered supernetworks can be used to identify the true splits of the underlying hybridization network. Once these are obtained, the method of [5] can be used to convert the split system into a hybridization scenario.
One of our most interesting findings is that the choice between Z-closure and Q-imputation seems to have much less impact than the choice of filter on the accuracy with which the splits in the underlying hybridization network are recovered. For both Q-imputation and Z-closure, the counting filter (CF) has the desirable property that as the amount of data increases (more genes or more complete gene trees) the rates of both false positives and false negatives go down. Several settings were tried for the homoplasy-based filter (HF1 – HF4). HF1 was too stringent, and HF3 and HF4 tended to suffer from either increasing false positives or increasing false negatives as the number of gene trees increased. HF2 gave the best compromise between these extremes.
Using the HF2 filter, we found that Z-closure had a higher false positive rate than Q-imputation over a range of parameter combinations (Figure ). One explanation might be that Z-closure can potentially generate more splits than Q-imputation. For example, given g fully resolved gene trees on 8-m taxa (m = 1, 2, 3), Q-imputation can generate at most 5*g non-trivial complete splits, whereas Z-closure can produce at most 10*(5-m)*g non-trivial complete splits (where 10 is the number of random orderings). Hence the maximum number of splits that Z-closure could generate decreases as m grows, whereas the number of splits that Q-imputation could generate stays constant. What we observe for both methods and filters is that false positives increase with increasing m (Figure ). Therefore, the maximum number of splits that Z-closure and Q-imputation could generate does not appear to explain the difference in false positive rates. We think a more likely explanation is that Q-imputation places missing taxa in such a way as to maximize agreement with the input trees, hence tending to produce multiple copies of the same splits. Conversely, Z-closure aims to find all possible complete splits that can be derived by extending partial splits using the Z-closure rule, a process that can yield many different splits. Hence we expect that Q-imputation would be likely to generate fewer false positives than Z-closure in general. This difference is not greatly reduced by HF2 as, in contrast to CF, it does not place a cap on the total number of splits.
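The upper bounds quoted above can be made concrete with a short calculation. The sketch below simply evaluates the two expressions from the text, 5*g for Q-imputation and 10*(5-m)*g for Z-closure; the function names and the choice of g = 100 gene trees are illustrative assumptions.

```python
def max_splits_q_imputation(g):
    """Upper bound from the text: 5 non-trivial complete splits per gene tree."""
    return 5 * g


def max_splits_z_closure(g, m, orderings=10):
    """Upper bound from the text: orderings * (5 - m) splits per gene tree,
    where m is the number of missing taxa and 10 random orderings are used."""
    return orderings * (5 - m) * g


g = 100  # hypothetical number of gene trees
bounds = {m: (max_splits_q_imputation(g), max_splits_z_closure(g, m))
          for m in (1, 2, 3)}
for m, (q_bound, z_bound) in bounds.items():
    print(f"m={m}: Q-imputation <= {q_bound}, Z-closure <= {z_bound}")
```

For g = 100 the Q-imputation bound stays at 500, whereas the Z-closure bound falls from 4000 (m = 1) to 3000 (m = 2) and 2000 (m = 3). The Z-closure bound therefore shrinks as m grows, which is the opposite of the observed trend in false positives.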
We found that HF2 resulted in more false negatives than CF (Figure ). This may be due to the fact that this filter only selects splits that have no homoplasy when restricted to 75% of the input trees. Since the principal trees are obtained from a network, they can be different and in some cases may only agree on a small number of edges. Even a true split may have a high homoplasy score when restricted to a particular principal tree. In contrast, CF only selects splits that occur with high frequency, irrespective of whether they are in agreement with any of the input trees.
Although all the trees used in our simulations were fully resolved, both supernetwork methods considered here can be applied to partially resolved trees. Thus, when inferring gene trees to be used as input to a supernetwork method, it would probably be a reasonable approach to only retain those edges in the estimated gene trees that have high support (e.g. bootstrap support or posterior probability higher than some cut-off value).
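As a minimal sketch of this suggestion, suppose each estimated gene tree is summarised as a mapping from its non-trivial splits to a support value (bootstrap proportion or posterior probability). Dropping the poorly supported splits then yields a partially resolved tree. The data structure, the function name `retain_supported_splits` and the cut-off of 0.7 are all assumptions made for illustration only.

```python
def retain_supported_splits(split_support, cutoff):
    """Keep only the splits whose support is higher than the cut-off.

    split_support maps a split (here a frozenset holding one side) to its
    support value; removing a split corresponds to collapsing that edge.
    """
    return {split for split, support in split_support.items() if support > cutoff}


# Hypothetical gene tree on six taxa with three non-trivial splits.
gene_tree = {
    frozenset({"a", "b"}): 0.98,
    frozenset({"a", "b", "c"}): 0.55,
    frozenset({"e", "f"}): 0.81,
}
resolved = retain_supported_splits(gene_tree, cutoff=0.7)
```

Here the weakly supported split {a, b, c} is discarded, so only two of the three edges would be passed on to the supernetwork method.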
In cases where there are many hybridization events, especially between individuals that are not closely related, there will be many principal trees and corresponding splits (as in hybridization network 10). Many of these splits will occur at low frequencies making them hard to distinguish from phylogenetic error. This means that phylogenetic inference will be limited, as gene-tree incongruence will be extensive. In such cases, rather than attempt to reconstruct a hybridization network, it may be more appropriate to formulate objective tests to better understand the complexity of the data and the extent to which hybridization contributes to this complexity. Joly, McLenachan and Lockhart (submitted manuscript) have recently proposed such a test.
An unexplored idea worthy of study is the investigation of model-based, rather than combinatorial, methods of filtering. One approach might be to consider posterior probability distributions on species trees [42]. It will be interesting to investigate whether such posterior distributions can also be analysed for evidence of distinct principal trees in cases where evolutionary relationships are complex.