Here, we propose a reconciliation model based on DTL parsimony that distinguishes between regions of the species tree where ILS is likely, and those where only gene duplication and transfer need be considered. These differences are specified using a non-binary species tree: at binary nodes, we assume that ILS is so rare that incongruence is always evidence of gene duplication or transfer. At polytomies, ILS is considered, and gene duplication and transfer are invoked only if topological disagreement cannot be explained by ILS. This model can be invoked for both non-binary species trees and for binary species trees with short branches where ILS is suspected: even when the binary branching order of the species tree is known, the user can collapse edges in the species tree to indicate in which lineages ILS should be considered as an alternate hypothesis.

A key aspect of our model is that even when ILS is allowed, it is not possible to explain all incongruence in terms of ILS, even in a uniquely labeled gene tree. Let

*g* be a node in

*T*_{G} and let

*s* *V*_{S} be the associated node in the species tree. We wish to determine whether the divergence at

*g* is consistent with a co-divergence at

*s* or whether it can only be explained by events that give rise to a new locus; i.e. duplication and transfer. If the branch point at

*g* arose through a co-divergence with

*s*, then each species lineage descending from

*s* should inherit at most one descendant of

*g*. The presence of more than one descendant of

*g* indicates that the divergence at

*g* must be due to acquisition of an additional locus by duplication or transfer. An operational test for detecting more than one descendant on a branch results from the observation that any branching pattern that is consistent with a binary resolution of the polytomy can be explained by lineage sorting.

For example, the gene tree in represents a valid, binary resolution of the species tree, consistent with ILS. The embedding of the gene tree in the species tree shows that each species tree lineage inherits exactly one descendant of *x*_{1} and at most one descendant of *x*_{2}. Both *x*_{1} and *x*_{2} can be interpreted as deep coalescences. In contrast, there is no binary resolution of the species tree that corresponds to the gene tree in . The embedding of this gene tree requires two descendants of *y*_{2} in the lineage from *e* to *f*, a violation of model constraints. The only way to explain two descendants of *y*_{2} on the branch from *e* to *f* is by inferring a duplication () or a transfer ().

2.1 The DTLI algorithm

In our DTLI model, divergence in a gene tree arises through one of four events: duplication (

), transfer (

), speciation (

) and deep coalescence (

). The score of a reconciliation under this model is the weighted sum of the number of duplications (

), losses (

), and transfers (

):

where

*δ*,

*λ* and

*τ*, respectively, are the costs of duplication, loss and transfer. Speciation and deep coalescence represent co-divergence with binary nodes and polytomies, respectively, in the species tree and have zero cost. We refer to the cost of event

*ε* {

,

,

,

} as

*κ* (

*ε*).

A rooted, binary gene tree

*T*_{G}; a rooted, arbitrary species tree

*T*_{S}; a mapping

*M*_{L} :

*L*(

*V*_{G}) →

*L*(

*V*_{S}) from contemporary genes to the species from which they were sampled and a set of permitted events are given as input. The reconciliation of

*T*_{G} with

*T*_{S} results in an annotated tree,

*R*_{GS} = (

*V*_{G},

*E*_{G}), in which every internal node,

*g*, is annotated with the species

*s* *V*_{S} that contained gene

*g*, designated

*M*(

*g*), and the event that caused the divergence at

*g*, designated

. In addition, every

*g* *V*_{G} \ {

*ρ*_{G}} is annotated with

, the genes lost on the edge from

*P*(

*g*) to

*g*. Each loss is labeled with the species in which the loss occurred. We say (

*u*,

*v*)

*E*_{G} is a transfer edge if

and

*M* (

*u*)

_{S}
*M*(

*v*) and define Λ(

*R*_{GS})

*E*_{G} to be the set of transfer edges in

*R*_{GS}. If (

*u*,

*v*)

Λ(

*R*_{GS}), a transfer occurred from donor species

*d* =

*M*(

*u*) to recipient species

*r* =

*M*(

*v*).

Here, we present the DTLI event inference problem under the constraint that a deep coalescent is inferred at *g* if each lineage descending from *M*(*g*) inherits at most one descendant of *g*:

**The DTLI event inference problem**
**Input:** A rooted non-binary species tree, *T*_{S}; a rooted, binary gene tree, *T*_{G}; the leaf mapping, *M*_{L}.**Output:** All reconciliation histories *R*_{GS} that minimize *π* and satisfy the model constraints.

Algorithms for the DTLI event model must address several issues that do not arise when only a subset of the events is considered: (1) there may be more than one combination of duplications, transfers and losses that gives rise to the same pattern of tree incongruence (i.e. there may be more than one optimal solution,

*R*_{GS}). (2) The value of

*M*(

*g*) is not uniquely determined by the children of

*g* and multiple possible values of

*M*(

*g*) must be considered because transfers cause genes to jump to distant locations in the species tree. (3) An optimal reconciliation at the root may entail a suboptimal reconciliation at an internal node,

*g*. Inferring a more costly event at

*g* may change the values of

*M*(·) in nodes ancestral to

*g* such that the overall score is reduced. Therefore, the values of

*M*(

*g*) and

required for an optimal solution cannot be determined using only local information, and more than one optimal solution may result.

To accommodate these requirements, it is necessary to enumerate all possible assignments of

*M*(

*g*) and

, for each node

*g* *V*_{G}. At each

*g*, the associated information is stored in two tables,

and

. For each candidate assignment

*s* *V*_{S}, the score that minimizes the cost of reconciling

*T*_{G}(

*g*) with

*T*_{S}(

*s*), is stored in

. The associated events and other information needed to reconstruct the history at

*g* are stored in

.

Optimal reconciliations are calculated by a two-pass algorithm. The first pass (Algorithm 2.1.1) is a dynamic program that populates each

and

in a post-order traversal of

*T*_{G}. It returns the optimal reconciliation score, the values of

*M*(

*ρ*_{g}) and

corresponding to that score and the number of optimal histories. The second pass (

Supplementary Algorithm S1.0.1) is a traceback algorithm that reads information from each

to construct an optimal solution. Each optimal history is generated by traversing, in pre-order of

*T*_{G}, each unique path that leads to the optimal label(s) in

. Appropriate values of

*M*(

*g*) and

at each node

*g* are selected from

. Each candidate optimal history is then tested for temporal feasibility, as described in the next section. Only those histories that are temporally feasible are reported.

A key calculation in the dynamic program of firstPass is determination of the possible events at

*g* for a given candidate species assignment,

*M*(

*g*)=

*s*. These events, in turn, depend on

*M*(

*c*_{1}) =

*s*_{1} and

*M*(

*c*_{2}) =

*s*_{2}, where

*c*_{1},

*c*_{2} *C*(

*g*). The basis for determining candidate events that are consistent with

*s*,

*s*_{1} and

*s*_{2} is the following observation: if a duplication occurred at

*g*, then the species that inherit the descendants of

*c*_{1} and the species that inherit the descendants of

*c*_{2} will not be disjoint.

We define a test, based on this observation, for distinguishing duplication from other events:

where

is the set of species that vertically inherit descendants of

*P*(

*g*). If

and

are disjoint, than one of the other three events (

,

or

) must have occurred. These events can be distinguished from one another using

,

*M*(

*g*) and

*M*(

*c*_{1}) and

*M*(

*c*_{2}), as seen in costCalc in Algorithm 2.1.1. Note that

Equation (2) is different from the standard least common ancestor (lca) test; however, when

*M*(

*g*)=

*s* is binary, the descendants of

*s* are partitioned into two sets, the left and right descendants of

*s*, if there is no duplication. Therefore,

Equation 2 is equivalent to lca reconciliation (

Vernot *et al.*, 2008).

Because

only consists of elements that were vertically inherited, we must exclude transfer edges in the calculation. For this purpose, we define

the set of leaves of

*T*_{G}(

*g*) that were acquired through HGT. Formally, we define

to be a mapping from

*V*_{G} to sets of nodes in

*V*_{S}, where

*V*_{S}^{+} is the powerset of

*V*_{S}.

is the set of children of

*M*(

*P*(

*g*)) such that

; otherwise,

One more piece of machinery is needed: to determine

, we must know the children of

*M*(

*P*(

*g*)), but we do not have that information until we visit

*P*(

*g*). Therefore, we define a similar set mapping,

, to aid in the calculation of

.

is the, set of children of

*M*(

*g*) that vertically inherit a descendant of

*g*. Formally, if

*M*(

*g*)

*L*(

*T*_{S}),

; otherwise,

Algorithm 2.1.1 traverses

*T*_{G} in post-order calling calcCost at each

*g* *V*_{G}. The challenge in the DTLI model is to determine the sets of species that inherit the descendants of

*c*_{1} and

*c*_{2} when

*M*(

*g*) =

*s* is a polytomy; i.e. how to calculate

and

. When

*s* is binary, the descendants of

*s* are easily partitioned into two sets; when

*s* is a polytomy, all possible ways to partition the descendants must be considered. Each child of

*g* can be retained in any subset of the children of

*s*, ranging from size 1 to |

*C*(

*s*)| − 1. Our DTLI algorithm addresses this by considering all ways of partitioning

*C*(

*s*) into two non-empty subsets.

At each internal node

*g*, the algorithm assesses all possible values for

*M*(

*g*) and

by looping through all (

*s*_{1},

*s*_{2})

*V*_{S} ×

*V*_{S} and all

. Considering all power sets corresponds to considering all the ways to partition

*C*(

*s*_{1}) and

*C*(

*s*_{2}). The optimal event and child mapping under

*s* and

is determined by minimizing the cost of the candidate solution at

*g*:

where

, the number of losses on edge (

*g*,

*c*_{i}), is calculated using the loss heuristic in (

Vernot *et al.*, 2008). Note that for each

*s*, the local cost and history tables are also indexed by all possible values of

, which are in

*C*(

*s*)

^{+}.

2.2 Temporal infeasibility

Because the donor and recipient species of any transfer must have coexisted, each transfer implies a temporal constraint. A reconciliation is temporally feasible if an ordering of species exists that satisfies the constraints of all inferred transfers. Because reconciliations inferred by Algorithm 2.1.1 are not guaranteed to be feasible, each candidate optimal solution is tested for feasibility *post hoc*.

To determine whether a reconciliation

*R*_{GS} is temporally feasible, we construct a directed timing graph

*G*_{t} = (

*V*_{t},

*E*_{t}) that encodes all temporal constraints on species in

*T*_{S}. Only species that are the donor,

*d*, or recipient,

*r*, of a transfer edge in Λ(

*R*_{GS}) must be considered. Thus, the vertex set is defined as

*V*_{t} = {

*v* *V*_{S}|

(

*g*,

*h*)

Λ(

*T*_{G})

*v* =

*M*(

*g*)

*v* =

*M*(

*h*)}.

The edges in

*E*_{t} represent three types of temporal constraints:

- If species
*s*_{i} is an ancestor of species *s*_{j} in *T*_{S}, then *s*_{i} predates *s*_{j}: for every (*s*_{i}, *s*_{j}) in *V*_{t} × *V*_{t}, add (*s*_{i}, *s*_{j}) to *E*_{t} if *s*_{i} ≥_{S}
*s*_{j}. - Let (
*g, h*) and (*g*′, *h*′) be transfers in Λ(*R*_{GS}), such that *g* ≥_{G}
*g*′. Then *d* = *M*(*g*) and *r* = *M*(*h*) must have occurred no later than both *d*′ = *M*(*g*′) and *r*′ = *M*(*h*′). We add (*P*(*d*), *d*′), (*P*(*d*), *r*′), (*P*(*r*), *d*′) and (*P*(*r*), *r*′) to *E*_{t} . - Given a transfer (
*g, h*) Λ(*R*_{GS}), species *M*(*g*) and *M*(*h*) must be contemporaneous. Furthermore, any species that predates *M*(*g*) must also predate *M*(*h*) and vice versa. For every (*s*_{i}, *s*_{j}) *V*_{t} × *V*_{t}, add (*s*_{i}, *s*_{j}) to *E*_{t} if *s*_{k} *V*_{t} such that *s*_{i} ≥_{S}
*s*_{k} and *s*_{k} and *s*_{j} are the donor and recipient, or vice versa, of some transfer (*g, h*) Λ(*R*_{GS}).

We test each candidate optimal history for temporal feasibility by verifying that the associated timing graph

*G*_{t} is acyclic, using a modified topological sorting algorithm in Θ (|

*V*_{t} |+|

*E*_{t} |) (

Cormen *et al.*, 1990). Temporally infeasible histories are not reported. Note that it is not the case that if one optimal history is infeasible, all optimal histories are infeasible. Finding the optimal, temporally feasible reconciliation is NP-complete (

Tofigh *et al.*, 2011); we leave the problem of obtaining an optimal, feasible solution when all candidate solutions have infeasible timing constraints for future work.

2.3 Complexity and running time

Our algorithm is fixed-parameter tractable with polynomial complexity when the size of the largest polytomy,

*k**, is fixed. In practical data analyses,

*k** is likely to be small. Recent genome-scale analyses of ILS have focused on species trees with

*k** = 3 (

Ebersberger *et al.*, 2007;

Pollard *et al.*, 2006). In general, event inference will not yield informative results when the species tree is highly unresolved.

Theorem 2.1. *Given a binary gene tree T*_{G} and a non-binary species tree T_{S}, firstPass takes O(|*V*_{G}|(|*V*_{S}|+ *n*_{k} 2^{k*})^{2}(*h*_{S} + *k**)) *time*.

P

roof. firstPass visits each

*g* *V*_{G} in post order. At each

*g*, costCalc is called once for every (

*s*_{1},

*s*_{2})

*V*_{S} ×

*V*_{S} and

, resulting in a total of

*O*(|

*V*_{G}|(|

_{sVs} C(

*s*)

^{+}|)

^{2}) calls to costCalc. Because |

*C*(

*s*)

^{+}| = 2

^{|C(s)|} is

*O*(1) when

*s* is binary, |

_{sVs} C(

*s*)

^{+}| is bounded above by |

*V*_{S}|–

*n*_{k} +

*n*_{k}2

^{k*} and the number of calls to costCalc is

*O*(|

*V*_{G}|(|

*V*_{S}|+

*n*_{k}2

^{k*})

^{2}). We precalculate lca(

*s*_{1},

*s*_{2}) and test whether

*s*_{1}*s*_{2}, for all species pairs, in

*O*(|

*V*_{S}|

^{2}) time. Therefore, the complexity of costCalc is dominated by the calculations of

for

*l* and

*r*,

and

. These values can be computed in

*O*(

*h*_{S}),

*O*(log(

*k**)) and

*O*(

*k**) time, respectively. Thus, each call to costCalc has complexity

*O*(

*h*_{S} +

*k**). Once the post-order traversal is completed, we extract the minimum score in

, and all values of

*M*(

*ρ*_{G}) and

corresponding to that score. Since

, a linear search accomplishes this in

*O*(|

*V*_{S}|+

*n*_{k}2

^{k*}) time. Thus, the total complexity is

*O*(|

*V*_{G}|(|

*V*_{S}|+

*n*_{k}2

^{k*})

^{2}(

*h*_{S} +

*k**)).

Theorem 2.2. *secondPass returns each optimal reconciliation in O*(|*V*_{G}|(*h*_{S} + *k**)).

P

roof. secondPass starts from the

*M*(

*ρ*_{G}) and

found in firstPass. It then constructs an optimal solution by visiting each subsequent

*g* *V*_{G}, assigning mappings and events by looking up values in

in constant time. Losses are inferred in

*O*(

*k** +

*h*_{S}) time (see

Vernot *et al.*, 2008). Thus, the complexity for returning each optimal history is

*O*(|

*V*_{G}|(

*h*_{S} +

*k**)).

When *T*_{S} is binary, firstPass is completed in *O*(*h*_{S}|*V*_{G}||*V*_{S}|^{2}) time, and secondPass reports each optimal solution in *O*(*h*_{S}|*V*_{G}|) time.

Our Notung implementation is efficient in practice. We measured the time required to reconcile 1128 cyanobacterial gene trees with a species tree of size |*V*_{S}| ≤ 21 for all the parameter settings given in . To assess the effect of polytomy size, we also collapsed edges in the species tree to create a polytomy ranging in size from 2 to 6. The maximum average running time observed on a single AMD Opteron 2.3 ghz, 64-bit processor was ~ 0.05s. per solution.

| **Table 1.**Event counts for the cyanobacteria dataset, with *δ* = 3 and *λ* = 2 |