Subcolorings are partially ordered by inclusion of domains; the size of a subcoloring is defined to be the size of its domain.
Previous work and motivation for present algorithm
The foundational work in this area was done by Moran and Snir [
3,
4]. Their work is phrased in terms of "convex recoloring," i.e. finding the minimal number of changes in a coloring in order to obtain one that is convex.
It suffices to consider subcolorings for the case of leaf colorings. Indeed, any recoloring can be turned into a subcoloring by removing the color of all of the leaves that get recolored. Conversely, any convex subcoloring can be turned into a convex recoloring in linear time [
6]. For internal nodes, the color to be used for a given internal node is given by the definition of convex coloring. For leaf nodes, simply take the color of the closest colored node. In this equivalence, the number of leaves whose color is removed is equal to the number of leaves who get recolored; thus a minimal recoloring is equivalent to a maximal subcoloring. Because of this equivalence, we only consider subcolorings in this paper.
In [
4], Moran and Snir investigate both the general case of leaf colorings as well as the case of colorings including internal nodes. They also consider non-uniform recoloring cost functions, where a "cost" is associated with recoloring individual nodes and the goal is to find a convex recoloring minimizing total cost. In all settings, they demonstrate that the relevant recoloring problem is NP-hard. They also demonstrate fixed parameter tractablity (FPT) of the problems as described in the next paragraph. In [
3] they present, among other results, a 3-approximation for tree recoloring.
The FPT bound for leaf coloring an
n taxon tree from [
4],
O(
n4 τ Bell(
τ)), comes from an elegant argument using the Hungarian algorithm for maximum weight perfect matching on a bipartite graph. In fact, an inspection of their proof reveals that their algorithm is
O(
n d3 τ Bell(
τ)), where
d is the maximum degree of the tree. Bell(
k) denotes the
kth Bell number, which is the number of unordered partitions of
k items; these numbers are known to satisfy the bounds

[
7]. Their recursion at a given internal node iterates over every unordered partition of the nonconvex colors, constructing a bipartite graph with edge weightings determined by the sizes of subcolorings of subtrees using those color sets for the set of excluded colors. Applying the Hungarian algorithm to each such graph results in optimal solutions for every possible set of colors to exclude from the subtree at that internal node. Because every unordered partition of the non-convex colors is considered, the algorithm is exponential in
τ. For the case of general (i.e. not just leaf) colorings, Moran and Snir show that a dynamic program gives an
O (
n τ dτ+2) algorithm. This of course also gives the same bound for leaf colorings.
The work of Moran and Snir was followed up by many authors. For leaf-colored trees, Bachoore and Bodlaender [
8] propose a collection of reductions to simplify the problem under investigation. These reductions encode some of the logic of the algorithm presented here, such as that trees that have disjoint color sets can be solved independently. They also use the fact that nonconvexity can be expressed in terms of the crossings of paths connecting leaves of the same color to show that the recoloring problem can be solved in
O(
n4
OPT) time, where OPT is the optimal number of uncolored leaves. Note that this sort of bound is different than those described above, as OPT can get large even when the total number of bad colors is small. The work for the general case culminated in the work of Ponta, Hüffner, and Niedermeier [
9], who use the childwise iterative approach to dynamic programing to construct an algorithm of complexity
O(
n τ 3
τ).
For trees built from real data, taxonomic identifiers are not randomly spread across the tree in a uniform fashion. For example, species-level mislabeling will lead to trees that are mostly convex with a couple of outliers, while a horizontal gene transfer will effectively "transplant" one clade into another. In both of these situations there is a non-uniform distribution of taxonomic identifiers across the tree, and nonconvexity in these cases may be local. Indeed, in Figure we show the relationship between the badness β and the total number of bad colors τ for our example trees, showing that the badness β is significantly smaller than the total badness on a collection of phylogenetic trees for non-marker genes. This motivates the search for a fixed parameter tractable algorithm that is exponential in β rather than τ.
Furthermore, phylogenetics is typically concerned with a setting of trees with small degree. For example, many commonly used phylogenetic inference packages such as RAxML [
10] and FastTree [
11] only return bifurcating trees; these sorts of programs are the intended source of trees for our algorithm. Even when multifurcations are allowed, the setting of interest is that of degree much smaller than
β or
τ, which has ramifications for algorithm choice as described below.
Algorithm
In this section we present our algorithm, which makes two improvements over previous work for the parameter regime of interest. First, it restricts attention to cut colors, resulting in an algorithm that is exponential in β rather than τ. Still, such a complete recursion evaluates many sub-solutions that do not end up being used. Because the problem is NP-hard, we cannot avoid some such evaluation, but we might hope to do better than evaluating everything.
This motivates the second aspect, a branch and bound strategy that can make orders of magnitude improvements in the run time of the algorithm. In order to make the branch and bound algorithm possible, the algorithm enumerates all legal color allocations first, and ranks them according to the upper bound function. By bounding the size of a solution for a given color allocation, we can avoid fully evaluating the sub-solution for that color allocation. A simple way of bounding the size of a solution for a color allocation is the maximal size of the solution when convexity is ignored.
Given a rooted subtree T' of T, the root edge of T' in T is the edge connecting the root of T' to the rest of T .
Definition 6. Given a rooted subtree T' of T, define κ(T') to be the colors of χ cut by the root edge T' as it sits inside T.
Assume that T has been embedded in the plane, and that every internal node has been uniquely labeled. For every such label i let t(i) be the ordered tuple of labels of the nodes directly descendant from i in the tree, let T(i) be the subtree below i, and use κ(i) as shorthand for κ(T(i)). Vector subscript notation will be used to index both t(i) and the color set κ-tuples defined next; i.e. t(i)j is the jth entry of t(i).
Definition 7. A color set k-tuple is an ordered k-tuple of subsets of C. They will be denoted with π.
These color set k-tuples will represent the allocation of colors to subtrees. We will ensure convexity of these color allocations using the following two definitions.
Definition 8. Given a color set k-tuple π,
and
Definition 9.
An almost partition
of Y
C is an ordered 2-tuple (
b, π)
where b
C and π is a color set k-tuple such that A(
π) =
Y and B(
π)
![[subset, dbl equals]](/corehtml/pmc/pmcents/x2286.gif)
{
b}.
These will be the color allocations at a given internal node x with color b; this definition guarantees convexity locally, such that the b color is a legal color for the internal node. As described in the Introduction, we would like to primarily restrict our attention to cut colors, but this requires some attention because not all of the colors that need to be allocated at a given internal node are necessarily cut by the edge above that node. This motivates the following definition, which describes how all of the colors that are cut on the edges around a given internal node i are available for allocation at i except for those colors that are in κ(i) but not in X.
Definition 10.
Given i an internal node index and X
κ(
i)
define G(i, X) will be called the colors in play for (i, X).
Definition 11.
Assume we are given an internal node i, X
κ(
i),
and c
C. A legal color allocation
for (
i, c, X)
is an almost partition (
b, π)
of G(
i, X)
such that 1. πj
κ(
t(
i)
j)
2. if c
X then b =
c.
Denote the set of such legal color allocations with Δ(
i, c, X),
and let Δ(
i) =
c, X Δ(
i, c, X).
These color allocations are exactly the set of choices that are allowed when developing a subsolution for a cut set X such that the color c is just above X. The first condition ensures that the color allocation for each subtree sits inside the correct set of cut colors. The second condition says that an internal node must take on any color found above and below it.
Definition 12.
An implicit subcoloring for T' is a choice of (
b(
i),
π(
i))
![[set membership]](/corehtml/pmc/pmcents/x2208.gif)
Δ(
i)
for every i
N(
T')
satisfying the following compatibility property for every k
t(
i):
That is, the color allocation for every node descending from i is a legal color allocation given the choices of b(i) and π(i) made at i.
As described in the Introduction, an implicit subcoloring defines an actual subcoloring via the implicit subcoloring just proximal to leaf nodes. Indeed, say
t(
i)
j is a leaf, and that (
b(
i),
π(
i)) is the color allocation for internal node
i. Then
π(
i)
j is empty or a single element by definition, and the color for leaf
t(
i)
j is used in the subcoloring if
π(
i)
j ≠
![[empty]](/corehtml/pmc/pmcents/empty.gif)
. Every convex subcoloring can be written in this form.
Proposition 1. Implicit subcolorings define convex colorings.
Proof. Assume an implicit subcoloring {(
b(
j),
π(
j))}
i
N(T). Let
χ be the coloring defined by an implicit subcoloring. If
χ is not convex, then there is an edge
e such that

. Say

. Without loss of generality, the colors will be positioned as in one of the two cases depicted in Figure (note that the subtrees marked a, b can contain other colors in addition to these). In case (i), |
B(
π(
i))| ≥ 2, contradicting the definition of an almost partition. In case (ii),
b(
i1) is a by the definition of almost partition because a
B(
π(
i1)). Then
b(
i) = a for every
i between
i1 and
i2, inclusive, by part 2 of Definition 11 and Definition 12. However, b
B(
π(
i2)), contradicting the definition of almost partition. □
With this in mind, we can now speak of the size of an implicit subcoloring as the size of its associated convex subcoloring. The goal, then, is to find the largest implicit subcoloring.
Theorem 1. There is an O(n d β2 2d+β (d-1)dβ/2) complexity algorithm to solve the subcoloring problem for leaf labeled trees.
Proof. For every internal node
i, define the
question domain Q(
i) to be
C × 2
κ(i). An
answer map at internal node
i (resp.
answer size map) is a map
Y → Δ(
i) (resp.
Y →

) for some Y
Q(
i).
We will fill out an answer map
i and an answer size map
ωi as needed at every internal node
i recursively as follows. For a given
i, say we are given a question (
c, X)
Q(
i). If
i is a leaf, then
i(
c, X) =
X and
ωi(
c, X) = |
X|. Otherwise, say there are
descendants of
i. For each (
b, π)
![[set membership]](/corehtml/pmc/pmcents/x2208.gif)
Δ(
i, c, X), find the answers

(
b, πj), and their associated

by recursion. Let
Let
ωi(
c, X) be the maximum value of

, and let
i(
c, X) be the (
b, π) obtaining this maximum, concluding the recursive step. The result of this recursion after starting at the root with every color for
c will be a collection of answer maps for every
i.
These maps define an implicit subcoloring. This can be seen by descending through the tree recursively, using the
i to get almost partitions from questions and passing the resulting questions onto subtrees. Specifically, for question (
c, X) at internal node
i, let (
b(
i),
π(
i)): =
i(
c, X) then recur by passing question (
b(
i)
j, π(
i)
j) to

for every descendant
j. Start at the root (call its index
ρ), pick the color
cρ maximizing
ωρ (
cρ,
![[empty]](/corehtml/pmc/pmcents/empty.gif)
), and begin the recursion with (
cρ,
![[empty]](/corehtml/pmc/pmcents/empty.gif)
).
This subcoloring is maximal by construction. The assertion is clear for 1-leaf trees. Now say that the algorithm finds optimal solutions for all allocations of colors to trees of less than n leaves. Then, given a tree on n leaves and an allocation of colors to the root, the algorithm tries every legal allocation of those colors to each of the subtrees and taxes the maximum thereof. Because every legal color allocation is tried, and the algorithm finds maximal subcolorings for each of the subtrees, the subcoloring for the entire tree must be maximal.
The computation required for a single internal node is as follows. The number of colors in play is bounded above by dβ /2, as each color in play must be cut in at least two edges. Thus, for a given question (c, X) and color b for the internal node, choosing the allocation can take (d - 1)dβ/2 steps for the colors other than b, while deciding where b is present can take 2d trials. There are at most β2β choices of question and dβ/2 choices of internal node color for a given internal node.
There are O(n) internal nodes, giving the bound. □
An upper bound for

can be used to construct a branch and bound recursion as follows.
Algorithm 1 (Branch and bound recursion to find optimal implicit subcoloring).
Assume a function
for all (
b, π)
![[set membership]](/corehtml/pmc/pmcents/x2208.gif)
Δ(
i).
Proceed as in the proof of Theorem 1, with the following modification. For a given internal node i with c
C and X
κ(
i),
find
i(
c, X)
as follows: 1. sort the elements (b, π) of Δ(i, c, X) in decreasing size with respect to vi.
2. proceed down this ordered list as follows, starting with q = 0:
(a) compute 
(b) call the next item in the ordered list (b', π'). If q ≥ vi (b', π') then stop, otherwise recur to (a)
3. let
i(
c, X)
be the (
b, π)
corresponding to q.
The correctness of this algorithm follows directly from Theorem 1, as the only solutions that are thrown away are strictly sub-optimal.
A simple upper bound is the number of leaves that could be used given the restrictions in
π but ignoring convexity. That is, let

be the number of leaves of
T(
i) with colors in
X. Then define

. This upper bound gives significant improvement in time used over the algorithm in Theorem 1 (Figure ).