|Home | About | Journals | Submit | Contact Us | Français|
Neighbor-joining is one of the most widely used methods for constructing evolutionary trees. This approach from phylogenetics is often employed in population genetics, where distance matrices obtained from allele frequencies are used to produce a representation of population relationships in the form of a tree. In phylogenetics, the utility of neighbor-joining derives partly from a result that for a class of distance matrices including those that are additive or tree-like—generated by summing weights over the edges connecting pairs of taxa in a tree to obtain pairwise distances—application of neighbor-joining recovers exactly the underlying tree. For populations within a species, however, migration and admixture can produce distance matrices that reflect more complex processes than those obtained from the bifurcating trees typical in the multispecies context. Admixed populations—populations descended from recent mixture of groups that have long been separated—have been observed to be located centrally in inferred neighbor-joining trees, with short external branches incident to the path connecting their source populations. Here, using a simple model, we explore mathematically the behavior of an admixed population under neighbor-joining. We show that with an additive distance matrix, a population admixed among two source populations necessarily lies on the path between the sources. Relaxing the additivity requirement, we examine the smallest nontrivial case—four populations, one of which is admixed between two of the other three—showing that the two source populations never merge with each other before one of them merges with the admixed population. Furthermore, the distance on the constructed tree between the admixed population and either source population is always smaller than the distance between the source populations, and the external branch for the admixed population is always incident to the path connecting the sources. We define three properties that hold for four taxa and that we hypothesize are satisfied under more general conditions: antecedence of clustering, intermediacy of distances, and intermediacy of path lengths. Our findings can inform interpretations of neighbor-joining trees with admixed groups, and they provide an explanation for patterns observed in trees of human populations.
Distance matrix methods in phylogenetics construct trees of taxa using algorithms applied to matrices that tabulate pairwise evolutionary distances between the taxa.1,2 Among these methods, neighbor-joining3,4 is one of the most popular.5–7 One of its key features is its consistency: if the distance matrix is additive, such that a tree of taxa exists that generates the distances in the matrix, then neighbor-joining recovers this exact tree.5,8,9 Further, neighbor-joining is robust in that theoretical and simulation-based studies have found it to infer sensible trees under a broad range of mathematical and biological conditions.7,9–13
As trees have long been used in population genetics to describe relationships among populations,14,15 the neighbor-joining algorithm has been applied extensively as a population clustering tool, using distance matrices calculated from population-level allele frequencies. In humans, neighbor-joining trees have been and continue to be a regular feature of studies of population relationships.16–20 In population-genetic studies, because migration and admixture sometimes generate evolutionary histories that cannot easily be described by a bifurcating tree of populations, a neighbor-joining tree is treated as a type of population clustering diagram rather than a precise representation of the evolutionary history of the populations.
When neighbor-joining has been used with admixed populations—populations recently descended from two or more source groups that have long been separated—particular characteristics of the inferred trees have often been observed (Fig. 1). For example, one simulation study based on human data identified a reduction in the external branch length leading to an admixed population as the strength of gene flow with other populations was increased.21 It has also been suggested on the basis of observed human population trees that a short external branch for a population on a constructed neighbor-joining tree can imply recent admixture of the population, and that admixed populations often appear in the “middle” of a neighbor-joining tree, on branches incident to paths connecting possible source populations.21–25 This pattern is evident in Fig. 2, in which admixed Mestizo populations from Latin America lie on branches incident to the path connecting Native American and European populations. Here, we seek to understand these results on the behavior of admixed populations in the application of the neighbor-joining algorithm. We therefore apply neighbor-joining to populations that satisfy a simple admixture model, first considering the case in which the distance matrix is additive. Next, for the case of n = 4 taxa, we use a mechanistic mathematical investigation to examine three specific properties of neighbor-joining trees involving an admixed population.
We briefly review the neighbor-joining algorithm.3,4 Consider a set of n taxa, together with a distance function d computed for each pair of taxa, such that the distance between taxa i and j is denoted dij. The algorithm takes as input the distance matrix D containing entries dij, with i and j ranging from 1 to n, and it outputs a bifurcating unrooted tree. D is symmetric (dij = dji), with zeroes on the diagonals (dii = 0) and nonnegative real entries (dij ≥ 0).
As in other agglomerative algorithms that construct bifurcating trees,2,32 at each of a series of steps, the two nearest taxa according to a selection criterion are connected to a new interior node, becoming “neighbors” on the constructed tree. Branch lengths from the new node to the nodes it agglomerates, as well as the distances to all remaining nodes, are then calculated, and a new distance matrix is obtained. This procedure is repeated iteratively until the last three nodes remain, and these three nodes are then connected to a final interior node. Because the last three nodes are always joined, the number of taxa must exceed three for neighbor-joining to have a nontrivial decision at the first step.
At each step, the key decision is the choice of the pair of taxa that are agglomerated. Neighbor-joining uses an n × n matrix Q, containing entries qij for pairs of taxa (i, j):
The two taxa that are agglomerated are those with the minimal value of qij (choosing randomly in case of ties). If taxa i and j are agglomerated, then their distances to the new node u become
The distances of all remaining nodes k to node u are computed as
The next agglomeration then proceeds from an (n − 1) × (n − 1) distance matrix that replaces distances involving nodes i and j with those involving the single node u.
We examine a scenario in which one of the taxa is admixed among two of the others. This taxon can be viewed as having been formed from its two source taxa, such that individual members of the taxon have ancestors in both source groups. We label the taxa t1, t2, …, tn. Without loss of generality, let taxon tn be the admixed group, and suppose that it is an admixture of taxa t1 and t2. The relationships among the remaining n − 3 taxa (t3, t4, …, tn−1) and between these taxa and t1, t2, and tn are not specified; we do not consider any additional admixture relationships that might exist among these taxa. We assume n ≥ 4, so that at least one taxon is considered in addition to t1, t2, and tn.
In a standard statistical model of admixture used in population genetics, allele frequencies in an admixed taxon are given by linear combinations of the allele frequencies of the source taxa.33–37 We denote by λ the proportion of the ancestry of taxon tn arising from t1 and by 1 − λ the corresponding proportion arising from t2, where 0 < λ < 1. For any allelic type, if pti denotes the frequency of the specified allele in taxon ti, then
It follows that if for each of the taxa in a pair, a distance function d is linear in each component of the allele frequency vector at a locus, then the distances between the admixed taxon and other taxa are obtained as linear combinations of corresponding distances involving taxa t1 and t2. Therefore, for 1 ≤ i ≤ n − 1,
Eq. 6 continues to hold if for a series of loci, the distance function d is linear for each taxon in each component of the allele frequency vector at each locus, as would occur if the distance between a pair of taxa at a set of loci were computed as the mean of locus-wise distances that were each linear in the components of the allele frequency vector at the specified locus.
We assume that the distance function supplied to neighbor-joining satisfies eq. 6, and that it is symmetric, nonnegative, and zero if and only if it is computed between a taxon and itself; we otherwise do not concern ourselves with the form of the function. While typical population-genetic distance functions often involve nonlinear relationships with allele frequencies and do not necessarily follow eq. 6—consider the nonlinear graphs in Figure 3 of Boca & Rosenberg,38 which illustrate that for the FST measure and an admixed population tn whose frequencies are linear combinations of those of populations t1 and t2, FST (tn, t1) ≠ λFST (t1, t1) + (1 − λ)FST (t2, t1)—eq. 6 is a natural extension of the ubiquitous eq. 5 from allele frequencies to distance functions. For FST, it can be shown from eqs. 1 and 7 of Boca & Rosenberg38 that for small λ, FST (tn, t1) ≈ λFST (t1, t1) + (1 − λ)FST (t2, t1). Thus, we view eq. 6 as a reasonable first approximation for examining properties of neighbor-joining in an admixture scenario.
Our goal is to construct a distance matrix according to the admixture rule in eq. 6, mechanistically apply neighbor-joining to the matrix, and characterize the properties of the inference process and the resulting inferred tree. We examine two settings. In the first, arbitrarily many taxa are considered, and their distances produce an additive distance matrix (and therefore satisfy a tree metric39). In the second, a general matrix is investigated, with distances that do not necessarily follow a tree metric, but the matrix includes only four taxa.
We first assume that the distance matrix is additive. In this case, by the consistency property of the neighbor-joining algorithm,5,8,9 distances between taxa on the constructed neighbor-joining tree exactly equal those of the input matrix. Denote by the distance function computed for pairs of nodes in the inferred neighbor-joining tree, such that for taxa ti and tj, titj is the sum of the lengths of the branches on the path connecting ti and tj. Recalling that d represents distance in the input distance matrix, if the matrix is additive, then for all (ti, tj),
Because dt1,tn = (1 − λ)dt1,t2 and dt2,tn = λdt1,t2 by eq. 6,
It follows that t1,tn + t2,tn = t1,t2, from which we can infer that taxa t1, t2, and tn are collinear in the inferred neighbor-joining tree, with tn in the interior of the path from t1 to t2.
We can obtain an even stronger result. Consider a case with at least four taxa: t1, t2, tn, and, without loss of generality, t3 (Fig. 3A). In the inferred neighbor-joining tree, a path of length c connects taxon t3 to some point P on the path from t1 to t2 (including the endpoints). Without loss of generality, we can assume that P lies on the path from t1 to tn (including the endpoints). We denote the distances t1,P and P,tn by nonnegative values y and z, respectively. We denote t1,t2 = dt1,t2 = x, for some nonnegative x.
In other words, c + z = λ(c + y) + (1 − λ)(c + x − y). Together with the relationship y + z = (1 − λ)x and the assumption that λ > 0, eq. 11 implies that y = 0. It then follows that taxon t3 lies on a line with taxa t1, t2, and tn. Further, taxon t3 lies on the side of taxon t1 opposite to taxa t2 and tn; otherwise, by eq. 11, we would have (1 − λ)x − c = λc + (1 − λ)(x − c), which requires c = 0. In turn, c = 0 implies t3,t1 = 0, and hence, dt3,t1 = 0, contradicting the assumption that all pairs of taxa are separated by positive distances in the distance matrix.
We have therefore shown that for an additive tree with taxon tn admixed between t1 and t2, any additional taxon beyond t1, t2, and tn must be collinear with t1, t2, and tn, and must lie exterior to the path connecting t1 and t2. Thus, each additional taxon t3, t4, …, tn−1 is connected to t1 or t2 by an external branch. The admixture model together with the assumption of an additive distance matrix imposes such a strong restriction on the set of allowed distance matrices that it forces all taxa onto a highly constrained tree (Fig. 3B). When we consider the placement of each taxon t3, t4, …, tn−1, we find that this tree has two multifurcating nodes separated by a line that joins taxa t1 and t2, with tn as the only intervening taxon.
The additive case can assist in explaining phenomena observed empirically with admixed populations in the application of neighbor-joining:21–25 in the additive case, tn has external branch length 0, a result compatible with the short external branches detected for admixed taxa. Further, tn lies on the path connecting t1 and t2, compatible with the observation that admixed taxa lie in the “middle” of inferred neighbor-joining trees, with external branches incident to the paths connecting their source taxa. We can thus see that the empirical Fig. 2 resembles Fig. 3B, as the short internal branches among Native Americans and Europeans give rise to a shape with near multifurcations on each side of the admixed Mestizo groups.
The additive case is restrictive and atypical of the population-genetic context, in which migration and admixture generate non-tree-like evolution. We can then consider the more general setting of arbitrary genetic distance matrices with positive entries, examining the smallest nontrivial case, with n = 4 taxa. In this case, the admixed taxon is t4, with source taxa t1 and t2. We set the distances among taxa t1, t2, and t3 to be dt1,t2 = x, dt1,t3 = y and dt2,t3 = z, for some positive x, y, and z. Employing eq. 6, the distance matrix D has the form:
Using eq. 1 to calculate the matrix Q used in deciding which taxa will agglomerate, we obtain
Examining the relationships among q1, q2, and q3, we have that
As in the work of Eickmeyer & Yoshida,40 we partition the four-dimensional space of possible values of (λ, q1, q2, q3) according to the tree topologies produced by neighbor-joining.
Three tree topologies are possible with the four taxa (Fig. 4). In Fig. 4A, taxon t4 is separated by three edges from taxa t1 and t2, which themselves are separated by only two edges. In Fig. 4B, t4 is separated by two edges from t2 and by three edges from t1; t1 and t2 are separated by three edges. Taxa t1 and t2 are also separated by three edges in Fig. 4C, but t4 is instead separated by two edges from t1 and by three edges from t2.
Seven possibilities exist for the smallest entry of Q: (1) q1, (2) q2, (3) q3, (4) q1 and q2 (tied), (5) q1 and q3 (tied), (6) q2 and q3 (tied), and (7) q1, q2, and q3 (all tied). Each choice leads to a particular outcome among the three tree topologies in Fig. 4, with two or more topologies being possible outcomes in cases that involve ties. For each value among q1, q2, and q3, two pairs of taxa produce the same value in the matrix Q. It can be shown that in each case, either choice of which pair is first to agglomerate leads to the same inferred tree. Without loss of generality, we choose the pair that does not include taxon t3.
Four of the seven cases are not possible. In case 1, summing x + z < y and x + y < z in eqs. 17 and 18, we obtain x < 0. In case 4, setting q1 = q2 in eq. 17, x + z = y, from which we obtain x < 0 using x + y < z in eq. 18. Similarly, in case 5, q1 = q3 in eq. 18 produces x < 0 using x + z < y in eq. 17. In case 7, eqs. 17–19 become equalities, leading to x = 0 when x + z = y is substituted into eq. 19. All of these cases contradict the assumption that x > 0.
We consider the three allowable cases (cases 2, 3, and 6). For each of the possible inferred neighbor-joining trees, denote by u the unique interior node that places taxa t1, t2, and t4 in distinct subtrees (Fig. 4). Denote by the distance between nodes on the inferred tree. In case 2, q2 is smallest, taxa t2 and t4 agglomerate first, and using eqs. 2–4, we obtain
We can show that u,t4 < u,t2 and u,t4 < u,t1. The first of these two inequalities is equivalent to λy < λ(x + z), which holds because λ > 0, and because y < x + z by eq. 17. For the second inequality, note first that y < (1 − 2λ)x + z by eq. 19. Substituting the right-hand side in place of y in eq. 21, u,t4 is less than [2λ(1 − λ)]x/4, which in turn is less than u,t1 because 0 < λ < 1.
Similarly to case 2, we show u,t4 < u,t1 and u,t4 < u,t2. The first inequality is equivalent to (1 − λ)z < (1 − λ)(x + y), which holds because λ < 1, and because z < x + y by eq. 18. For the second equality, (1 − 2λ)x + z < y by eq. 19. Substituting the left-hand side in place of y in eq. 24, u,t4 is less than [(2λ(1 − λ)]x/4, which in turn is smaller than u,t2 because 0 < λ < 1.
Finally, in case 6, q2 and q3 are tied with the smallest values, and either t2 and t4 agglomerate first as in case 2, or t1 and t4 agglomerate first as in case 3. Neighbor-joining produces the tree in Fig. 4C with probability 1/2, and the tree in Fig. 4B with probability 1/2. With either choice, the same arguments used to demonstrate u,t4 < u,t1 and u,t4 < u,t2 in cases 2 and 3 apply, except that y is equal to (instead of greater than or less than) (1 − 2λ)x + z.
This collection of results demonstrates three phenomena for four-taxon trees built from distance matrices formed according to our admixture model. (1) The admixed taxon agglomerates with one of its two source taxa before the sources agglomerate with each other. Cases 2, 3, and 6 are the only ones allowable, and in these cases, the first neighbor-joining step agglomerates the admixed taxon t4 with one of the sources. (2) Denoting by u the unique node for which the admixed taxon and its source taxa all lie in different subtrees, the distance on the neighbor-joining tree of the admixed taxon to u is smaller than the distances to u of both source taxa. We demonstrated this result in each of the allowed cases, and it therefore holds in general. (3) The number of edges separating the source taxa on the inferred neighbor-joining tree, for each source taxon, is greater than or equal to the number of edges separating the admixed taxon from the source taxon. Only the trees in Figs. 4B and 4C are possible outcomes of neighbor-joining in our model, and the result holds for each of these trees.
Using the four-taxon results, we can formally define three properties of a distance matrix and its resulting neighbor-joining tree. The properties are well-defined for arbitrary n, and it is possible to evaluate whether a given n-taxon distance matrix satisfies them when neighbor-joining is applied. All three properties are possessed by all matrices generated by the four-taxon case of our admixture model.
The admixed taxon clusters with one of its source taxa before the source taxa cluster together. Stated precisely, some clade containing t1 but not t2 or tn merges with some clade containing tn but not t1 or t2, or, some clade containing t2 but not t1 or tn merges with some clade containing tn but not t1 or t2, before any clade containing t1 but not t2 or tn merges with any clade containing t2 but not t1 or tn.
Here we allow a clade to have any size, and potentially only a single taxon. In identifying the steps at which t1, t2, and tn merge into the neighbor-joining tree, as in our four-taxon case, to ensure that these taxa do not all merge simultaneously at the final stage, we adopt the convention that if a four-taxon stage is reached in which t1, t2, and tn lie in separate subtrees, we choose to agglomerate two among these three subtrees rather than agglomerating the third one with the unique available subtree that does not contain t1, t2, or tn.
The distance on the constructed neighbor-joining tree between the admixed taxon and either of its source taxa is smaller than the corresponding distance between the two source taxa. That is, t1,tn < t1,t2 and t2,tn < t1,t2. Equivalently, if u is the unique node in the constructed neighbor-joining tree for which t1, t2, and tn lie in different subtrees, then u,tn < u,t1 and u,tn < u,t2.
The number of edges separating the source taxa in the constructed neighbor-joining tree is greater than or equal to the number of edges separating the admixed taxon and either source taxon. If we define ij as the number of edges in the path separating nodes i and j in the inferred tree, then t1,t2 ≥ t1,tn and t1,t2 ≥ t2,tn.
We have already demonstrated that in our admixture model, Properties 2 and 3 hold for all distance matrices in the n-taxon additive case; for Property 2, using eqs. 8 and 9 and 0 < λ < 1, t1,tn < t1,t2 and t2,tn < t1,t2. For Property 3, we have shown that for an n-taxon additive distance matrix, taxon tn lies on the interior of the path connecting t1 and t2, and it is the only taxon so located. Thus, t1,t2 = 2, while t1,tn = t2,tn = 1, and Property 3 holds.
We have examined neighbor-joining in a model in which an admixed taxon is produced from two source taxa, finding that for a four-taxon scenario, distance matrices and their resulting trees possess three properties: antecedence of clustering, in which the admixed population clusters with one of the sources before the sources cluster with each other; intermediacy of distances, in which the distance on the constructed tree between the admixed taxon and either source taxon is less than the distance between the sources; and intermediacy of path lengths, in which the number of edges separating the admixed taxon and either source taxon is no larger than the number of edges separating the sources. We have further shown that for an arbitrary number of taxa, the latter two properties hold when the distance matrix is additive.
By a mechanistic examination, we have found that our model has features seen in empirical observations of neighbor-joining trees that involve admixed populations. In particular, the placement of admixed populations on short external branches incident to the paths connecting their source populations21–25 matches the demonstration in the additive and four-taxon cases of the intermediacy of distances and intermediacy of path lengths properties. The theoretical approach validates the view that populations that are centrally located on neighbor-joining trees and that possess short external branches might be recently admixed.
Our results suggest a broader investigation of the extent to which the three properties hold with an arbitrary number of taxa. We have not reported a result regarding antecedence of clustering in the n-taxon additive case, nor have we commented on any of the properties for general n-taxon distance matrices that are not necessarily additive. However, we expect that Properties 1–3 will be satisfied by our admixture model considerably more often than in a model in which no special constraints are imposed on distances that involve the nth taxon. As our model also involves an nth taxon with special features, the general analysis of the model might benefit from the “rogue taxon” framework of Cueto & Matsen,41 in which the addition of an nth taxon alters the tree produced for an initial group of n − 1 taxa.
An additional direction is to study alternative admixture models. Distance methods are most sensible when a distance is nearly additive; however, eq. 6 severely restricts the distance matrix, as it forces a structure with two multifurcating nodes. This aspect of the model can be relaxed by assuming that the distance is additive for taxa t1, t2, …, tn−1, and that only distances involving tn satisfy eq. 6. For 1 ≤ i ≤ n − 1, we can then apply the distance
With this distance function, tn is simply placed on the path from t1 to t2 in a preexisting tree relating taxa t1, t2, …, tn−1 (Fig. 5). Properties 2 and 3 continue to hold.
To obtain eq. 26, we first suppose that t1, t2, …, tn−1 have an additive distance matrix. We wish to place taxon tn on the tree that generates the matrix so that the matrix for t1, t2, …, tn is additive. First, given λ, tn is placed on the path from t1 to t2 such that eqs. 8 and 9 are satisfied. It remains to compute tn,ti for i = 3, 4, …, n − 1. Denote by u the unique node of the tree that places t1, t2, and ti in distinct subtrees (Fig. 5). Then
If t1,tn ≤ t1,u, then tn lies on the path from t1 to u (Fig. 5A), and
If, on the other hand, t1,tn ≥ t1,u, then tn lies on the path from t2 to u (Fig. 5B), and
Analysis of the three properties using this modified form for the admixture model, or more generally using specific distance functions commonly employed in population genetics, will further illuminate the features of neighbor-joining in admixed populations. Such analyses might also facilitate investigations of the behavior with admixed populations of other tree-building methods, or of phylogenetic network methods42 that are more directly designed to accommodate taxa with non-tree-like evolutionary histories.
We thank M. Feldman and four reviewers for comments. This work has been supported by National Institutes of Health grant R01 GM081441, by National Science Foundation grants BCS-1024627 and DBI-1146722, and by the Burroughs Wellcome Fund.
NAAMA M. KOPELMAN, Porter School of Environmental Studies, Department of Zoology, Tel Aviv University, Ramat Aviv, Israel.
LEWI STONE, Porter School of Environmental Studies, Department of Zoology, Tel Aviv University, Ramat Aviv, Israel.
OLIVIER GASCUEL, Méthodes et Algorithmes pour la Bioinformatique, LIRMM-CNRS, Montpellier, France.
NOAH A. ROSENBERG, Department of Biology, Stanford University, Stanford, California, USA.