Home | About | Journals | Submit | Contact Us | Français |

**|**Algorithms Mol Biol**|**v.7; 2012**|**PMC3393622

Formats

Article sections

Authors

Related links

Algorithms Mol Biol. 2012; 7: 14.

Published online 2012 May 21. doi: 10.1186/1748-7188-7-14

PMCID: PMC3393622

Frederick A Matsen : gro.crchf@nestam; Steven N Evans: ude.yelekreb.tats@snave

Received 2011 October 5; Accepted 2012 May 21.

Copyright ©2012 Matsen and Evans; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

There are several common ways to encode a tree as a matrix, such as the adjacency matrix, the Laplacian matrix (that is, the infinitesimal generator of the natural random walk), and the matrix of pairwise distances between leaves. Such representations involve a specific labeling of the vertices or at least the leaves, and so it is natural to attempt to identify trees by some feature of the associated matrices that is invariant under relabeling. An obvious candidate is the spectrum of eigenvalues (or, equivalently, the characteristic polynomial).

We show for any of these choices of matrix that the fraction of binary trees with a unique spectrum goes to zero as the number of leaves goes to infinity. We investigate the rate of convergence of the above fraction to zero using numerical methods. For the adjacency and Laplacian matrices, we show that the *a priori *more informative immanantal polynomials have no greater power to distinguish between trees.

Our results show that a generic large binary tree is highly unlikely to be identified uniquely by common spectral invariants.

Tree shape theory furnishes numerical statistics about the structure of a tree [1,2]. (Because we are interested in applications of tree statistics to trees that describe the structure of branching events in evolutionary histories, we will, for convenience, always take the term *tree *without any qualifiers to mean a **rooted, binary tree without any branch length information or labeling of the vertices**.) Such statistics have two related uses. Firstly, they can be used in an attempt to tell whether two trees are actually the same and, secondly, they can be used to indicate the degree of similarity between two trees with respect to some criterion.

Examples of the latter use are the testing of hypotheses about macroevolutionary processes and the detection of bias in phylogenetic reconstruction. Historically, numerical statistics for such purposes have attempted to capture the notion of the *balance *of a tree, which is the degree to which daughter subtrees are the same size. The balance is typically measured by ad-hoc formulae that are often selected for statistical power to distinguish between two different distributions on trees [3,4]. In previous work we investigated the possibility of describing the shape of the tree using a list of numbers rather than just a single number [5,6].

A mathematically "canonical" approach to finding a list of such numbers is to use information derived from matrix representations of the trees. We first describe the matrix representations of a tree that we will consider.

In algebraic graph theory [7], the basic matrix associated to a graph is the *adjacency matrix A*(*G*), whose *ij*^{th }entry is one if *i *and *j *are connected by an edge, and zero otherwise. From a probabilistic point of view, the more natural matrix to associate with a graph is the *Laplacian matrix L*(*G*), which is the infinitesimal generator of the natural random walk on the graph and is given by *A*(*G*) - *D*(*G*), where *D*(*G*) is the diagonal matrix of vertex degrees. It is clear that a graph can be recovered from either its adjacency of Laplacian matrix. Some authors, such as [8], define the Laplacian to be *D*(*G*)^{-1/2}*L*(*G*)*D*(*G*)^{1/2}. Note that this difference is not relevant if one is only considering characteristics of the matrix *L *such as eigenvalues that are invariant under similarity transformations.

Readers in the phylogenetics community may be more familiar with the *pairwise distance matrix *[1,9]. The distance matrix *P *given a leaf-labeling 1, . . ., *n *has as its *ij*^{th }entry the length of the path between leaf *i *and leaf *j*. Any leaf-labeled tree is uniquely determined by its distance matrix. These matrices have also been extensively studied as discrete metric spaces [10,11].

The definition of the adjacency and Laplacian matrices requires a numbering of the vertices, while the definition of the distance matrix requires a numbering of the leaves. Because we are considering unlabeled trees (that is, we identify trees that are equivalent in the usual sense of graph-theoretic isomorphism), we are only interested in tree statistics that are invariant under renumbering. Algebraically, this means that we are only interested in features of the associated matrix that are unaffected by similarity transformations via a permutation matrix. The most obvious such statistics are the eigenvalues.

The adjacency and Laplacian matrices and their eigenvalues are familiar objects in the area of spectral graph theory [7,8,12]. The eigenvalues of the adjacency matrix tend to contain combinatorial information about the graph, such as bounds on the chromatic number. The eigenvalues of the Laplacian give information of a more geometric flavor, such as the equivalent of the surface area to volume ratio of subgraphs of a graph. As well as having connections to the theory of random walks on graphs, the Laplacian eigenvalues can be used to define the expander graphs, an important class of graphs that have applications in coding theory. Therefore, it would not be too surprising if the these eigenvalues were a convenient way to summarize information about a tree, thus giving a nice collection of tree statistics.

Similarly, it seems plausible that the eigenvalues of the pairwise distance matrix could contain quite a lot of information about the tree that could be used to compare trees. Moreover, although the distance matrix formally contains the same information as the adjacency or Laplacian matrices, the transformation that takes the distance matrix to one of the other two is distinctly non-linear, and hence there is no reason to believe that there is any simple connection between the corresponding eigenvalues.

We demonstrate below that not only do there exist pairs of trees that have the same spectrum as another tree for the adjacency, Laplacian, and distance matrices, but that this is the rule rather than the exception as the trees become large, in the sense that the fraction of trees with a given number of leaves that have a unique adjacency, Laplacian, or distance spectrum goes to zero as the number of leaves goes to infinity.

The basic methodology that we use to prove this result was first established in [13] and developed in [14] for general (that is, not necessarily bifurcating) graph-theoretic trees in the case of the adjacency and Laplacian matrices. The present paper provides the first results of this type concerning rooted bifurcating trees, as well as the first examination of such results for the pairwise distance matrix. The key idea is to establish that certain pairs of trees *T*_{1 }and *T*_{2 }have the following *exchange property *for a given matrix representation: that exchanging *T*_{1 }for *T*_{2 }as subtrees of a given tree does not change the spectrum for that matrix representation. This is a stricter requirement than simply having the same spectrum (Figure (Figure1).1). It then becomes a matter of showing that the number of trees with a given number of leaves is asymptotically of larger order than the number of trees with the same number of leaves that don't have a particular subtree. For this we build on the generating function argument used in [15] for asymptotic estimates of the number of unlabeled rooted bifurcating trees (see the section *Asymptotic numbers of trees*).

One possible explanation for this phenomenon is that two diagonalizable matrices have the same spectrum if they are similar via an arbitrary similarity transformation rather than just via a permutation transformation, and this suggests considering features of a matrix that are invariant under permutation similarities but not more general ones. We will now describe a feature of a matrix, its *immanantal polynomial, *that has this property.

The *immanant *is a generalization of the determinant. Recall that the determinant of a matrix *A *= (*a*_{ij}) is given by

$$\text{det}\left(A\right):=\sum _{\sigma {S}_{n}}$$

where the sum is over the symmetric group of permutations of {1, 2,..., *n*} and sgn(*σ*) is the *sign *of the permutation *σ.*

The function sgn is a particular example of a *character *of an *irreducible representation *of the symmetric group. It would take us too far afield to define these notions here, but excellent treatments may be found in [16-19]. We note, however, that the irreducible characters are constant on the conjugacy classes of the symmetric group (recall that two permutations belong to the same conjugacy class if and only if they have the same cycle structure) and they form a basis for the vector space of functions with this property (the *class functions).*

Our use of characters is simply to define the immanant

$${\mathcal{I}}_{\chi}\left(A\right):=\sum _{\sigma {S}_{n}}$$

of a matrix *A *for the irreducible character χ. A discussion of immanants may be found in [20,21]. The *immanantal polynomial *for a character *χ *of a matrix is the corresponding generalization of the characteristic polynomial; that is, it is the polynomial $x{\mathcal{I}}_{\chi}\left(xI-A\right)$. Because the characters are class functions, an immanantal polynomial is invariant under similarity by permutation matrices, but it will not typically be invariant under more general similarities.

Unfortunately, as we show in Lemma 2, for either the adjacency or Laplacian matrix the following two conditions on a pair of trees are equivalent:

• the spectra are equal,

• the immanantal polynomials are equal for all irreducible characters.

Consequently, the immanantal polynomials for the adjacency and Laplacian matrices provide no more distinguishing power than the spectra and, in particular, a vanishing fraction of large trees have a unique set of immanantal polynomials for these matrices. We do not know if the same fact is true for the immanantal polynomials of the distance matrix.

Our main result is thus the following.

**Theorem 1**. *Let t*_{n }*be the number of trees with n leaves. For either the adjacency, Laplacian, or pairwise distance matrix, let l*_{n }*be the number of trees with n leaves that do not share their spectrum with another tree. Then, the fraction l*_{n}*/t*_{n }*goes to zero as n goes to infinity. For the adjacency and Laplacian matrices, the same result holds if we replace the spectrum by the complete set of immanantal polynomials*.

The rate of convergence of the fraction in Theorem 1 is also of interest. If it is extremely slow then the existence of trees with shared spectra may not be practically relevant for the construction of informative tree shape statistics. We investigate this matter numerically towards the end of the paper.

Several other research groups have investigated problems that are related to, but different from, those investigated here. Steyaert and Flajolet [22] investigate the occurrences of subtrees in the case of "planar" trees, i.e. trees that are equipped with an order of subtrees at every internal node. These planar trees are substantially easier to analyze: for example, there is a nice closed form generating function for the numbers of such trees (the Catalan numbers). In contrast, the generating function for the number of trees considered here is only given as the solution of a functional equation and there is no closed form expression for the numbers of such trees. Graham and Lovasz [23] investigate the spectra of distance matrices of trees, but their distance matrices are defined in terms of vertex-to-vertex distances, rather than the leaf-to-leaf definition. The leaf-to-leaf distance matrix is a principal sub-matrix of the vertex-to-vertex one, and there are interlacing relations between the spectrum of a matrix and one of its principal submatrices (2). However it is not a priori the case that if two matrices have the same spectrum, then the principal submatrices with the same rows and columns will also have the same spectrum. In a similar vein, the vertex-to-vertex distance matrix can be constructed from the leaf-to-leaf distance, but the construction involves considering whether certain linear inequalities hold and so it isn't a procedure that will, a priori, transform spectra in a simple way.

More recently, Bhamidi, Evans, and Sen [24] have proven that, subject to weak general conditions, many ensembles of random trees have the property that, with probability converging to one as the number of leaves goes to infinity, a realization shares its spectrum with another tree. Their conditions are easiest to check when the ensemble can be embedded in a general continuous-time branching process where individuals give birth to a possibly random number of offspring at the arrival times of a point process up to a possibly infinite death time, and those offspring go on to behave as independent copies of their parent. This particular framework covers examples such as random recursive trees, linear preferential attachment trees, uniform rooted unordered labeled trees, and Yule trees. However, we have been unable to embed the ensemble considered here in a suitable continuous-time branching process or otherwise check the general conditions of [24].

The computer code used in this paper was written in `OCaml`[25] and has been made available at http://github.com/matsen/ubiquity_synonymity, along with the results produced by this code.

In order to prove results for the adjacency and Laplacian matrices simultaneously, we define for a tree *T *and arbitrary real numbers *y *and *z *the *generalized Laplacian *$\tilde{L}\left(T\right):=yD\left(T\right)+zA\left(T\right)$ (recall that *A*(*T*) is the adjacency matrix and *D*(*T*) is the diagonal matrix of vertex degrees). We define the corresponding *generalized Laplacian immanantal polynomial *of the tree *T *with *r *vertices to be

$$x{\mathcal{I}}_{\chi}\left(xI-\tilde{L}\left(T\right)\right)$$

for an irreducible character *χ *of the symmetric group *S*_{r}.

The generalized Laplacian immanantal polynomial can be computed in a simple combinatorial fashion as follows. Define a *k-matching *to be a set of *k *pairwise disjoint edges of the tree (that is, a set of edges such that no two share a common vertex). Let *M*_{k}(*T*) denote the set of *k*-matchings on the tree *T*. We think of an edge as a pair of vertices, so when we use the notation $ip$ for a vertex *i *and a matching *p*, we mean that *i *is not one of the ends of any edge in *p. *Let *C*_{k }denote the conjugacy class of the symmetric group *S*_{r }consisting of permutations that are the product of *k *disjoint transpositions, and write χ(*C*_{k}) for the common value of the character *χ *on such permutations. The following lemma appears in [14] and is included for completeness.

**Lemma 1**. *The generalized Laplacian immanantal polynomial of the tree T for the character χ is given by*

$$\sum _{k\ge 0}\chi \left({C}_{k}\right){z}^{2k}\sum _{p{M}_{k}\left(T\right)}$$

*where d*_{i}(*T*) *is the degree of vertex i in the tree T*.

*Proof. *Set $M:=xI-\tilde{L}\left(T\right)=\left({m}_{ij}\right)$ so that the generalized Laplacian immanantal polynomial is

$$\sum _{\sigma {S}_{n}}$$

(1)

The matrix entries *m*_{ij }are zero unless *i *= *j *or there is an edge between *i *and *j. *If the permutation *σ *has a cycle of length 3 or greater, then corresponding term in (1) must be zero because otherwise the tree would have a loop. Therefore we need only consider permutations that are products of disjoint transpositions where, moreover, each transposition exchanges the two vertices of an edge. Such a permutation is equivalent in an obvious way to a *k*-matching for some *k*, and the lemma follows.

**Lemma 2**. *Two trees have the same spectrum for their generalized Laplacian if and only if they have the same generalized Laplacian immanantal polynomial for all characters.*

*Proof. *One direction is trivial: if two trees have the same generalized Laplacian immanantal polynomial for all characters, then their generalized Laplacians have the same characteristic polynomial and hence the same spectrum.

Conversely, if the generalized Laplacians of two trees have the same spectrum, then the characteristic polynomials of the generalized Laplacians are the same. Lemma 1 in the case *χ *= sgn, the fact that sgn(*C*_{k}) = ±1 ≠ 0 for all *k*, and the fact that two equal polynomials have the same coefficients imply that the quantity

$$\sum _{p{M}_{k}\left(T\right)}$$

is the same for both trees for all *k. *Another application of Lemma 1 completes the proof.

We use the phylogenetic rather than graph-theoretic definition of a subtree as follows. Define the *distal vertex *of an edge to be the vertex farthest from the root that touches that edge. Then, a subtree of a given rooted tree is what results from separating an edge from its distal vertex, which then becomes the root of the subtree.

Recall that *M*_{k}(*T*) is the set of *k*-matchings of the tree *T*. Let *N*_{k}(*T*) be the set of *k*-matchings where the chosen edges do not contact the root.

Define

$$\begin{array}{c}{P}_{k}\left(T\right):=\sum _{p{M}_{k}\left(T\right)}\end{array}$$

The following lemma is implicit in [14], but again we include a proof for completeness.

**Lemma 3**. *Let S*_{1 }*and S*_{2 }*be trees with the same number of leaves. If P*_{k}(*S*_{1}) = *P*_{k}(*S*_{2}) *and Q*_{k}(*S*_{1}) = *Q*_{k}(*S*_{2}) *for all k, then any tree with S*_{1 }*as a subtree has the same generalized Laplacian spectrum as the tree obtained by substituting S*_{2 }*for S*_{1}.

*Proof*. Let *T*_{1 }be a tree with *S*_{1 }as a subtree, and write *T*_{2 }for the tree obtained by substituting *S*_{2 }for *S*_{1}. Denote by *e*_{0 }the edge that connects the rest of *T*_{1 }(resp. *T*_{2}) to the root of *S*_{1 }(resp. *S*_{2}).

We differentiate between two types of *k*-matchings of *T*_{i}: those that contain *e*_{0 }and those that do not. Note that a *k*-matching of *T*_{i }that **does not **contain *e*_{0 }restricts to an -matching of *S*_{i }for some , and all matchings of *S*_{i }arise via such a restriction. Similarly, a *k*-matching of *T*_{i }that does contain *e*_{0 }restricts to an -matching of *S*_{i }with the property that the root of *S*_{i }does not belong to any edge in the matching, and all matchings of *S*_{i }with this property arise via such a restriction.

Consider the formula for the characteristic polynomial of the generalized Laplacian matrix that comes from Lemma 1 with *χ *= sgn. Apply this formula to *T*_{1 }and *T*_{2}. The assumption *P*_{k}(*S*_{1}) = *P*_{k}(*S*_{2}) (resp. *Q*_{k}(*S*_{1}) = *Q*_{k}(*S*_{2})) ensures that the matchings that do not include (resp. do include) *e*_{0 }make the same contribution to the respective characteristic polynomials.

The trees depicted in Figure 1 (b) are the smallest pair of rooted bifurcating trees satisfying the criteria of this lemma. The verification of this fact was done by computer, and the corresponding *P*_{k }and *Q*_{k }polynomials are available in the code repository.

We first recall an identity for determinants of partitioned matrices. If

$$C=\left(\begin{array}{cc}\hfill {C}_{11}\hfill & \hfill {C}_{12}\hfill \\ \hfill {C}_{21}\hfill & \hfill {C}_{22}\hfill \\ \hfill \hfill \end{array}\right),$$

then

$$\begin{array}{c}detC=\text{det}\left(\left(\begin{array}{cc}\hfill I\hfill & \hfill -{C}_{12}{C}_{22}^{-1}\hfill \\ \hfill 0\hfill & \hfill I\hfill \\ \hfill \hfill \end{array}\right)C\left(\begin{array}{cc}\hfill I\hfill & \hfill 0\hfill \\ \hfill -{C}_{22}^{-1}{C}_{21}\hfill & \hfill I\hfill \\ \hfill \hfill \end{array}\right)\right)\\ =\text{det}\left(\begin{array}{cc}\hfill {C}_{11}-{C}_{12}{C}_{22}^{-1}{C}_{21}\hfill & \hfill 0\hfill \\ \hfill 0\hfill & \hfill {C}_{22}\hfill \\ \hfill \hfill \end{array}\right)\\ \phantom{\rule{1.5em}{0ex}}=\text{det}\left({C}_{22}\right)\text{det}\left({C}_{11}-{C}_{12}{C}_{22}^{-1}{C}_{21}\right)\\ \phantom{\rule{1.5em}{0ex}}=\text{det}\left({C}_{11}\right)\text{det}\left({C}_{22}-{C}_{21}{C}_{11}^{-1}{C}_{21}\right).\\ \end{array}$$

(2)

**Lemma 4**. *Form two trees T*_{1 }*and T*_{2 }*by gluing the roots of trees S*_{1 }*and S*_{2 }*with distance matrices A*_{1 }*and A*_{2 }*onto the same leaf of a common tree R. Write a*_{i }*for the column vector of distances from the leaves of S*_{i }*to the root of S*_{i}. *Suppose that the following pairs of matrices have the same spectra (where ' denotes transpose):*

$$\begin{array}{c}{A}_{i},\phantom{\rule{1em}{0ex}}i=1,2,\\ \left(\begin{array}{cc}\hfill {A}_{i}\hfill & \hfill {a}_{i}\hfill \\ \hfill {{a}^{\prime}}_{i}\hfill & \hfill 0\hfill \\ \hfill \hfill \end{array}\right),\phantom{\rule{1em}{0ex}}i=1,2,\\ \left(\begin{array}{cc}\hfill {A}_{i}\hfill & \hfill {a}_{i}\hfill \\ \hfill {1}^{\prime}\hfill & \hfill 0\hfill \\ \hfill \hfill \end{array}\right),\phantom{\rule{1em}{0ex}}i=1,2,\\ \end{array}$$

and

$$\left(\begin{array}{cc}\hfill {A}_{i}\hfill & \hfill 1\hfill \\ \hfill {1}^{\prime}\hfill & \hfill 0\hfill \\ \hfill \hfill \end{array}\right),\phantom{\rule{1em}{0ex}}i=1,2,$$

*where ***1 ***is a column vector with each entry *1. *Then, the distance matrices of T*_{1 }*and T*_{2 }*have the same spectrum*.

*Proof*. Write *B *for the distance matrix of *R*. Then, *B *has the partitioned form

$$\left(\begin{array}{c}\hfill \stackrel{}{B}\hfill b\hfill \hfill & \hfill {b}^{\prime}\hfill & \hfill 0\hfill \\ \hfill \hfill \end{array},,\right)$$

where $\stackrel{}{B}$ is the distance matrix of the tree obtained from *R *by deleting the last leaf, *b *is the column vector of distances from the other leaves of *R *to the last leaf. Assume without loss of generality that this last leaf is the attachment point of the *S*_{i}.

Denote by *D*_{i }the distance matrix of *T*_{i}. Observe that

$${D}_{i}=\left(\begin{array}{c}\hfill \stackrel{}{B}\hfill b{1}^{\prime}+1{a}_{i}^{\prime}\hfill \hfill & \hfill {a}_{i}{1}^{\prime}+1{b}^{\prime}\hfill & \hfill {A}_{i}\hfill \\ \hfill \hfill \end{array},.\right)$$

Hence, by (2), *D*_{i }has the characteristic polynomial

$$\begin{array}{ll}\hfill \text{det}\left(xI-{D}_{i}\right)& =\text{det}\left(xI-{A}_{i}\right)\text{det}[(xI-\stackrel{}{B})-\left(-b{1}^{\prime}-1{a}_{i}^{\prime}\right){\left(xI-{A}_{i}\right)}^{-1}\left(-{a}_{i}{1}^{\prime}-1{b}^{\prime}\right)]\phantom{\rule{2em}{0ex}}\hfill & \phantom{\rule{2em}{0ex}}& \hfill & =\text{det}\left(xI-{A}_{i}\right)\text{det}\left[(xI-\stackrel{}{B}),\phantom{\rule{2em}{0ex}}\right)\hfill & \phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2.77695pt}{0ex}}\phantom{\rule{2.77695pt}{0ex}}\phantom{\rule{2.77695pt}{0ex}}\phantom{\rule{2.77695pt}{0ex}}-\left({1}^{\prime}{\left(xI-{A}_{i}\right)}^{-1}{a}_{i}\right)b{1}^{\prime}\phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}\\ \hfill & \phantom{\rule{2.77695pt}{0ex}}\phantom{\rule{2.77695pt}{0ex}}\phantom{\rule{2.77695pt}{0ex}}\phantom{\rule{2.77695pt}{0ex}}-\left({1}^{\prime}{\left(xI-{A}_{i}\right)}^{-1}1\right)b{b}^{\prime}\phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}\\ \hfill & \phantom{\rule{2.77695pt}{0ex}}\phantom{\rule{2.77695pt}{0ex}}\phantom{\rule{2.77695pt}{0ex}}\phantom{\rule{2.77695pt}{0ex}}-\left({a}_{i}^{\prime}{\left(xI-{A}_{i}\right)}^{-1}{a}_{i}\right)1{1}^{\prime}\phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}\\ \hfill & \left(\phantom{\rule{2.77695pt}{0ex}}\phantom{\rule{2.77695pt}{0ex}}\phantom{\rule{2.77695pt}{0ex}}\phantom{\rule{2.77695pt}{0ex}}-\left({a}_{i}^{\prime}{\left(xI-{A}_{i}\right)}^{-1}1\right)1{b}^{\prime}\right].\phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}\\ \hfill & \phantom{\rule{2em}{0ex}}& \hfill \end{array}$$

Using (2) again, we see that a partitioned matrix of the form

$$\left(\begin{array}{cc}\hfill A\hfill & \hfill g\hfill \\ \hfill {h}^{\prime}\hfill & \hfill 0\hfill \\ \hfill \hfill \end{array}\right),$$

where *g *and *h *are column vectors, has characteristic polynomial

$$\text{det}\left(xI-A\right)\left[x-{h}^{\prime}{\left(xI-A\right)}^{-1}g\right],$$

and the result follows.

It was verified by computer that the trees in Figure 1 (b) are the smallest such that have distance matrices *A*_{i }and vectors *a*_{i }satisfying the criteria of this lemma. The corresponding characteristic polynomials are available in the code repository. We note with surprise that the smallest pair of trees with the exchange property for the distance matrix are the same as the smallest pair with the exchange property for the generalized Laplacian; this is a curiosity for which we do not have an explanation.

As outlined in the Introduction, the proof of Theorem 1 follows immediately from Lemma 2, Lemma 3, Lemma 4, the existence of the trees in Figure 1 (b), and the following result.

**Proposition 1**. *Let T be a rooted tree. Let t*_{n }*be the number of trees with n leaves. Let s*_{n }*be the number of such trees that do not contain T as a rooted subtree. Then, the fraction s*_{n}*/t*_{n }*goes to zero as n goes to infinity.*

*Proof. *Suppose that *T *has *a *leaves. Let $f\left(x\right):={\sum}_{i=1}^{\infty}{t}_{i}{x}^{i}$ and ${f}_{a}\left(x\right):={\sum}_{i=1}^{\infty}{s}_{i}{x}^{i}$ denote the generating functions for *t*_{n }and *s*_{n}, respectively. Write *ρ *for the radius of convergence of the power series *f *and *ρ*_{a }for the radius of convergence of the power series *f*_{a}. Note that *ρ *≤ *ρ*_{a }< 1.

It is shown in [26] that *ρ = *0.402698 ... and

$$\underset{n\to \infty}{\text{lim}}{n}^{3/2}{\rho}^{n}{t}_{n}=\eta ,$$

where *η = *0.7916032 ... (see [27] for an asymptotic expansion of *t*_{n }that extends this result and [28-31] for reviews of general methods for determining asymptotic numbers of trees of various sorts from a knowledge of the functional equations that their generating functions solve). Since *s*_{n }is *o*(*α*^{-n}) for any 0 <*α *<*ρ*_{a}, it follows that *s*_{n}*/t*_{n }is *o*(*β*^{n}) for any *β > ρ*/*ρ*_{a}, and the proposition will hold if we can show that *ρ *<*ρ*_{a}.

For the sake of completeness and because it serves as a good introduction to the derivation of the functional equation satisfied by the generating function of *s*_{n}, we first derive the well-known functional equation satisfied by the generating function of the *t*_{n}. See the comments after the proof of the lemma for some remarks about the history of the latter generating function.

By decomposing a tree into the two subtrees rooted at the daughters of the root, it is clear that

$$\begin{array}{c}{t}_{n}=\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}{t}_{1}{t}_{n-1}+{t}_{2}{t}_{n-2}++{t}_{m-1}{t}_{m+1},\phantom{\rule{2.77695pt}{0ex}}\phantom{\rule{2.77695pt}{0ex}}\phantom{\rule{2.77695pt}{0ex}}\phantom{\rule{2.77695pt}{0ex}}\text{for}\phantom{\rule{2.77695pt}{0ex}}n=2m+1,\hfill \hfill & \hfill {t}_{n}=\phantom{\rule{1em}{0ex}}{t}_{1}{t}_{n-1}+{t}_{2}{t}_{n-2}++{t}_{m-1}{t}_{m+1}+\left(\begin{array}{c}\hfill \begin{array}{c}\hfill {t}_{m}\hfill \\ \hfill 2\hfill \\ \hfill \hfill \end{array}\hfill \\ \hfill \hfill \end{array}\right)+{t}_{m},\hfill \text{for}\phantom{\rule{2.77695pt}{0ex}}n=2m.\hfill \hfill & \hfill \hfill \end{array}$$

These expressions are equivalent to the statement

$$\sum _{i=1}^{n-1}{t}_{i}{t}_{n-i}=2{t}_{n}-{t}_{n/2}$$

(3)

where *t*_{n/2 }is set to zero if *n *is odd.

From (3) the generating function *f *satisfies the functional equation

$$\begin{array}{llll}\hfill {f}^{2}\left(x\right)& =\sum _{n=2}^{\infty}{x}^{n}\sum _{i=1}^{n-1}{t}_{i}{t}_{n-i}\phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}\\ \hfill & =\sum _{n=2}^{\infty}{x}^{n}\left(2{t}_{n}-{t}_{n/2}\right)\phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}\\ \hfill & =2f\left(x\right)-f\left({x}^{2}\right)-2x.\phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}\\ \hfill & \phantom{\rule{2em}{0ex}}& \hfill \end{array}$$

It will be convenient to consider the function *g*: = 1 - *f*, which satisfies the functional equation

$$g\left({x}^{2}\right)=2x+{g}^{2}\left(x\right).$$

(4)

It is shown in [15] that:

• The radius of convergence *ρ *is strictly positive.

• The functional equation (4) has a unique solution in the whole complex plane, and this solution agrees with our power series in {*x * : |*x*| <*ρ*}.

• If, with a slight abuse of notation, we also denote this solution by *g*, then *g*(*ρ*) = 0.

• The point *ρ *is the only zero of *g *within {*x * : |*x*| < 1}.

It is clear from the power series that *g *is continuous and decreasing on [0, *ρ*) and *g*(0) = 1. Hence *g *is strictly positive on [0, *ρ*).

As observed in [15], these observations suggest a method for computing *ρ. *Put $h\left(x\right)=g\left(x\right)/\sqrt{x}$, so that *h *satisfies *h*(*x*^{2}) = 2 + *h*^{2}(*x*). Set

$${w}_{k}\left(\eta \right):={\left(2+{\eta}^{{2}^{k}}\right)}^{{2}^{-k}},\phantom{\rule{1em}{0ex}}\eta ,$$

and

$${q}_{n}:={w}_{n-1}{w}_{n-2}{w}_{0},$$

so that each function *q*_{n }is strictly increasing on [-2, +∞) and *q*_{1 }≤ *q*_{2 }≤ .... In particular, lim_{n→∞ }*q*_{n}(*y*) exists in {+∞} for each *y * [-2, +∞). Moreover,

$$\underset{n\to \infty}{\text{lim}}{q}_{n}\left({h}^{2}\left(x\right)\right)=\underset{n\to \infty}{\text{lim}}{\left(h\left({x}^{{2}^{n}}\right)\right)}^{{2}^{1-n}}=\underset{n\to \infty}{\text{lim}}\frac{{\left(g\left({x}^{{2}^{n}}\right)\right)}^{{2}^{1-n}}}{x}=\frac{1}{x}$$

for 0 <*x *< 1. Therefore

$$\frac{1}{\rho}=\underset{n\to \infty}{\text{lim}}{q}_{n}\left(0\right).$$

Conveniently, (3) holds with *s*_{n }in place of *t*_{n }for all *n *except for *n *= *a*; in this case one simply adds two to the right hand side of the equation to make up for the fact that *s*_{a }= *t*_{a }- 1. Hence *f*_{a }satisfies the functional equation.

$${f}_{a}^{2}\left(x\right)=2{f}_{a}\left(x\right)-{f}_{a}\left({x}^{2}\right)-2x+2{x}^{a}.$$

(5)

Set *g*_{a}: = 1 - *f*_{a}, so that

$${g}_{a}\left({x}^{2}\right)=2x-2{x}^{a}+{g}_{a}^{2}\left(x\right).$$

(6)

It is clear that *g*_{a }is continuous and decreasing on [0, *ρ*_{a}) and *g*_{a}(0) *= *1. Following the arguments in [15], the functional equation (6) has a unique solution in the whole complex plane, and this solution agrees with our power series in {*x * : |*x*| <*ρ*_{a}}. Moreover, analogues of the other properties of *g *obtained in [15] hold for *g*_{a}.

Set ${h}_{a}\left(x\right)={g}_{a}\left(x\right)/\sqrt{x}$, so that

$${h}_{a}\left({x}^{2}\right)=2-2{x}^{a-1}+{h}_{a}^{2}\left(x\right).$$

Put

$${w}_{k,a,\xi}\left(\eta \right):={\left(2-2{\xi}^{{2}^{k}}+{\eta}^{{2}^{k}}\right)}^{{2}^{-k}},\phantom{\rule{1em}{0ex}}\eta ,$$

and

$${q}_{n,a,\xi}:={w}_{n-1,a,\xi}{w}_{n-2,a,\xi}{w}_{0,a,\xi}.$$

Then

$$\underset{n\to \infty}{\text{lim}}{q}_{n,a,{x}^{a-1}}\left({h}_{a}^{2}\left(x\right)\right)=\underset{n\to \infty}{\text{lim}}{\left({h}_{a}\left({x}^{{2}^{n}}\right)\right)}^{{2}^{1-n}}=\underset{n\to \infty}{\text{lim}}\frac{{\left({g}_{a}\left({x}^{{2}^{n}}\right)\right)}^{{2}^{1-n}}}{x}=\frac{1}{x}$$

for 0 <*x *< 1, and, in particular,

$$\frac{1}{{\rho}_{a}}=\underset{n\to \infty}{\text{lim}}{q}_{n,a,{\rho}_{a}^{a-1}}\left(0\right).$$

Now

$$\begin{array}{ll}\hfill {q}_{n,a,{\rho}_{a}^{a-1}}\left(0\right)& ={w}_{n-1,a,{\rho}_{a}^{a-1}}{w}_{n-2,a,{\rho}_{a}^{a-1}}{w}_{0,a,{\rho}_{a}^{a-1}}\left(0\right)\phantom{\rule{2em}{0ex}}\hfill & \phantom{\rule{2em}{0ex}}\hfill & \le {w}_{n-1}{w}_{n-2}{w}_{1}{w}_{0,a,{\rho}_{a}^{a-1}}\left(0\right)\phantom{\rule{2em}{0ex}}\hfill & \phantom{\rule{2em}{0ex}}\hfill & ={q}_{n}\left(-2{\rho}_{a}^{a-1}\right),\phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}\\ \hfill & \phantom{\rule{2em}{0ex}}& \hfill \end{array}$$

and so

$$\frac{1}{{\rho}_{a}}\le \underset{n\to \infty}{\text{lim}}{q}_{n}\left(-2{\rho}_{a}^{a-1}\right)\le \underset{n\to \infty}{\text{lim}}{q}_{n}\left(0\right)=\frac{1}{\rho}.$$

It therefore suffices to show that the function *y * lim_{n→∞ }*q*_{n}(*y*) is strictly increasing on (-*ε*, +∞) for some 0 <*ε *< 2.

Observe that the derivative of *q*_{n }satisfies

$${q}_{n}^{\prime}=\underset{{w}_{k}^{\prime}}{\overset{{q}_{k}.}{k=1n-1}}$$

For *k *≥ 1,

$$\begin{array}{llll}\hfill {w}_{k}^{\prime}\left(x\right)& ={x}^{{2}^{k}-1}{\left(2+{x}^{{2}^{k}}\right)}^{{2}^{-k}-1}\phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}\\ \hfill & ={\left(2{x}^{-{2}^{k}}+1\right)}^{{2}^{-k}-1},\phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}\\ \hfill & \phantom{\rule{2em}{0ex}}& \hfill \end{array}$$

so that $x{w}_{k}^{\prime}\left(x\right)$ is non-decreasing for *x *> 0. For *y * (-*ε*, +∞),

$${w}_{k}^{\prime}{q}_{k}\left(y\right)\ge {w}_{k}^{\prime}{q}_{1}\left(y\right)\ge {w}_{k}^{\prime}{q}_{1}\left(-\epsilon \right)={w}_{k}^{\prime}\left(2-\epsilon \right)$$

and hence

$$\underset{n\to \infty}{\text{lim}\text{inf}}\underset{y-\epsilon}{\text{inf}}{q}_{n}^{\prime}\left(y\right)\ge \underset{{w}_{k}^{\prime}}{\overset{\left(2-\epsilon \right)}{k=1\infty}}$$

Taking 0 <*ε *< 1, the proof will be completed by demonstrating for any *x *> 1 that

$$\underset{{w}_{k}^{\prime}}{\overset{\left(x\right)}{k=1\infty}}$$

Taking the logarithm gives

$$\begin{array}{llll}\hfill \sum _{k=1}^{{}_{\infty}}\left({2}^{-k}-1\right)\text{log}\left(2{x}^{-{2}^{k}}+1\right)& >-\sum _{k=1}^{\infty}\text{log}\left(2{x}^{-{2}^{k}}+1\right),\phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}\\ \hfill & >-\sum _{k=1}^{\infty}2{x}^{-{2}^{k}},\phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}\\ \hfill & \phantom{\rule{2em}{0ex}}& \hfill \end{array}$$

and this series clearly converges by the ratio test.

In relation to Proposition 1, we note from [13] that the number of rooted strictly bifurcating trees without a given subtree is asymptotically smaller than the number of all graph-theoretic trees (see also [32] for more about the enumeration of general trees without a given subtree), but this is not enough for our purposes. We needed to show that it is asymptotically smaller than the space of all rooted strictly bifurcating trees. The generating function for *t*_{n }seems first to have been investigated in [15] in connection with enumerating "types of arrangements" in a commutative but non-associative algebra, such as *a*_{1}(*a*_{2}(*a*_{3}*a*_{4})) or (*a*_{1}*a*_{2})(*a*_{3}*a*_{4}); these are identical to rooted bifurcating trees in the "Newick" format [1]. The recursion behind the generating function has been re-discovered independently several times such as in [33] - see [34] for a discussion and several further references. We remark that numerically iterating the quantity *q*_{n }of the proof converges quickly to the value of *ρ*^{-1 }calculable by other means [26,35]. We also observe that the methods of [27-31] can be used to show, in the notation of the proof, that $\underset{n\to \infty}{\text{lim}}{n}^{3/2}{\rho}_{a}^{n}{s}_{n}={\eta}_{a}$ for some positive constant *η*_{a }and hence lim_{n→∞}(*ρ*_{a}/*ρ*)^{n}(*s*_{n}/*t*_{n}) = *η*_{a}/*η*, but we don't pursue this matter here.

Proposition 1 says nothing about the rate of convergence of the fraction. Here we investigate this rate using computation. The characteristic polynomials for the generalized Laplacian were calculated via a doubly-recursive algorithm to enumerate matchings. The characteristic polynomials for the distance matrices were calculated via the Leverrier-Faddeev algorithm [36].

Table Table11 shows that the fraction of trees with unique spectra does not go to zero very quickly. We can't compute this fraction for large numbers of leaves, but we can get some idea of the convergence by using the recursion relation corresponding to the generating function (5). Figure Figure22 shows the number of trees that do not have one of two subtrees of size seventeen as a subtree. This is an actual fraction that can be used with Proposition 1 in order to prove Theorem 1 for the generalized Laplacian matrix.

The number of trees, the number of spectra for the generalized Laplacian (GLS), and the number of spectra for the distance matrix (DS).

Figure Figure22 shows that this fraction converges extremely slowly, despite the fact that as shown above it is asymptotically equivalent to *β*^{n }for some 0 <*β *< 1. It is important to note, however, that this fraction is probably a very crude upper bound on the fraction of trees that share a spectrum with another tree. As can be seen in Table Table1,1, the actual number not sharing a spectrum goes down considerably more quickly, though it is probably still the case that the vast majority of trees of intermediate size should have their own spectra.

Spectral invariants of matrix formulations of trees are a natural way to quantify the shape of phylogenetic trees. However, in this paper we show that a complete classification of tree shapes using common spectral invariants of generalized Laplacian and distance matrices is not possible. For either of these choices of matrix we show that the fraction of binary trees with a unique spectrum goes to zero as the number of leaves goes to infinity, but the rate of convergence of the above fraction to zero appears to be slow. For the adjacency and Laplacian matrices, we show that the *a priori *more informative immanantal polynomials have no greater power to distinguish between trees.

The authors declare that they have no competing interests.

FAM conceived of the project, proved the theorems, performed the numerical experiments, and wrote the paper. SNE applied the immanant, proved the theorems, and wrote the paper.

SNE is supported in part by NSF grants DMS-0405778 and DMS-0907630, and part of the research was conducted during a visit to the Pacific Institute for the Mathematical Sciences.

- Felsenstein J. Inferring Phylogenies. Sunderland, MA: Sinauer Press; 2004.
- Mooers A, Heard S. Evolutionary process from phylogenetic tree shape. Q Rev Biol. 1997;72:31–54. doi: 10.1086/419657. [Cross Ref]
- Kirkpatrick M, Slatkin M. Searching for evolutionary patterns in the shape of a phylogenetic tree. Evolution. 1993;47(4):1171–1181. doi: 10.2307/2409983. [Cross Ref]
- Agapow P, Purvis A. Power of eight tree shape statistics to detect nonrandom diversification: A comparison by simulation of two models of cladogenesis. Syst Biol. 2002;51(6):866–872. doi: 10.1080/10635150290102564. [PubMed] [Cross Ref]
- Matsen FA. A geometric approach to tree shape statistics. Systematic biology. 2006;55(4):652–661. doi: 10.1080/10635150600889617. [PubMed] [Cross Ref]
- Matsen FA. Optimization over a class of tree shape statistics. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB) 2007;4(3):506–512. [PubMed]
- Biggs N. Algebraic graph theory. second. Cambridge Mathematical Library, Cambridge: Cambridge University Press; 1993.
- Chung FRK. Spectral graph theory, Volume 92 of CBMS Regional Conference Series in Mathematics. Published for the Conference Board of the Mathematical Sciences, Washington, DC; 1997.
- Semple C, Steel M. Phylogenetics, Volume 24 of Oxford Lecture Series in Mathematics and its Applications. Oxford: Oxford University Press; 2003.
- Gupta A. Embedding tree metrics into low-dimensional Euclidean spaces. Discrete Comput Geom. 2000;24:105–116.
- Matoušek J. Lectures in Discrete Geometry. New York: Springer; 2002.
- Cvetković DM, Doob M, Sachs H. Spectra of graphs. third. Heidelberg: Johann Ambrosius Barth; 1995.
- Schwenk AJ. New Directions in the Theory of Graphs. New York: Acade-meic Press; 1973. Almost all trees are cospectral; pp. 275–307.
- Botti P, Merris R. Almost all trees share a complete set of immanantal polynomials. J Graph Theory. 1993;17(4):467–476. doi: 10.1002/jgt.3190170404. [Cross Ref]
- Wedderburn JHM. The functional equation
*g(x*^{2}) =*2αx*+ [*g*(*x*)]^{2}. Ann of Math (2) 1922;24(2):121–140. doi: 10.2307/1967710. [Cross Ref] - Robinson GdB. Representation theory of the symmetric group. University of Toronto Press, Toronto; 1961. Mathematical Expositions, No. 12.
- Fulton W, Harris J. Representation theory, Volume 129 of Graduate Texts in Mathematics. New York: Springer-Verlag; 1991.
- Simon B. Representations of finite and compact groups, Volume 10 of Graduate Studies in Mathematics. Providence, RI: American Mathematical Society; 1996.
- Sagan BE. The symmetric group, Volume 203 of Graduate Texts in Mathematics. second. New York: Springer-Verlag; 2001.
- Littlewood DE, Richardson AR. Group characters and algebra. Philos Trans Roy Soc London A. 1934;233:99–141. doi: 10.1098/rsta.1934.0015. [Cross Ref]
- Littlewood DE. The Theory of Group Characters and Matrix Representations of Groups. New York: Oxford University Press; 1940.
- Steyaert JM, Flajolet P. Patterns and pattern-matching in trees: an analysis. Inform and Control. 1983;58(1-3):19–58. doi: 10.1016/S0019-9958(83)80056-4. [Cross Ref]
- Graham R, Lovász L. Distance matrix polynomials of trees. Adv Mathematics. 1978;29:60–88. doi: 10.1016/0001-8708(78)90005-1. [Cross Ref]
- Bhamidi S, Evans SN, Sen A. Spectra of large random trees. U.C Berkeley Department of Statistics Technical Report No. 771. 2009. [To appear in
*J. Theoret. Probab*.] - Chailloux E, Manoury P, Pagano B. Sebastopol, CA: O'Reilly 2000; Développement d'applications avec Objective CAML.http://caml.inria.fr/pub/docs/oreilly-book/
- Otter R. The number of trees. Ann of Math (2) 1948;49:583–599. doi: 10.2307/1969046. [Cross Ref]
- Landau BV. An asymptotic expansion for the Wedderburn-Etherington sequence. Mathematika. 1977;24(2):262–265. doi: 10.1112/S0025579300009177. [Cross Ref]
- Harary F, Robinson RW, Schwenk AJ. Twenty-step algorithm for determining the asymptotic number of trees of various species. J Austral Math Soc Ser A. 1975;20(4):483–503. doi: 10.1017/S1446788700016190. [Cross Ref]
- Harary F, Robinson RW, Schwenk AJ. Corrigendum: "Twenty-step algorithm for determining the asymptotic number of trees of various species" [J. Austral. Math. Soc. Ser. A 20 (1975), no. 4, 483-503; MR0406858 (53 #10644)] J Austral Math Soc Ser A. 1986;41(3):325. doi: 10.1017/S1446788700033760. [Cross Ref]
- Bender EA. Asymptotic methods in enumeration. SIAM Rev. 1974;16:485–515. doi: 10.1137/1016082. [Cross Ref]
- Bender EA. Errata: "Asymptotic methods in enumeration" (SIAM Rev. 16 (1974), 485-515) SIAM Rev. 1976;18(2):292. doi: 10.1137/1018045. [Cross Ref]
- Lu T. The enumeration of trees with and without given limbs. Discrete Math. 1996;154(1-3):153–165. doi: 10.1016/0012-365X(95)00041-T. [Cross Ref]
- Etheringtion I. Non-associate powers and a functional equation. Math Gaz. 1937;21:36–39. doi: 10.2307/3605743. [Cross Ref]
- Olds CD, Becker HW. Advanced Problems and Solutions: Solutions: 4277. Amer Math Monthly. 1949;56(10):697–699. doi: 10.2307/2305574. [Cross Ref]
- Harding EF. The probabilities of rooted tree-shapes generated by random bifurcation. Adv Appl Probability. 1971;3:44–77. doi: 10.2307/1426329. [Cross Ref]
- Hou SH. A simple proof of the Leverrier-Faddeev characteristic polynomial algorithm. SIAM Rev. 1998;40(3):706–709. doi: 10.1137/S003614459732076X. [Cross Ref]

Articles from Algorithms for Molecular Biology : AMB are provided here courtesy of **BioMed Central**

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |