PLoS One. 2013; 8(3): e56859.
Published online 2013 March 11. doi:  10.1371/journal.pone.0056859
PMCID: PMC3594297

Edge Principal Components and Squash Clustering: Using the Special Structure of Phylogenetic Placement Data for Sample Comparison

Ahmed Moustafa, Editor

Abstract

Principal components analysis (PCA) and hierarchical clustering are two of the most heavily used techniques for analyzing the differences between nucleic acid sequence samples taken from a given environment. They have led to many insights regarding the structure of microbial communities. We have developed two new complementary methods that leverage how this microbial community data sits on a phylogenetic tree. Edge principal components analysis enables the detection of important differences between samples that contain closely related taxa. Each principal component axis is a collection of signed weights on the edges of the phylogenetic tree, and these weights are easily visualized by a suitable thickening and coloring of the edges. Squash clustering outputs a (rooted) clustering tree in which each internal node corresponds to an appropriate “average” of the original samples at the leaves below the node. Moreover, the length of an edge is a suitably defined distance between the averaged samples associated with the two incident nodes, rather than the less interpretable average of distances produced by UPGMA, the most widely used hierarchical clustering method in this context. We present these methods and illustrate their use with data from the human microbiome.

Introduction

Samples from microbial communities are complex, often containing millions of bacteria that differ to varying degrees. With high-throughput environmental sequencing, one can get a direct estimate of the composition of these microbial populations, even for microbes that cannot be cultured. Such estimates of composition can be too complex to compare directly, and so researchers have developed various ways of comparing populations. One option is to classify the collection of sequencing reads taxonomically, or group the reads into “operational taxonomic units” (OTUs) and then use a discrete comparison index such as the Jaccard index [1] to obtain a distance between samples. A shortcoming of such an approach is that it ignores the degree to which taxonomic labels represent similar or quite different organisms.

In 2005, Lozupone and Knight proposed a phylogenetics-based method to compute distances between samples that takes the natural hierarchical structure of the data into account. Their method, unweighted UniFrac [2], was followed by weighted UniFrac in 2007 [3] to incorporate abundance information. A key feature of both distances is that differences in community structure due to closely related organisms are weighted less heavily than differences arising from distantly related organisms. The UniFrac methodology can powerfully differentiate communities of interest in a variety of settings [4]–[6]; the papers describing the UniFrac variants have hundreds of citations as of the beginning of 2012. We have recently shown that the classical earth-mover's distance (a.k.a. Kantorovich-Rubinstein (KR) metric) [7] generalizes the weighted UniFrac distance.

Once distances have been computed between samples using UniFrac, these distances are typically fed into general-purpose ordination and clustering methods, such as principal coordinates analysis and UPGMA. Although it is appropriate to apply such techniques to distance matrices of this sort, the classical methods do not use the fact that the underlying distances were calculated in a specific manner, namely, on a phylogenetic tree. Consequently, in an application of principal components analysis, it is difficult to describe what the axes represent. Similarly, in hierarchical clustering, it is unclear what is driving a certain agglomeration step; although it can be explained in terms of an arithmetic operation, a certain amount of interpretability in the original phylogenetic setting is lost.

In this paper, we propose ordination and clustering procedures specifically designed for the comparison of microbial sequence samples that do take advantage of the underlying phylogenetic structure of the data. The input for these methods is a collection of assignments of sequencing reads to locations on a "reference" phylogenetic tree: so-called phylogenetic placements. These placements may be obtained by software specialized to do model-based placement [8], [9], by using BLAST on a database built from the leaf sequences, or by clustering the sequences first and then building a tree on representative sequences as is commonly done for UniFrac.

Our edge principal components analysis (edge PCA) algorithm applies the standard principal components construction to a “data matrix” generated from the differences between proportions of phylogenetic placements on either side of each internal edge of the reference phylogenetic tree. Our squash clustering algorithm is hierarchical clustering with a novel way of merging clusters that incorporates information concerning how the data sit on the reference phylogenetic tree.

The results of the analyses can be readily visualized and understood. The principal component axes of edge PCA can be pictured directly in terms of the reference phylogenetic tree, thereby attaching a clear interpretation to the position of a data point along that axis (Fig. 1). Edge PCA is also capable of picking up minor — but consistent — differences in collections of placements between samples: a feature that is important in our example application. The squash hierarchical clustering method is such that each vertex of the clustering tree is associated with a specific distribution of mass on the phylogenetic tree; the length of an edge in the clustering tree has a simple interpretation as the distance between the mass distributions associated with the two incident vertices (Fig. 2).

Figure 1
A graphical representation of the operation of edge principal components analysis (edge PCA).
Figure 2
A visual depiction of the squash clustering algorithm.

Edge PCA provides complementary information to a more traditional application of PCoA or NMDS to a distance matrix derived from UniFrac. Indeed, PCoA/NMDS gives an overall picture of how the biological samples compare in terms of overall similarities and differences, whereas edge PCA selects specific lineages with high variance and compares the samples on that basis. This difference can be seen clearly in our example application.

The work presented here is distinct from recent work on data analysis methods for sets of trees. PCA on sets of trees has been developed in two contexts. Wang and Marron [10] have developed PCA on unlabeled planar trees, while Nye [11] has developed a PCA for phylogenetic trees with branch lengths and leaf labels. Those methods have the trees themselves as underlying objects of study; edge PCA, in contrast, takes vectors of edge weights on a single tree as input.

The work presented here shares some intent with double principal coordinate analysis (DPCoA) as applied to distributions of phylotypes on a phylogenetic tree [12], [13]. The idea of a DPCoA analysis is to perform a principal components analysis on the phylotype abundance table in a way that down-weights differences between species that are close to one another on the phylogenetic tree. As such, it is somewhat similar to doing multidimensional scaling or principal components on the pairwise distance matrix generated by a UniFrac/KR analysis. It differs from the methods presented here because it only uses the tree in the form of a pairwise distance matrix; consequently it cannot leverage the edge-by-edge structure of the tree.

There are also some connections between edge PCA and the statistical comparison features of MEGAN [14] and LEfSe [15] in that the structure of a tree is used as part of a comparative framework. Our method and these methods all highlight regions of the tree for which important differences exist between samples. However, MEGAN and LEfSe work in the setting where one is explicitly trying to find statistically meaningful differences between pre-labeled sets of samples. The edge PCA algorithm, on the other hand, is an exploratory technique that does not attempt to make a hypothesis-testing statistical statement.

The ordination and clustering methods presented here are implemented in the guppy binary as part of the pplacer package, available at http://matsen.fhcrc.org/pplacer/. The methods take the recently-standardized JSON format for phylogenetic placements [16] as input. A tutorial and demonstration applying these methods can be found at http://fhcrc.github.com/microbiome-demo/.

Results

Intuitive presentation of methods

Here we give a simple overview of the two methods presented in this paper. The starting point for the methods is a collection of mappings of sequences onto a phylogenetic tree. This may be done by clustering sequences and building a tree de novo, by assigning sequences to the leaves of the tree using BLAST, or by mapping sequences into edges of the tree using model-based “phylogenetic placement” methods.

Edge principal component analysis

Edge PCA is easily explained in the context of classical PCA, with the usual interpretation of PCA as a method to find a weighted sum of variables that maximizes variance. Edge PCA does a transformation such that the variables of interest are indexed by the edges of the tree, and these variables are then fed into the standard PCA machinery (Fig. 3). The consequent variable weightings can then be visualized on the tree (Fig. 4 and Fig. 5), and the samples can be plotted in the corresponding space (Fig. 6).

Figure 3
How the edge PCA algorithm works.
Figure 4
The first principal component for the combined vaginal data, representing about 56 percent of the variance.
Figure 5
The second principal component for the combined vaginal data, representing about 24 percent of the variance.
Figure 6
Edge principal components analysis (edge PCA) applied to the combined Forney and Fredricks data set and plotted separately.

More intuitively, this process finds edges of the tree across which there is a high level of between-sample heterogeneity. That is, it finds those edges such that there are lots of reads on one side of the edge in a subset of the samples, and lots of reads on the other side of the edge in the complement of that subset. Those edges are then given a signed weight according to how strong this effect is. The sign of an edge considered in isolation is arbitrary, but the relative signs of any two edges indicate the extent of their anti-correlation in the between-sample heterogeneity. For example, if the placement of reads on the root side of one edge is significantly correlated with the placement of reads on the leaf side of another edge, these edges will have different signs. The vector made in this manner, with the magnitudes of entries being determined by the level of between-sample heterogeneity, and the relative signs being determined by (anti-)correlations, is the first principal component vector. The second principal component is built in the same manner but after projecting out the first principal component, and so on.

In our visualization tool, each principal component eigenvector is represented by a single colored and thickened reference tree: the thickness of an edge is proportional to the magnitude of the corresponding entry of the eigenvector and the color specifies the sign of that entry (Fig. 4 and Fig. 5). For the trees shown here, orange signifies a positive entry, while green represents a negative entry.

Then, the projection of a given sample onto the plane is determined by the distribution of reads in the sample relative to the weighted edges. Specifically, a read on the leaf side of an edge with a positive weight will move the sample in the positive direction along that principal component, while a read on the root side will move it in a negative direction (Fig. 1). For edges with a negative weight the situation is reversed.

This behavior is achieved by a simple transformation of the data before applying the classical PCA machinery (Fig. 3). The first step is to build one vector per sample, indexed by the edges of the tree and filled with the "imbalance" between the fractions of reads on either side of each edge. This imbalance is defined, for a given edge $e$ and sample $s$, by cutting the tree in two by removing $e$ (and any associated placements), then taking the fraction of reads of $s$ in the part of the tree containing the root minus the fraction in the part of the tree not containing the root. Edge PCA is then simply standard principal components analysis applied to the samples-by-edges data matrix created in this way. Namely, we construct the edge-by-edge covariance matrix of this data matrix and then calculate its eigenvalues and their corresponding eigenvectors. Each eigenvector can be displayed on the tree, because the coordinates of the eigenvector correspond to internal edges of the tree. A large entry in an eigenvector corresponding to one of the bigger eigenvalues identifies an edge across which there is substantial heterogeneity among the associated set of mass differences (see Methods). Moreover, we can project each sample onto an eigenvector to visualize how the sample is spread out with respect to that "axis" (Fig. 1 and Fig. 6).
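
The transformation can be sketched in a few lines. The following Python/NumPy code is an illustrative sketch only (it is our own, not the guppy implementation); it assumes that each sample has already been summarized by the fraction of its reads on the non-root side of every internal edge, ignores reads placed on the edge itself, and uses function and variable names of our own choosing.

    import numpy as np

    def edge_pca(distal_mass, n_components=2):
        """Sketch of edge PCA.

        distal_mass: (n_samples, n_edges) array; entry [s, e] is the fraction
        of sample s's reads on the non-root side of internal edge e.
        Returns (eigenvalues, eigenvectors, sample projections).
        """
        # Imbalance across each edge: root-side fraction minus non-root-side fraction.
        imbalance = (1.0 - distal_mass) - distal_mass        # shape (S, E)

        # Standard PCA machinery on the samples-by-edges matrix.
        centered = imbalance - imbalance.mean(axis=0)
        cov = np.cov(centered, rowvar=False)                 # E x E covariance matrix
        eigval, eigvec = np.linalg.eigh(cov)                 # ascending order
        order = np.argsort(eigval)[::-1][:n_components]
        eigval, eigvec = eigval[order], eigvec[:, order]     # columns = principal axes

        # Project each sample onto the principal axes for plotting (cf. Fig. 6).
        projections = centered @ eigvec
        return eigval, eigvec, projections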

A significant emphasis of the edge PCA methodology is to obtain clearly interpretable axes for projection, and this is easiest when the eigenvectors have distinct sets of nonzero entries. When that is the case, a read in a certain region of the tree will move the corresponding sample point in one direction only. The support of a vector is the set of its nonzero indices; thus, the degree to which nonzero entries of principal components appear on shared edges will be called support overlap. We describe two means of support overlap minimization: one is rotating the principal component axes in the plane that they span, and the other is an explicit penalization scheme.

The rotation support overlap minimization simply rotates the principal components in the space that they span. For example, the rotation of the two principal components $(v_1, v_2)$ by an angle $\theta$ is $(\cos\theta\, v_1 + \sin\theta\, v_2,\ -\sin\theta\, v_1 + \cos\theta\, v_2)$. This rotation can greatly decrease the overlap of the support vectors; for example, the pair of vectors $(1, 1)$ and $(1, -1)$, when rotated by $45^\circ$, become $(\sqrt{2}, 0)$ and $(0, -\sqrt{2})$, which have disjoint supports. By rotating in the space spanned by the principal component eigenvectors, the projections of the points in the principal component space are correspondingly rotated, thus preserving the relative positions of the points in the principal component space. Although it preserves their relative positions, it does lose the original meaning of the principal component vectors: for example, the first dimension in this rotated space is no longer the component of maximal variance, although the proportion of the total variance in the subspace spanned by the first $k$ vectors is unchanged. Nevertheless, we have found this rotation to be useful for finding structure in edge PCA applications.

The second approach is to explicitly penalize the overlap between the second eigenvector and the first by subtracting out a measure of their overlap. As described in the Methods section, the second "penalized component" is the vector $w$ chosen to maximize $w^\top G w$ minus $\theta$ times a measure of the overlap between $w$ and the first eigenvector, for some positive penalty $\theta$, where $G$ is the covariance matrix. However, we have not had tidy results using this explicit penalization, possibly because the first principal component is fixed and the second is then modified to avoid overlap with the first.

Squash clustering

Squash clustering is a type of hierarchical clustering that also uses the structure of the data to visualize what is happening with the clustering in more detail than is possible using a distance matrix only. The starting point is, as before, a collection of reads placed on a phylogenetic tree. Such a collection may be thought of as a distribution of a unit amount of mass across the tree. In the simplest setting, for a collection of $n$ placements on a tree, each read is given mass $1/n$; that mass is assigned to the "best" position for that read on the tree. Another option is to distribute the $1/n$ mass for a given read across the tree in proportion to the posterior probability of assignment of that read to various positions (see Methods).

This mass distribution may be used to produce distances between collections of phylogenetic placements. Given two samples for a given locus, each sample is placed individually on the phylogenetic tree, and so each sample is thought of as a distribution of mass on the tree. The Kantorovich-Rubinstein (KR) or "earth-mover's" distance may then be used to quantify the difference between those two samples. This distance is defined rigorously in [7], but the idea is simple to explain. Imagine that the phylogenetic tree is a road network and that each sample is represented by the distribution of a unit of mass into piles of dirt along this road network. The distance between two samples is then defined to be the minimal amount of "work" required to move the dirt in the first configuration to that in the second configuration (in this context the amount of work needed to move an infinitesimal mass $dm$ a distance $d$ is defined to be $d \cdot dm$). Thus, similar collections of phylogenetic placements result in similar dirt pile configurations that don't require much mass movement to transform one into the other, while quite different collections of placements require that significant amounts of mass must move long distances across the tree. This distance is classical, having roots in 18th century mathematics, and is a generalization of the weighted UniFrac distance [3], [7].
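
On a tree, this minimal-work computation has a simple closed form when each sample's mass sits at nodes of the tree: every unit of mass that ends up on the other side of an edge must be carried across that edge. The sketch below (our own illustration, not the guppy implementation) assumes each sample has been summarized by the fraction of its mass on the non-root side of every edge.

    import numpy as np

    def kr_distance(distal_p, distal_q, edge_lengths):
        """Sketch of the Kantorovich-Rubinstein (earth-mover's) distance on a tree.

        distal_p, distal_q: for each edge, the fraction of each sample's mass on
        the non-root side of that edge (mass assumed concentrated at nodes, so
        the subtree mass is constant along an edge).
        edge_lengths: branch length of each edge.

        Transforming configuration p into configuration q requires shuttling
        |distal_p - distal_q| units of mass across each edge, and each unit
        travels that edge's length, so the total work is the sum below.
        """
        return float(np.sum(edge_lengths * np.abs(distal_p - distal_q)))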

Squash clustering is hierarchical clustering using the KR distance but with a different way of using merged clusters: rather than taking averages of distances as is done in average-linkage clustering (also known as UPGMA), in squash clustering one takes distances between averages of samples. That is, given a collection of mass distributions on the reference phylogenetic tree, each of which correspond to a cluster that has been built at some stage of the procedure, when the procedure merges two clusters one simply takes a weighted average of the two corresponding mass distributions to get the mass distribution that corresponds to the new, larger cluster (Fig. 2). The “squash” terminology describes this averaging procedure: the original mass distributions for a given cluster are stacked on top of one another and then “squashed” down to produce a new object with unit total mass.

Every internal vertex of the clustering tree is associated with a distribution of mass on the phylogenetic tree, i.e. the squashed mass for the samples below that vertex. The length of an edge between two arbitrary adjacent vertices on the tree can be computed by using the KR distance between the distributions of mass corresponding to those vertices. This edge length calculation gives the resulting trees an appearance that differs from that of UPGMA trees because the lengths of the paths from the root to the various leaves are no longer all the same (i.e. the tree is typically not ultrametric).

The results of a squash clustering procedure are more transparent than the equivalent runs of other distance-based clustering procedures. Because of the merging process, each step of squash clustering operates on exactly the same type of mathematical object: a mass distribution on a phylogenetic tree. These mass distributions can be visualized, revealing the similarities that are driving a particular clustering step (Fig. 2).

In contrast, for UPGMA or other distance-based hierarchical clustering techniques, the internal nodes are represented by fundamentally different sorts of objects than the leaves. The internal nodes for the classical methods are represented by an agglomeration of points, and hierarchical clustering variants all have different ways of using the collection of between-point distances to compute distances between agglomerations of points. Consequently, it is not possible to find a manifestation of an internal node (like the equivalent of one of the mass distributions in Figure 2) where the distances to that manifestation are the distances used to create the clustering tree.

These internal node visualizations are automatically generated by the software implementation of the squash clustering algorithm. An example application of both edge PCA and squash clustering can be found in our tutorial at http://fhcrc.github.com/microbiome-demo/.

Example application: the vaginal microbiome

In this section we apply our clustering and ordination methods to pyrosequencing data from the vaginal microbiome. The "Fredricks" data set consists of sequence information from swabs taken from 242 women at the Public Health, Seattle and King County Sexually Transmitted Diseases Clinic between September 2006 and June 2010, of which 222 samples yielded enough material to analyze [17] (Sequence Read Archive submission SRA051298). DNA was extracted and the 16S gene was amplified in the V3–V4 hypervariable region using broad-range primers and sequenced using a 454 sequencer with FLX chemistry. Sequences were pre-processed using the R/Bioconductor [18], [19] package microbiome. The "Forney" data set is an analogous data set of 454 reads from the V1–V2 hypervariable region amplified from vaginal swabs [20]. These sequences were downloaded as Sequence Read Archive submission SRA022855. The stability of reads from different regions of the same gene is the subject of a manuscript under preparation.

A custom maximum likelihood reference tree consisting of relevant sequences from RDP [21] and a local collection was built using RAxML 7.2.7 [22] with the GTR+4Γ model as described in [17]. Sequences were aligned with Infernal v1.0.2 [23], and placed into this tree using pplacer [9] with the default parameter choices along with the -p and --inform-prior options.

The principal components for the vaginal samples independently recover previous knowledge about the contribution of certain microbial species to distinct types of vaginal microbial environment. A microscopic examination of Gram-stained specimens from the vaginal mucosa can be used to define a diagnostic criterion called the Nugent score. The Nugent score ranges from 0 to 10, with a high number indicating bacterial vaginosis (BV). The scoring criteria include a relative paucity of gram-positive rods described as Lactobacillus morphotypes, an abundance of bacteria resembling Gardnerella and Bacteroides species (small gram-variable and gram-negative rods, respectively), and an abundance of curved gram-negative rods [24]. The edge principal component algorithm appears to both agree with and extend these microscopic criteria: the first principal component for the vaginal data set identifies a negative association between Lactobacillus species and species belonging to Sneathia and Prevotella (both gram-negative rods) and Megasphaera (gram-negative cocci) (Fig. 4). Both Prevotella and Megasphaera have been independently identified as prevalent members of the vaginal microbiome, and are associated with a clinical diagnosis of BV [20], [25]. The second principal component reveals that important differences between samples exist at the species level. Indeed, it highlights the substantial heterogeneity in the observed amounts of two Lactobacillus species: L. iners and L. crispatus (Fig. 5). This latter observation is interesting from the medical perspective, as the Nugent criteria attribute the same significance to all Lactobacillus morphotypes regardless of species. In the context of the edge PCA analysis, however, a distinction is made between the two Lactobacillus species based on the population distribution of other organisms in the sample.

The samples from the two studies form a revealing pattern when plotted on these axes along with the corresponding Nugent score (Fig. 6). As described above, samples on the left side have Sneathia and Prevotella and lack Lactobacillus while those on the right side have the opposite. Samples on the bottom have lots of L. iners and a small amount of L. crispatus, while those on the top have the opposite.

Lactobacillus is associated with a low Nugent score and thus a negative BV diagnosis; in the results presented here, L. crispatus-dominated samples are not found to have a high Nugent score (indicating BV), while L. iners-dominated samples sometimes are. In both the Forney and the Fredricks data sets, the samples with the highest Nugent score lie on a continuum of samples from the left to the lower right (from Sneathia/Prevotella-dominant to L. iners-dominant). A similar pattern is observed when the samples are divided by race (Fig. S2). Reviewing the taxonomically classified data from the Fredricks study confirms this trend. These plots indicate the possibility of a medically relevant difference between these two Lactobacillus species in a pattern that is consistent between two large, independent studies. It is also significant that phylogenetic placement on a reference tree containing full-length 16S rRNA gene sequences allows a direct comparison between the two data sets despite the fact that each sequenced a different region of the 16S rRNA gene. We emphasize that the PCA was not informed of either the Nugent score associated with the specimens or the taxonomic classifications of the sequences.

Principal coordinates analysis (PCoA) and multidimensional scaling (MDS) form a complementary set of techniques to edge principal components. PCoA applied in this context (Fig. 7) demonstrates two important facts about the vaginal specimens (a similar picture results from MDS; results not shown). First, it is clear that the BV-negative (small Nugent score) specimens are very similar to one another in composition, and that the BV-positive (high Nugent score) specimens are different from one another. This information is not recovered by the edge PCA analysis, which instead finds interesting structure within the BV-negative specimens. This example emphasizes the complementary nature of edge PCA and these more classical methods, where the former gives specific information about the changes of the relative proportions of phylogenetic groups, whereas the latter gives a comparison of the overall composition.

Figure 7
Principal coordinates analysis applied to the Fredricks vaginal data set.

Squash clustering was applied to the collection of vaginal samples in the Fredricks data set. As we have already remarked, because meaningful internal edge lengths can be assigned to the squash clustering tree, it is not ultrametric, whereas the UPGMA tree is (Fig. 8). The two tight clusters at the bottom of (a) and (b) contain the Lactobacillus-dominated vaginal samples seen on the right side of Fig. 6 and correspond to L. iners (upper tight cluster) and L. crispatus (lower tight cluster). A more detailed leaf-labeled comparison between the two trees is available in the supplementary material (Fig. S1).

Figure 8
The results of (a) squash clustering and (b) UPGMA as applied to the vaginal data.

Squash clustering simulation study

It is difficult to find a collection of microbial communities with a known hierarchical structure; thus, simulation was used to validate the effectiveness of the squash clustering methodology. The simulation process is described in detail in the Methods section, but we highlight several important points here. The primary ingredients for the simulation are a fixed "clustering tree" representing the hierarchical relationship between a set of communities and a "reference tree" of species as above. The simulation generates artificial collections of placements on the reference tree for each leaf of the clustering tree. The success of the clustering algorithms is judged by comparing the original clustering tree to the result of the clustering method applied to the artificial collections of placements. This accuracy comparison is done using the rooted Robinson-Foulds (RF) metric (Methods).

A number of parameters determine the steps in the simulation process. Every internal node of the clustering tree is associated with a “reconstructability” parameter; this parameter determines the level of similarity between descendants of that internal node. In this simulation, the reconstructability parameter is set to a single value for all internal nodes of the tree.

Our simulations show that squash clustering and UPGMA applied to KR distances perform similarly across a wide range of simulation parameters (Fig. 9). Not only do the squash clustering and UPGMA methods have similar levels of accuracy, but their results are also topologically quite similar to one another. Thus squash clustering, with its more transparently meaningful branch lengths, may prove to be an attractive choice for researchers wishing to find hierarchical structure in their data.

Figure 9
The results of the cluster accuracy simulation experiment using the rooted Robinson-Foulds (RF) metric.

Discussion

Conclusions

Direct nucleic acid sequencing from environments – ranging from the human body to acid mine drainages – has revolutionized our understanding of the microbial world. In parallel, computational techniques have made great leaps forward in their capacity to classify reads taxonomically [26] and map them onto phylogenetic trees [8], [9]. There has also been a considerable amount of work on useful ways to derive distances between samples [2], [3], [27].

In our paper we have established a new method, “edge principal components analysis” (edge PCA), that associates the principal components axes with signed weightings on the edges of a phylogenetic tree of the species under consideration. By using colors and thickness to visualize these weightings, the user can gain an understanding of what phylogenetic factors drive the separation of the samples. Because the comparison is done in an explicitly phylogenetic context, edge PCA can pick up consistent differences between samples that are subtle from a distance-based standpoint but are readily apparent from the richer tree-based one.

We have also developed a variant of UPGMA, "squash clustering", that enables visualization of the internal nodes of clustering trees. Because the clustering is done directly on the type of mathematical object that is being visualized, one gains insight into what is driving a particular clustering step.

In this paper we describe these methods and demonstrate their practical effectiveness via an application to vaginal microbiome data. We present simulation results demonstrating the effectiveness of the squash clustering technique in recovering hierarchical structure. In the Methods section, we explain the methods more formally, offer theory connecting these new techniques to each other, and show consistency of squash clustering in a simple setting.

In future work we will apply the basic step of the edge principal components method — transforming phylogenetic placement samples into vectors indexed by the edges of the tree — in other contexts. In this paper, we followed this transformation with an application of principal components analysis, but many other options are possible. Our next step will be to apply classical supervised learning techniques to similarly transformed data.

Generalization and limitations

The methods described here, although implemented for comparison of microbial communities, may in fact be used in more general settings. Edge PCA may be used whenever each sample can be represented by a collection of mass distributed over a common tree structure. Squash clustering may be applied in any case where there is a well-defined notion of the distance between two samples and a well-defined procedure for averaging two samples to produce another object of the same type.

There are some limitations to the sort of comparisons that can be performed using these methods simply because the underlying data is a collection of phylogenetic placements on a tree. For example, if a clade of the reference tree is missing, then differences in read distribution within that clade will not be accounted for in the comparison. Such issues will be present whenever a reference tree is being used, whether using phylogenetic placements directly or mapping reads to the tree using BLAST as a preliminary step in a UniFrac analysis. This disadvantage is balanced by the advantage of not having to define operational taxonomic units (OTUs) by clustering, which can be sensitive to methodological parameters [28].

We also note that the algorithm is influenced by the level of taxon sampling in various regions of the reference tree, in such a way that more highly sampled lineages will be assigned comparatively more weight in the PCA analysis than less sampled lineages. This is simply because increasing the level of sampling that produced the reference tree in some region can turn a single edge into multiple edges, and the difference in mass assigned to the single variable that corresponded to the "old" edge is now replicated for each of the variables that correspond to the "new" edges. It can be seen from the variational characterization of the corresponding eigen-problem (see Methods) that the sum of the magnitudes of the eigenvector components corresponding to the "new" edges will typically be greater than the magnitude of the eigenvector component corresponding to the "old" edge. This does not change the interpretation of the location of points relative to the weightings on a tree; however, it does mean that highly sampled lineages may have a disproportionate influence on the construction of the principal components. We are currently developing an alternate formulation that uses mass differences on either side of each point in the reference tree, in a manner analogous to the way edge PCA uses mass differences on either side of each edge. The new formulation does not treat all edges as being on an equal footing; rather, it implicitly incorporates information about edge lengths. This "length PCA" procedure will therefore not be perturbed by taxon sampling levels in the same way.

The methods presented here also depend on the number of phylogenetic placements being correlated with the number of organisms of that type found in the sample. This is not always true. Loci such as 16S are often sequenced by first amplifying using a polymerase chain reaction with a broad-spectrum primer; this primer may have different efficiencies for different organisms, or may miss certain organisms altogether. In addition, genetic material extraction efficiency varies by organism [29]. Nevertheless, the results on this example data using our methods do correspond with analyses made with non-genetic methods such as morphological comparison (Fig. 6).

Methods

General setting for methods

Phylogenetic placement is a way to analyze the results from high-throughput sequencing applied to DNA extracted in bulk from an environmental sample of microbes. It is simply the assignment of sequencing reads to a “reference” phylogenetic tree constructed from previously-characterized DNA sequences; recent algorithms have focused on doing so according to the phylogenetic maximum-likelihood criterion [8], [9]. By fixing a reference tree rather than attempting to build a phylogenetic tree for the sample from scratch, recent algorithms of this type are able to place tens of thousands of query sequences per hour per processor on a reference tree of one thousand taxa (e.g. species), with performance scaling linearly in the number of reference taxa, the number of query sequences, and the length of the query sequences.

A probability measure on the reference phylogenetic tree is obtained from a collection of sequence reads as follows. A given read can be assigned to the phylogenetic tree in its maximum likelihood or maximum posterior probability location using the phylogenetic likelihood criterion to obtain a “point placement.” A point placement can be thought of as a probability measure with all of the mass concentrated at the best attachment location. Alternatively, one can express uncertainty in the optimal location by spreading the probability mass according to posterior probability (assuming some priors) or “likelihood weight ratio”; see [9] for details. In either case, each read is thought of as a probability measure on the reference phylogenetic tree. A probability measure for a collection of reads can be obtained by averaging the measures for each read individually (that is, by constructing the probability measure that is the mixture of the probability measures for each read in which each such measure is given an equal weight).
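
The averaging step can be illustrated with a small sketch. The data layout below (a dict per read mapping edge identifiers to placement probabilities) is our own simplification for illustration, not the standardized JSON placement format; the function name is hypothetical.

    import numpy as np
    from collections import defaultdict

    def sample_mass_by_edge(read_placements):
        """Sketch: combine per-read placement probabilities into one mass
        distribution over reference-tree edges.

        read_placements: list of dicts, one per read, mapping an edge id to the
        probability (e.g. posterior probability or likelihood weight ratio)
        that the read attaches there; each dict sums to 1.
        Each read contributes equal weight 1/n, so the result sums to 1.
        """
        n = len(read_placements)
        mass = defaultdict(float)
        for placement in read_placements:
            for edge, prob in placement.items():
                mass[edge] += prob / n
        return dict(mass)

    # A point placement is the special case of all mass on the best edge:
    example = sample_mass_by_edge([{"edge_7": 1.0}, {"edge_7": 0.6, "edge_8": 0.4}])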

Edge principal components analysis

Begin with a phylogenetic tree $T$ and probability measures $P_1, \ldots, P_S$ on $T$, each of which comes from an assignment of the reads in one of $S$ samples to the phylogenetic tree, as described above. If $T$ is not already rooted at some vertex, pick an arbitrary vertex to be the root. Removing a given internal edge $e$ from the tree splits $T$ into two components: $T_0(e)$ containing the root and $T_1(e)$ without. For a probability measure $P$ on $T$, define the corresponding edge mass difference

$$\delta_P(e) = P(T_0(e)) - P(T_1(e)).$$

Suppose that $T$ has $E$ internal edges. The edge mass difference matrix $X$ is the $S \times E$ matrix that has the vectors of edge mass differences for the successive samples as its rows. Edge principal components analysis is then performed by first deriving the $E \times E$ covariance matrix $G$ from the matrix $X$ of "observations", followed by computing the $E$ eigenvectors of $G$ ordered by decreasing size of eigenvalue (Fig. 3).

Each resulting eigenvector is then a signed weighting on the internal edges of the tree, and these weightings may be used to highlight those edges of the tree for which there is substantial between-sample heterogeneity in the masses assigned to the two components of the tree defined by the edge. Indeed, recall the variational characterization of the eigenvectors $v_1, v_2, \ldots, v_E$ of an $E \times E$ non-negative definite matrix $G$, listed in order of decreasing eigenvalue:

$$v_k = \operatorname*{argmax} \left\{ \langle w, G w \rangle : \lVert w \rVert = 1,\; w \perp v_1, \ldots, v_{k-1} \right\},$$

where $\lVert w \rVert$ is the usual Euclidean length of the vector $w$, $\langle u, w \rangle$ is the usual Euclidean inner product of the vectors $u$ and $w$, and $w \perp v_1, \ldots, v_{k-1}$ indicates that $w$ is perpendicular to each of the vectors $v_1, \ldots, v_{k-1}$. Thus, an edge that receives a weight with large magnitude from an eigenvector corresponding to one of the bigger eigenvalues of the covariance matrix $G$ may be viewed as an edge across which there are substantial dissimilarities between samples in the amount of mass placed in the components on either side of the edge.
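
As a quick numerical illustration of this characterization (a toy check in Python/NumPy, not part of the analysis pipeline), no unit vector explains more variance than the eigenvector with the largest eigenvalue:

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.normal(size=(6, 6))
    G = A @ A.T                       # a non-negative definite matrix

    eigval, eigvec = np.linalg.eigh(G)
    top = eigvec[:, -1]               # eigenvector with the largest eigenvalue

    # No random unit vector yields a larger quadratic form than the top eigenvector.
    w = rng.normal(size=(1000, 6))
    w /= np.linalg.norm(w, axis=1, keepdims=True)
    assert (np.einsum('ij,jk,ik->i', w, G, w) <= eigval[-1] + 1e-9).all()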

In our visualization tool, each eigenvector is represented by a single colored and thickened reference tree: the thickness of an edge is proportional to the magnitude of the corresponding entry of the eigenvector and the color specifies the sign of that entry (Fig. 4 and Fig. 5). For the trees shown here, orange signifies a positive entry, while green represents a negative entry. Moreover, we can project each sample onto an eigenvector to visualize how the sample is spread out with respect to that “axis” (Fig. 6).

When considering the weight assigned to a single edge in isolation, only the magnitude of the weight matters and not the sign, because if $v$ is an eigenvector for a particular eigenvalue, then so is $-v$. However, sign matters when comparing the weights assigned to two or more edges: if the edge mass differences for two edges are strongly negatively associated, then the corresponding entry of the covariance matrix will be very negative, and the corresponding two entries of the eigenvector for a large eigenvalue will have different signs.

Changing the chosen root from vertex $r$ to vertex $r'$ does not affect the eigenvalues or the magnitudes of the entries in the corresponding eigenvectors, and it only changes the signs of the entries for the edges between $r$ and $r'$. This may be seen as follows. Note first that if an edge $e$ is between $r$ and $r'$, then re-rooting flips the sign of $\delta_P(e)$, whereas $\delta_P(e)$ remains the same if $e$ is not between $r$ and $r'$. Define $D$ to be the diagonal $E \times E$ matrix such that $D_{ee} = -1$ for edges $e$ on the path between $r$ and $r'$, and $D_{ee} = +1$ otherwise. Note that $D^2 = I$. The covariance matrix $G'$ for the re-rooted tree and the covariance matrix $G$ for the original tree are related by a similarity transformation: $G' = D G D$. Thus, the eigenvalues for $G'$ are the same as those for $G$, and $v$ is an eigenvector of $G$ if and only if $D v$ is an eigenvector of $G'$.
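
A minimal numeric check of this invariance, with an arbitrary toy data matrix and an arbitrary choice of edges on the re-rooting path (our own illustration):

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy samples-by-edges matrix of edge mass differences (5 samples, 4 edges).
    X = rng.uniform(-1.0, 1.0, size=(5, 4))
    G = np.cov(X, rowvar=False)

    # Re-rooting flips the sign of the mass difference for edges on the path
    # between the old and new root; here, say, edges 1 and 2.
    D = np.diag([1.0, -1.0, -1.0, 1.0])
    G_rerooted = D @ G @ D

    eigval, eigvec = np.linalg.eigh(G)
    eigval_r, eigvec_r = np.linalg.eigh(G_rerooted)

    assert np.allclose(eigval, eigval_r)                 # same eigenvalues
    for k in range(4):
        v = eigvec[:, k]
        # D @ v is an eigenvector of the re-rooted covariance matrix.
        assert np.allclose(G_rerooted @ (D @ v), eigval[k] * (D @ v))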

As with classical principal components analysis, the question arises of choosing an appropriate number $k$ such that the eigenvectors corresponding to the $k$ largest eigenvalues represent "signal" in the data, whereas the remaining $E - k$ eigenvectors represent "noise". That is, one wishes to choose $k$ such that the projection of the data onto the subspace spanned by the first $k$ eigenvectors is a reasonably faithful lower-dimensional summary of the data that does not miss important features. There is no clear-cut, "one-size-fits-all" solution to this problem. The usual approach is to first construct a scree plot that depicts, for each $k$, the proportion of the total variance explained by the $k$th eigenvector (that is, $\lambda_k / \sum_j \lambda_j$, where $\lambda_1 \geq \lambda_2 \geq \cdots$ are the eigenvalues) and the proportion of the total variance explained by the first $k$ eigenvectors (that is, $\sum_{j \leq k} \lambda_j / \sum_j \lambda_j$). One then chooses $k$ so that there is a substantial jump from the proportion of variance explained by the $(k+1)$st eigenvector to the proportion explained by the $k$th and, moreover, so that the proportion of variance explained by the first $k$ eigenvectors is close to $1$. Also, a wish to represent the data graphically by plotting the projection onto the subspace spanned by the first $k$ eigenvectors makes a choice of $k \leq 3$ desirable if it is reasonable in terms of the above criteria.
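
The quantities that go into such a scree plot are straightforward to compute; the following short sketch (illustrative only, with hypothetical names and toy eigenvalues) returns both the per-axis and cumulative proportions of variance:

    import numpy as np

    def scree(eigenvalues):
        """Proportions of variance for a scree plot.

        Returns, for each k, the proportion explained by the k-th eigenvector
        and the cumulative proportion explained by the first k eigenvectors.
        """
        lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
        per_axis = lam / lam.sum()
        cumulative = np.cumsum(per_axis)
        return per_axis, cumulative

    # Toy example: a sharp drop after the second eigenvalue suggests keeping k = 2.
    per_axis, cumulative = scree([5.6, 2.4, 0.4, 0.3, 0.2, 0.1])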

We now shift our attention to support overlap minimization. We will measure the overlap of two vectors $u$ and $v$ in two ways: either in a smooth sense, by considering $\sum_i \lvert u_i v_i \rvert$, or in a combinatorial sense, by counting the number of indices $i$ at which both $u_i$ and $v_i$ are nonzero. Either of these can be extended to define an overlap of a collection of vectors by considering the sum of their pairwise overlaps.

The rotation idea is simple: rotate the eigenvectors in the space that they span. Specifically, assume that we want to apply this process to the first $k$ eigenvectors; let $V$ be the matrix with the first $k$ eigenvectors as columns. Such a rotation can be obtained by multiplying $V$ on the right by an arbitrary $k \times k$ rotation matrix $R$; the columns of the resulting matrix are the rotated eigenvectors. The rotation Support Overlap Minimization (SOM) process finds the rotation matrix $R$ that minimizes the overlap function applied to the columns of $V R$.
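
For two components this optimization reduces to a one-dimensional search over the rotation angle. The sketch below is our own illustration, not the guppy implementation; in particular, the choice of the smooth overlap function as the quantity being minimized is an assumption.

    import numpy as np

    def soft_overlap(u, v):
        # A smooth stand-in for support overlap: sum_i |u_i v_i|.
        return float(np.sum(np.abs(u * v)))

    def rotate_pair(v1, v2, n_angles=3600):
        """Grid search over planar rotations of two principal components,
        returning the rotated pair with the smallest (soft) support overlap."""
        best = None
        for theta in np.linspace(0.0, np.pi, n_angles, endpoint=False):
            c, s = np.cos(theta), np.sin(theta)
            a, b = c * v1 + s * v2, -s * v1 + c * v2
            score = soft_overlap(a, b)
            if best is None or score < best[0]:
                best = (score, a, b)
        return best[1], best[2]

    # Toy example: the fully overlapping pair (1, 1) and (1, -1) rotates to
    # vectors supported on disjoint coordinates.
    u, w = rotate_pair(np.array([1.0, 1.0]), np.array([1.0, -1.0]))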

One disadvantage of the rotation process is that the axes lose their inherent meaning; for example, the first dimension is no longer the axis of maximal variance. An alternative means of minimizing support overlap is to explicitly penalize the overlap with the first eigenvector. For the second component, this can be done by taking the highest-eigenvalue eigenvector of a version of the covariance matrix from which a positive multiple $\theta$ of the overlap with the first eigenvector has been subtracted, using the projection $P$ onto the orthogonal complement of the span of the first eigenvector; this eigenvector can be obtained by power iteration. We have not been as successful with this approach as with the rotation described above; in the examples we have tried, a large value of $\theta$ is needed to see a significant decrease in overlap, but that leads to an excessive distortion of the principal component vectors.

Squash clustering

Squash clustering is a type of hierarchical clustering using the earth-movers, or Kantorovich-Rubinstein (KR) distance described above. The key difference with other types of hierarchical clustering happens when merging two clusters: we simply take a weighted average of the two corresponding mass distributions to get the mass distribution that corresponds to the new, larger cluster (Fig. 2).

Agglomerative hierarchical clustering in general proceeds by iterating the following sequence of steps until there is a single cluster and a corresponding $1 \times 1$ pairwise distance matrix.

  1. Find the smallest off-diagonal element in the current pairwise distance matrix. Say it is the distance between clusters $i$ and $j$.
  2. Merge the $i$ and $j$ clusters, making a cluster $k$.
  3. Remove the $i$th and $j$th rows and columns from the distance matrix.
  4. Calculate the distance from the cluster $k$ to the other clusters.
  5. Insert the distances from $k$ into the distance matrix.

Classical hierarchical clustering methods calculate the distance in step 4 as some function of the entries of the distance matrix. In particular, average-linkage clustering, or UPGMA, calculates the distance between two clusters as the average of the distances between pairs of items, one drawn from each cluster.

Squash clustering takes the average of the mass distributions and then computes KR distances from the merged cluster to the other clusters. That is, if we merge two clusters that correspond to sets of m and n original mass distributions and are represented by averaged mass distributions P and Q, then the new cluster is represented by the mass distribution

\frac{m}{m+n} P + \frac{n}{m+n} Q.

Because these merged clusters are simply mass distributions, one can calculate KR distances as usual for the next stage of clustering. The series of merges in the clustering algorithm determines the topology of the rooted clustering tree that the algorithm produces. Leaves of the tree correspond to individual samples.
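As a concrete illustration, a minimal Python sketch of the agglomeration loop with the squash merge rule follows. It assumes each sample has already been reduced to a vector of per-edge masses on a fixed reference tree, and it uses the p = 1 KR distance in that representation; the function and variable names are illustrative and are not taken from the guppy implementation.

    import numpy as np

    def squash_cluster(sample_masses, branch_lengths):
        """Squash clustering of samples given as per-edge mass vectors.

        sample_masses  : list of 1-D arrays, one per sample, giving the mass on
                         the root-distal side of each edge of the reference tree
        branch_lengths : 1-D array of edge lengths of the reference tree

        Returns the list of merges as (index_i, index_j, KR distance at merge).
        """
        lengths = np.asarray(branch_lengths, dtype=float)

        def kr(a, b):  # p = 1 KR distance in the per-edge representation
            return float(np.sum(lengths * np.abs(a - b)))

        # each active cluster carries its averaged mass vector and its item count
        clusters = [(np.asarray(m, dtype=float), 1) for m in sample_masses]
        merges = []
        while len(clusters) > 1:
            # step 1: find the closest pair of active clusters
            best = None
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    d = kr(clusters[i][0], clusters[j][0])
                    if best is None or d < best[0]:
                        best = (d, i, j)
            d, i, j = best
            (mass_i, size_i), (mass_j, size_j) = clusters[i], clusters[j]
            # step 2: "squash" merge: weighted average of the two mass distributions
            merged_mass = (size_i * mass_i + size_j * mass_j) / (size_i + size_j)
            merges.append((i, j, d))
            clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
            clusters.append((merged_mass, size_i + size_j))
        return merges

The essential point is the merge step: the new cluster keeps an actual mass distribution, so the distances used at the next stage are recomputed from that distribution rather than averaged from previously computed distances.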

The KR distances between these mass distributions can be used to assign branch lengths to the clustering tree. Specifically, each internal node is associated with exactly one mass distribution, and the length of a given branch between two internal nodes is equal to the KR distance between the mass distributions associated with those two nodes. The mass distribution associated with each internal node of the clustering tree (a distribution on the phylogenetic tree) can be visualized using the software implementation. In contrast, for UPGMA the branch lengths are differences of "heights" that are calculated as certain averages of distances from the original distance matrix. (We note that in the default UPGMA implementation in R, the branch lengths for "pendant" branches leading to leaves are arbitrarily specified by the user, and thus the trees may not appear ultrametric.)

In the next section, we investigate connections between edge PCA and squash clustering, compare squash clustering and UPGMA in more detail, and show that squash clustering is consistent given ultrametric data.

Further results

Given probability measures P and Q on the rooted tree T, the Zolotarev-like L^p generalization of the KR distance is defined for p >= 1 as

Z_p(P, Q) = \left( \int_T \left| P(\tau(y)) - Q(\tau(y)) \right|^p \, \lambda(dy) \right)^{1/p},     (1)

where \lambda is the natural length measure on the tree and \tau(y) is the subtree on the other side of y from the root [7]. The classical KR distance is (1) with p = 1; this is the value that corresponds to weighted UniFrac. It is shown in [7] that choosing a different root does not change the distance. It is also noted there that if P and Q only assign mass to leaves of the tree and y is in the interior of an edge i, then P(\tau(y)) - Q(\tau(y)) does not depend on the position of y within that edge, so (1) reduces to a sum over edges of the branch length multiplied by the p-th power of this per-edge mass difference, furnishing a connection with edge PCA.
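Because the integrand in (1) is constant along each edge when the measures sit on the leaves, the distance can be computed as a short sum over edges. The following sketch assumes each sample has been reduced to, for every edge, the total mass on the side of that edge away from the root; the function and argument names are illustrative and are not taken from the guppy implementation.

    import numpy as np

    def kr_distance(subtree_mass_p, subtree_mass_q, branch_lengths, p=1.0):
        """Z_p distance of equation (1) for samples whose mass sits on the leaves.

        subtree_mass_p, subtree_mass_q : for each edge, the total mass of the
            respective sample on the side of that edge away from the root
        branch_lengths : edge lengths (the length measure lambda, edge by edge)
        p : the exponent; p = 1 recovers the classical KR / weighted UniFrac distance
        """
        diff = np.abs(np.asarray(subtree_mass_p, dtype=float) -
                      np.asarray(subtree_mass_q, dtype=float))
        return float(np.sum(np.asarray(branch_lengths) * diff ** p) ** (1.0 / p))

    # toy example: a two-leaf tree with unit branch lengths, samples on opposite leaves
    print(kr_distance([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]))  # 2.0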

At each stage of the squash clustering algorithm we have a pairwise distance matrix with rows and columns indexed by the clusters that have already been made by the algorithm. Initially, the clusters are just the individual samples and the entries in the pairwise distance matrix are computed using equation (1).

We now compare UPGMA and squash clustering in more detail. For UPGMA, if clusters A and B containing respective numbers of items m and n are merged to form a cluster C with m + n items, then the average-linkage distance between another cluster D with k items and the new cluster C is (writing d for the distance between individual items)

d(C, D) = \frac{1}{(m+n)\,k} \sum_{x \in C} \sum_{y \in D} d(x, y) = \frac{m}{m+n}\, d(A, D) + \frac{n}{m+n}\, d(B, D),

and so the entries of the updated UPGMA distance matrix are just suitably weighted averages of the entries of the previous distance matrix.
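In code, the update is a one-line weighted average; the sketch below uses hypothetical names.

    def upgma_update(d_AD, d_BD, m, n):
        """Average-linkage (UPGMA) distance from the merged cluster C = A u B to
        another cluster D, given the previous distances d(A, D), d(B, D) and the
        cluster sizes m = |A|, n = |B|."""
        return (m * d_AD + n * d_BD) / (m + n)

    # example: |A| = 2, |B| = 3, d(A, D) = 4.0, d(B, D) = 6.0  ->  5.2
    print(upgma_update(4.0, 6.0, 2, 3))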

At each stage of squash clustering, on the other hand, a cluster is associated with a probability measure on the tree T. When two clusters A and B containing respective numbers of items m and n and associated with respective probability measures P_A and P_B are merged to form a cluster C, then the new cluster C is associated with the probability measure

P_C = \frac{m}{m+n} P_A + \frac{n}{m+n} P_B,

and the distance from C to some other cluster D associated with the probability measure P_D is

d(C, D) = Z_p\!\left( \frac{m}{m+n} P_A + \frac{n}{m+n} P_B, \; P_D \right),     (2)

which is analogous to the above equation for the UPGMA averaging procedure. As remarked above, the "squash" interpretation of (2) comes from recalling that the probability measures associated with the two clusters are each simple averages of all of the measures for the items in the clusters (Fig. 2). That is, if P_x is the probability measure associated with original item x, then

P_A = \frac{1}{m} \sum_{x \in A} P_x

and

P_B = \frac{1}{n} \sum_{x \in B} P_x,

and the probability measure associated with the new cluster C is

P_C = \frac{1}{m+n} \sum_{x \in A \cup B} P_x,

the (unweighted) average of the probability measures of the items in A \cup B.

A natural question to ask is whether the distance between a probability measure P and the weighted average of two probability measures Q_1 and Q_2 is equal to the similarly weighted average of the distance between P and Q_1 and the distance between P and Q_2. The answer is in general "no": starting from (1), the Minkowski inequality gives, for p >= 1 and 0 <= a <= 1,

Z_p\bigl(P, \; a Q_1 + (1-a) Q_2\bigr) \le a\, Z_p(P, Q_1) + (1-a)\, Z_p(P, Q_2).

The early iterations of the UPGMA and squash clustering algorithms can be quite similar because the pairs of objects being merged are close together relative to their distance to the other objects. For example, if p = 1, then the above inequality is an equality whenever P(\tau(y)) - Q_1(\tau(y)) and P(\tau(y)) - Q_2(\tau(y)) have the same sign for all y.
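A toy computation with the p = 1, per-edge form of the distance illustrates that the inequality can be strict (the array names below are illustrative): on a two-leaf tree with unit branch lengths, a sample splitting its mass evenly between the two leaves coincides with the even mixture of the two single-leaf samples, so the left-hand side is 0 while the right-hand side is 1.

    import numpy as np

    lengths = np.array([1.0, 1.0])         # two-leaf "cherry" with unit branch lengths
    Q1 = np.array([1.0, 0.0])              # all mass on leaf 1
    Q2 = np.array([0.0, 1.0])              # all mass on leaf 2
    P  = np.array([0.5, 0.5])              # half of the mass on each leaf

    def z1(a, b):                          # p = 1 KR distance in the per-edge form
        return float(np.sum(lengths * np.abs(a - b)))

    mix = 0.5 * Q1 + 0.5 * Q2              # the even mixture of Q1 and Q2 equals P here
    print(z1(P, mix))                      # 0.0
    print(0.5 * z1(P, Q1) + 0.5 * z1(P, Q2))  # 1.0, strictly larger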

Consistency of squash clustering on ultrametric data

An appealing feature of UPGMA is that if the pairwise distances used to initialize the algorithm are the leaf-to-leaf distances of an ultrametric rooted tree T, then UPGMA is guaranteed to return T. In this section we show that squash clustering has a similar property in a simple special case. This observation complements the validation work done using simulation to show that squash clustering does recover hierarchical structure when it is present.

In order to explain the result for squash clustering, we must first review the simple demonstration of the above result for UPGMA.

Imagine that the ultrametric rooted tree T is oriented on the page with the root at the top and the leaves at the bottom, and for simplicity assume that it is a bifurcating tree. By the assumption of ultrametricity, all the leaves sit on a horizontal line. Imagine that the internal nodes u_1, u_2, ... of T are listed in order of increasing distance from this line, so that u_1 is the closest. For simplicity, suppose further that no two of these distances are equal, so that we do not have to adopt an arbitrary convention for breaking ties. Each internal node corresponds to a set of leaves, namely those that are below it.

We proceed inductively to demonstrate that the merges done by the algorithm reproduce, in order, the sets of leaves below the internal nodes u_1, u_2, ..., and that the distances between clusters assigned by UPGMA agree with the original node-to-node distances in T. The base case is trivial. Assume the algorithm satisfies the inductive hypothesis for all u_j with j < k. The two nodes descending from u_k in T are each either an internal node of the form u_j for some j < k or a single leaf. Call the two corresponding sets of leaves below these nodes A and B. By induction, A and B are present among the clusters that have been constructed by UPGMA after the (k - 1)-st merge. The distance in T between any pair of leaves x and y with x in A and y in B is the same. By construction, the UPGMA distance between A and B,

d(A, B) = \frac{1}{|A|\,|B|} \sum_{x \in A} \sum_{y \in B} d(x, y),

is equal to the distance between any two such leaves x and y. Furthermore, the UPGMA distance between A (resp. B) and any other cluster D present after k - 1 UPGMA merges is equal to the common distance in T between any leaf in A (resp. B) and any leaf in D. Moreover, by the definition of u_k, this common distance is greater than the UPGMA distance between the clusters A and B. It is now clear that the k-th merge performed by UPGMA merges the clusters A and B to produce a cluster that coincides with the set of leaves below u_k in T, and that the updating of distances maintains the agreement between node-to-node distances in T and UPGMA cluster-to-cluster distances.

A similar argument leads to an analogous statement for squash clustering. Again, assume that the reference tree T is an ultrametric rooted tree. For each leaf x of T, assume that there is a single sample consisting of a single read mapped to x. We will show that in this case both squash clustering and UPGMA applied to the KR (p = 1) distances return the reference tree T as the clustering tree.

First note that the KR distance between the two samples corresponding to leaves x and y is simply the distance on the tree between x and y. These distances are ultrametric by assumption. Thus, UPGMA run with KR distances will return T as the clustering tree in this case.

Further, squash clustering and UPGMA start with the same clusters (each read in a cluster by itself), every such cluster is trivially the set of leaves below a node of the reference tree T, and the distances between clusters are the same for the two methods. Suppose, then, that after some number of iterations of the two methods we are still in a situation where the two methods have the same clusters available to merge, these clusters are disjoint sets of leaves below nodes of T, and the distances between the clusters available to merge are the same for the two methods.

Call the available clusters C_1, ..., C_r. By definition, squash clustering and UPGMA will merge the same pair of clusters; say, without loss of generality, C_1 and C_2. The squash clustering distance between C_1 and C_2 is the optimal transport (earth mover's) distance between the probability measure that puts mass 1/|C_1| at each leaf of C_1 and the probability measure that puts mass 1/|C_2| at each leaf of C_2. Because, as we remarked above, the tree distance between x and y is the same for every x in C_1 and y in C_2, and every unit of mass must be moved from a leaf of C_1 to a leaf of C_2 regardless of the transport plan, the optimal transport distance is necessarily this common value. Thus, the updating of the distances between the clusters available for merging is the same for the two methods. Therefore, by induction, the trees produced by the two methods will be the same and will coincide with the tree T.

Simulation methodology for clustering validation

In this section we present methodology for making artificial “samples” that are hierarchically related. These are then used to compare squash clustering to UPGMA. The code for these simulations can be found on the commiesim branch of pplacer at http://github.com/matsen/pplacer/tree/commiesim.

Start with a true "clustering tree" C: the tree of communities on which we are simulating. Let R be a phylogenetic "reference" tree of the organisms of interest: the phylogenetic tree of the actual species from which the simulated placements will be drawn. Write L(R) for the set of leaves of R. Before describing the simulation we recall some standard terminology. A split of R is the partition of the leaves L(R) induced by an edge of R: it consists of the two subsets of L(R) that are on either side of the edge. We have U \cup V = L(R) and U \cap V = \emptyset, and we use the notation U | V to denote that the subsets U and V form a split.

The first step of simulation assigns subsets of L(R) to the leaves of the clustering tree C. The elements of each such subset are the organisms found in that particular "community"; the community will then be used to generate simulated placements by sampling some number of members of the community with replacement. For example, suppose that a leaf z of the clustering tree C is associated with the set W of leaves of the reference tree R; to generate a sampled collection of placements for z we first sample from W with replacement. The resulting multiset of leaves of the reference tree R is made into a collection of placements by turning each element into a placement consisting of a unit point mass at the given leaf of the reference tree.
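A minimal sketch of this sampling step might look as follows; the function, argument, and taxon names are illustrative and are not taken from the commiesim code.

    import random
    from collections import Counter

    def simulate_placements(community_leaves, n_reads, seed=None):
        """Simulate one sample for a leaf of the clustering tree.

        community_leaves : reference-tree leaf names assigned to this community
        n_reads          : number of placements to generate

        Each simulated read becomes a placement consisting of a unit point mass
        at a reference-tree leaf drawn uniformly with replacement.
        """
        rng = random.Random(seed)
        draws = rng.choices(list(community_leaves), k=n_reads)
        # the sample is summarized as: leaf name -> number of unit point masses there
        return Counter(draws)

    # example: a community of three reference leaves, 5 simulated reads
    print(simulate_placements(["Lactobacillus_1", "Gardnerella_1", "Prevotella_1"], 5, seed=0))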

These simulated collections of placements are then used to reconstruct the clustering tree by applying either squash clustering or UPGMA on the KR distances.

Subsets of the leaf set L(R) of R are assigned to leaves of the clustering tree C by a recursive procedure that proceeds down the clustering tree beginning with the root. At each stage there is a current internal node v of C and a set of leaf subsets S(v) associated with v. The recursion is initialized at the root with the single subset L(R). We proceed down the tree C from a node v in two stages: we first split the set of subsets S(v) and then assign some of these subsets to each child of v.

The splitting stage is done by selecting splits (a.k.a. bipartitions) of L(R) and using them to cut apart the leaf subsets. For example, suppose that S(v) = {A_1, ..., A_m} is the set of subsets of L(R) associated with the internal node v of C that we are currently processing. We select an "effective" split U | V of R, i.e. one such that A_i \cap U and A_i \cap V are both non-empty for some i. Applying this split replaces each A_i by the non-empty sets among A_i \cap U and A_i \cap V, producing a new collection of leaf subsets. Each of the A_i corresponds to a connected region of the reference tree R, and applying an effective split corresponds to disconnecting one of those regions by cutting an edge of R. In the simulation, we sample an integer N from a Poisson distribution with mean \mu and then sample N effective splits uniformly with replacement from the set of all effective splits for the subsets in S(v). We apply those splits successively as above to split the subsets in S(v). This splitting produces a new set of leaf subsets that we call S*(v).

Next, for each child of the current internal node v, we select K of the subsets in S*(v) to pass on to that child. We do this in such a way that k of the subsets selected are the same for each child, while the remaining K - k are selected independently of the corresponding selections for the other children. Here K is a fixed parameter and k is a realization of a binomial distribution with number of trials K and success parameter r(v). The "reconstructability parameter" r(v) determines the level of similarity between the children of v: for internal nodes with high r(v) the subsets assigned to the children will be quite similar, while for those with low r(v) the subsets will tend to be different.

More specifically, suppose that the children of v are the nodes v_1, ..., v_c. We first sample k elements from S*(v) with replacement to make a set B of subsets of L(R) with at most k elements. Next, for each child v_j, we sample K - k elements from S*(v) with replacement to make a set I_j of subsets of L(R) with at most K - k elements. Then S(v_j), the set of subsets associated with the node v_j, is defined to be the set B \cup I_j. By recurring in this fashion, every node v of the clustering tree C is assigned some set S(v) of subsets of the set of leaves L(R) of the reference tree R. For each leaf z of the clustering tree, placements are simulated as described above from the set of leaf subsets S(z).
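The child-assignment step can be sketched as follows; the function and argument names are illustrative (not taken from the commiesim code), and the handling of duplicates produced by sampling with replacement is simplified.

    import numpy as np

    def assign_to_children(parent_subsets, n_children, K, r, seed=None):
        """Pass leaf subsets from an internal node of the clustering tree to its children.

        parent_subsets : list of leaf subsets (e.g. frozensets of reference-tree leaves)
        n_children     : number of children of the current node
        K              : number of subsets handed to each child (fixed parameter)
        r              : reconstructability parameter for this node, in [0, 1]

        k ~ Binomial(K, r) of the subsets are shared by all children; the remaining
        K - k are drawn independently for each child, all with replacement.
        """
        rng = np.random.default_rng(seed)
        k = rng.binomial(K, r)
        shared = [parent_subsets[i] for i in rng.integers(len(parent_subsets), size=k)]
        assignments = []
        for _ in range(n_children):
            own = [parent_subsets[i] for i in rng.integers(len(parent_subsets), size=K - k)]
            assignments.append(shared + own)
        return assignments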

For the study reported in Figure 9, the following parameters were used. The clustering tree C was a fixed bifurcating rooted tree on six leaves, written in the usual "Newick" bracketing notation for binary rooted trees. The reference tree R was the tree for microbes in the vaginal environment used in the rest of the paper. 500 trials were performed for every parameter setting, and 100 placements were generated for each clustering leaf of each trial. The mean number of cuts \mu was set to 10, and the number of sets selected per child, K, was set to 5. The reconstructability parameters r(v) for all internal nodes were set to the value specified in the panel label of the figure.

The Robinson-Foulds (RF) metric [30] between two trees was computed as half the size of the symmetric difference of the split sets of the two trees. Because the classical RF distance is calculated on unrooted trees, while the clustering trees in the study are rooted, we attached a fictitious "root leaf" to the root before calculating RF distances, so that the position of the root is taken into account. We call the resulting quantity the rooted Robinson-Foulds distance. For a bifurcating tree on six leaves, such as the clustering tree used in the study, the maximal rooted RF distance is four.
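A minimal sketch of this computation, with illustrative names, treats each tree as a set of splits (bipartitions of the sample names, with the fictitious root leaf included) and takes half the size of the symmetric difference.

    def rooted_rf_distance(splits_a, splits_b):
        """Rooted Robinson-Foulds distance between two clustering trees.

        splits_a, splits_b : sets of splits, one set per tree, where each split is a
            frozenset of two frozensets giving a bipartition of the sample names
            (the fictitious "root leaf" is included so the root position counts).

        Following the text, the distance is half the size of the symmetric
        difference of the two split sets.
        """
        return len(splits_a ^ splits_b) / 2

    def split(side, all_names):
        """Build a bipartition from one of its sides."""
        return frozenset([frozenset(side), frozenset(all_names - set(side))])

    # toy usage on trees over samples {a, b, c} plus the root leaf (illustrative only)
    names = {"a", "b", "c", "root"}
    t1 = {split({"a", "b"}, names)}
    t2 = {split({"b", "c"}, names)}
    print(rooted_rf_distance(t1, t2))  # 1.0: the trees disagree on one split each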

Supporting Information

Figure S1

A comparison of the clustering results for the Fredricks data using the software of [32]. The software uses the Hungarian (a.k.a. Munkres) algorithm to find an optimal one-to-one matching between edges of the trees, minimizing differences in a topological score between pairs of matched branches, as follows. Given two trees T_1 and T_2 on the same samples, let B_1 and B_2 be the sets of bipartitions of the samples induced by cutting the edges of T_1 and T_2, respectively. For two bipartitions b_1 in B_1 and b_2 in B_2, one associates an "agreement score" describing the proportion of shared elements between the sides of the bipartitions. The algorithm finds a one-to-one matching between B_1 and B_2 that minimizes the total agreement score between matched bipartitions. Each tree is drawn in a way that shows the agreement scores: a thick branch represents an edge that has a low agreement score with its partner in the matching. The program arranges the trees such that matched edges are close to one another. Branches shown in red indicate that the colored branch is longer than its partner in the other tree, while branches in blue indicate the opposite; the intensity of the color indicates the degree of this difference.

(PDF)

Figure S2

The combined vaginal samples divided by race, plotted with respect to the first two principal components and colored by Nugent score.

(TIFF)

Acknowledgments

Aaron Gallagher co-developed the guppy framework and helped write the simulation code to validate squash clustering, Chris Small wrote the support overlap minimization code, and Noah Hoffman created the vaginal reference set with Sujatha Srinivasan. This work would not have been possible without an ongoing collaboration with David Fredricks, Noah Hoffman, Martin Morgan, and Sujatha Srinivasan at the Fred Hutchinson Cancer Research Center. The authors gratefully acknowledge Jacques Ravel, Pawel Gajer, and Larry Forney for sharing the data from their study of the vaginal microbiome. Jonathan Eisen and Aaron Darling provided early feedback on this work. Graphics were made with ggplot2 [31] and the archaeopteryx tree viewer (http://www.phylosoft.org/archaeopteryx/).

Funding Statement

The first author was supported in part by National Institutes of Health grant HG005966-01, and the second author was supported in part by National Science Foundation grant DMS-09-07630. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Jaccard P (1908) Nouvelles recherches sur la distribution florale. Bull Soc Vaudoise Sci Nat 44: 223–270.
2. Lozupone C, Knight R (2005) UniFrac: a new phylogenetic method for comparing microbial communities. Appl Environ Microbiol 71: 8228. [PMC free article] [PubMed]
3. Lozupone CA, Hamady M, Kelley ST, Knight R (2007) Quantitative and qualitative beta diversity measures lead to different insights into factors that structure microbial communities. Appl Environ Microbiol 73: 1576–85. [PMC free article] [PubMed]
4. Costello E, Lauber C, Hamady M, Fierer N, Gordon J, et al. (2009) Bacterial community variation in human body habitats across space and time. Science 326: 1694–1697. [PMC free article] [PubMed]
5. Ley R, Turnbaugh P, Klein S, Gordon J (2006) Microbial ecology: human gut microbes associated with obesity. Nature 444: 1022–1023. [PubMed]
6. Nemergut D, Costello E, Hamady M, Lozupone C, Jiang L, et al. (2011) Global patterns in the biogeography of bacterial taxa. Environ Microbiol 13: 135–144. [PubMed]
7. Evans SN, Matsen FA (2012) The phylogenetic Kantorovich-Rubinstein metric for environmental sequence samples. J Royal Stat Soc (B) 74: 569–592. [PMC free article] [PubMed]
8. Berger SA, Krompass D, Stamatakis A (2011) Performance, accuracy, and web server for evolutionary placement of short sequence reads under maximum likelihood. Syst Biol 60: 291. [PMC free article] [PubMed]
9. Matsen FA, Kodner RB, Armbrust EV (2010) pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC Bioinformatics 11: 538. [PMC free article] [PubMed]
10. Wang H, Marron J (2007) Object oriented data analysis: Sets of trees. Ann Stat 35: 1849–1873.
11. Nye T (2011) Principal components analysis in the space of phylogenetic trees. Ann Stat 39: 2716–2739.
12. Bik EM, Eckburg PB, Gill SR, Nelson KE, Purdom EA, et al. (2006) Molecular analysis of the bacterial microbiota in the human stomach. Proc Natl Acad Sci USA 103: 732–7. [PubMed]
13. Purdom E (2008) Analyzing data with graphs: Metagenomic data and the phylogenetic tree. UC Berkeley Statistics Technical Reports 766: 1–22.
14. Mitra S, Klar B, Huson DH (2009) Visual and statistical comparison of metagenomes. Bioinformatics 25: 1849–55. [PubMed]
15. Segata N, Izard J, Waldron L, Gevers D, Miropolsky L, et al. (2011) Metagenomic biomarker discovery and explanation. Genome Biol 12: R60. [PMC free article] [PubMed]
16. Matsen F, Hoffman N, Gallagher A, Stamatakis A (2012) A format for phylogenetic placements. PLOS ONE 7: e31009. [PMC free article] [PubMed]
17. Srinivasan S, Hoffman N, Morgan M, Matsen F, Fiedler T, et al. (2012) Bacterial communities in women with bacterial vaginosis: high resolution phylogenetic analyses reveal relationships of microbiota to clinical criteria. PLoS ONE 7: e37818. [PMC free article] [PubMed]
18. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, et al. (2004) Bioconductor: Open software development for computational biology and bioinformatics. Genome Biol 5: R80. [PMC free article] [PubMed]
19. R Development Core Team (2011) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. Available: http://www.R-project.org.
20. Ravel J, Gajer P, Abdo Z, Schneider G, Koenig S, et al. (2011) Vaginal microbiome of reproductive-age women. Proc Natl Acad Sci USA 108: 4680. [PubMed]
21. Cole JR, Wang Q, Cardenas E, Fish J, Chai B, et al. (2009) The Ribosomal Database Project: improved alignments and new tools for rRNA analysis. Nucleic Acids Res 37: D141–D145. [PMC free article] [PubMed]
22. Stamatakis A (2006) RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22: 2688–2690. [PubMed]
23. Nawrocki E, Kolbe D, Eddy S (2009) Infernal 1.0: inference of RNA alignments. Bioinformatics 25: 1335–1337. [PMC free article] [PubMed]
24. Nugent RP, Krohn MA, Hillier SL (1991) Reliability of diagnosing bacterial vaginosis is improved by a standardized method of gram stain interpretation. J Clin Microbiol 29: 297–301. [PMC free article] [PubMed]
25. Zozaya-Hinchliffe M, Martin D, Ferris M (2008) Prevalence and abundance of uncultivated Megasphaera-like bacteria in the human vaginal environment. Appl Environ Microbiol 74: 1656. [PMC free article] [PubMed]
26. Brady A, Salzberg S (2009) Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models. Nat Methods 6: 673–676. [PMC free article] [PubMed]
27. Kuczynski J, Liu Z, Lozupone C, McDonald D, Fierer N, et al. (2010) Microbial community resemblance methods differ in their ability to detect biologically relevant patterns. Nat Methods 7: 813–819. [PMC free article] [PubMed]
28. White JR, Navlakha S, Nagarajan N, Ghodsi MR, Kingsford C, et al. (2010) Alignment and clustering of phylogenetic markers - implications for microbial diversity studies. BMC Bioinformatics 11. [PMC free article] [PubMed]
29. Morgan JL, Darling AE, Eisen JA (2010) Metagenomic Sequencing of an In Vitro-Simulated Microbial Community. PLoS ONE 5. [PMC free article] [PubMed]
30. Robinson DF, Foulds LR (1981) Comparison of phylogenetic trees. Math Biosci 53: 131–147.
31. Wickham H (2009) ggplot2: elegant graphics for data analysis, volume 35 of use R! New York: Springer, 217 pp. doi:10.1007/978-0-387-98141-3.
32. Nye TMW, Liò P, Gilks WR (2006) A novel algorithm and web-based tool for comparing two alternative phylogenetic trees. Bioinformatics 22: 117–9. [PubMed]
