Hierarchical clustering is a statistical method to discover and visualize structure in a multivariate dataset. It builds a tree-like model where different data objects (e.g. tumors in our case) or clusters of objects are connected to each other in order of their degree of similarity. The similarity between two tumors is often measured by the correlation coefficient of the marker values in the pair of tumors or alternatively by the Euclidean distance between them. The similarity between two clusters of tumors is determined by the linkage function. This could be the distance between the two closest (single linkage) or farthest (complete linkage) members of the clusters, or the average distance across all pairs of members of the two clusters (average linkage). In the clonality literature, two tumors from a patient are considered to be of clonal origin if they cluster together in a terminal branch of the tree.
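The three linkage functions described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not code from any of the cited studies: tumors are represented as tuples of marker values, similarity is measured by Euclidean distance, and the function names and toy data are hypothetical.

```python
from itertools import product
from math import dist  # Euclidean distance, Python 3.8+


def pairwise_distances(cluster_a, cluster_b):
    """Euclidean distances between every member of one cluster and every member of the other."""
    return [dist(a, b) for a, b in product(cluster_a, cluster_b)]


def single_linkage(cluster_a, cluster_b):
    """Distance between the two closest members of the clusters."""
    return min(pairwise_distances(cluster_a, cluster_b))


def complete_linkage(cluster_a, cluster_b):
    """Distance between the two farthest members of the clusters."""
    return max(pairwise_distances(cluster_a, cluster_b))


def average_linkage(cluster_a, cluster_b):
    """Average distance across all pairs of members of the two clusters."""
    distances = pairwise_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)


# Hypothetical toy data: two clusters of two tumors, two marker values each.
cluster_a = [(0.0, 0.0), (0.0, 1.0)]
cluster_b = [(3.0, 0.0), (4.0, 0.0)]

print(single_linkage(cluster_a, cluster_b))
print(complete_linkage(cluster_a, cluster_b))
print(average_linkage(cluster_a, cluster_b))
```

By construction the single-linkage distance can never exceed the average-linkage distance, which in turn can never exceed the complete-linkage distance, so the choice of linkage changes which clusters are merged first.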
Many recent articles have exploited hierarchical clustering in this setting, including work by Waldman et al. (36), Texiera et al. (26), Wong et al. (41), Torres et al. (39), Ghazani et al. (37), Agelopoulos et al. (38), Brommesson et al. (42), Liu et al. (43) and Yang et al. (44). Although many of these authors also used other analytical and biological methods to support their conclusions, hierarchical clustering was one of the principal tools. Several of the preceding articles explicitly state or assume that tumors that cluster together in a terminal branch are closer to each other than to any other tumor with respect to the measure of similarity used in the analysis. However, this is not true unless single linkage is used. To our knowledge, studies of clonality have instead used complete, average or Ward linkage (45), and none of them used single linkage. Thus, the diagnosis for any two tumors from a particular patient is influenced by both their relationship to other tumors in the cohort and the clustering patterns of other tumors among themselves. Yet the diagnosis of clonality in an individual patient should be informed primarily by the similarity of the tumors from the patient under investigation, rather than by their relationships with tumors from other patients.
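The point about linkage can be checked on a toy example. The sketch below is an illustration under stated assumptions rather than a reproduction of any cited analysis: each tumor is reduced to a single hypothetical marker value, distance is the absolute difference, and `terminal_pairs` is a deliberately naive agglomerative routine written for this example. With complete linkage, tumors 2 and 3 end up together in a terminal branch even though the nearest neighbor of tumor 2 is tumor 1. With single linkage, that pairing does not occur.

```python
from itertools import combinations


def terminal_pairs(points, linkage):
    """Naive agglomerative clustering of 1-D points.

    `linkage` is `min` (single linkage) or `max` (complete linkage).
    Returns the index pairs that were merged while both were still
    single objects, i.e. the terminal branches of the tree.
    """
    clusters = [(i,) for i in range(len(points))]
    pairs = []

    def cluster_distance(a, b):
        return linkage(abs(points[i] - points[j]) for i in a for j in b)

    while len(clusters) > 1:
        # Merge the two clusters that are closest under the chosen linkage.
        a, b = min(combinations(clusters, 2), key=lambda ab: cluster_distance(*ab))
        if len(a) == 1 and len(b) == 1:
            pairs.append((a[0], b[0]))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return pairs


def nearest_neighbor(points, i):
    """Index of the point closest to point i."""
    others = (j for j in range(len(points)) if j != i)
    return min(others, key=lambda j: abs(points[i] - points[j]))


# Hypothetical marker values for four tumors (indices 0 to 3).
marker_values = [0.0, 1.0, 2.2, 4.0]

single_pairs = terminal_pairs(marker_values, min)
complete_pairs = terminal_pairs(marker_values, max)

print("single linkage terminal pairs:  ", single_pairs)
print("complete linkage terminal pairs:", complete_pairs)
print("nearest neighbor of tumor 2:    ", nearest_neighbor(marker_values, 2))
```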
In our view, hierarchical clustering is an inappropriate analytic tool for assessing tumor clonality. Hierarchical clustering is an unsupervised classification technique that assumes the number of clusters and their members are unknown, and its aim is to uncover possible general structure in the data. In clonality testing, by contrast, the hypothetical clusters (tumors within patients) are known, and the task at hand is to test whether pairs of tumors within a hypothetical cluster are independent or clonal. Ignoring this known structure is not an efficient use of information, and it is naturally more difficult to identify the structure when all tumors are pooled together.
On a related note, even if there is absolutely no dependence structure in the data, hierarchical clustering by design still identifies clusters, and by chance there are likely to be tumors from the same patient that end up in a “clonal” pairing. Thus, using hierarchical clustering without any significance assessment may be misleading. There are statistical methods that allow one to test whether clusters are reproducible by assessing similarity between clusters of perturbed data (46). We note that some of the referenced studies, but not all, assessed whether the number of pairs of tumors determined to be clonal is higher than what would be expected by chance. To accomplish this, the patient labels are randomly permuted and assigned to the final branches of the dendrogram, and the number of within-patient pairings is counted. This permutation is repeated many times. If the original number of clonal patients lies in the upper 5th percentile of the counts from these permutations, then there is statistically significant evidence of genuine clustering in the data. Other similar techniques have also been used. While such an approach can test for false positive clustering overall in the data, it cannot assess the confidence of the diagnosis for any particular patient.
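The permutation procedure described above can be sketched as follows. This is a minimal illustration, not the implementation used in any of the cited studies. It assumes the dendrogram has already been built and summarized as a list of terminal pairs of leaf indices, and the data, function names and default number of permutations are hypothetical.

```python
import random


def count_within_patient_pairs(terminal_pairs, patient_labels):
    """Number of terminal branches whose two tumors carry the same patient label."""
    return sum(patient_labels[i] == patient_labels[j] for i, j in terminal_pairs)


def permutation_p_value(terminal_pairs, patient_labels, n_permutations=2000, seed=0):
    """One-sided permutation p-value for the observed number of within-patient pairs.

    The tree is held fixed. Only the assignment of patient labels to its
    leaves is shuffled.
    """
    rng = random.Random(seed)
    observed = count_within_patient_pairs(terminal_pairs, patient_labels)
    shuffled = list(patient_labels)
    at_least_as_large = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if count_within_patient_pairs(terminal_pairs, shuffled) >= observed:
            at_least_as_large += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (at_least_as_large + 1) / (n_permutations + 1)


# Hypothetical cohort: 10 patients with 2 tumors each, so leaves 0-1 belong to
# patient 0, leaves 2-3 to patient 1, and so on.
patient_labels = [leaf // 2 for leaf in range(20)]

# Hypothetical terminal branches read off a dendrogram: four within-patient
# pairings and two pairings of tumors from different patients.
terminal_pairs = [(0, 1), (2, 3), (4, 5), (6, 7), (8, 11), (9, 14)]

observed, p_value = permutation_p_value(terminal_pairs, patient_labels)
print(observed, p_value)
```

A p-value below 0.05 corresponds to the observed count lying in the upper 5th percentile of the permutation counts. As the text notes, this is a statement about the cohort as a whole and says nothing about which individual pairings are genuine.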
Finally, when hierarchical clustering is used, the more tumors there are in the dataset, the less likely it is that any two particular clonal tumors will be sufficiently close to each other to cluster together in a terminal branch and be classified as clonal. This occurs simply because, as the dataset grows, the clonal pair has to beat more and more competitors. Thus, inevitably, the sensitivity of hierarchical clustering to detect specific clonal pairs must decrease steadily as the sample size increases. These concepts are illustrated later in the section entitled “Simulation”.
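The competition effect can also be seen in a small toy simulation. The sketch below is not the simulation reported in this article. It rests on assumptions chosen purely for illustration: independent standard normal marker values, a clonal partner generated by adding Gaussian noise to the first tumor's profile, Euclidean distance, and complete linkage. All names and parameter values are hypothetical.

```python
import random
from math import dist


def pair_forms_terminal_branch(profiles):
    """Complete-linkage clustering, stopped as soon as tumor 0 or tumor 1 is merged.

    Returns True if tumors 0 and 1 merge with each other first, i.e. they
    would sit together in a terminal branch of the tree.
    """
    n = len(profiles)
    # Distances between current clusters, keyed by (smaller id, larger id).
    d = {(i, j): dist(profiles[i], profiles[j])
         for i in range(n) for j in range(i + 1, n)}
    active = set(range(n))
    while True:
        i, j = min(d, key=d.get)  # closest pair of clusters
        if (i, j) == (0, 1):
            return True
        if i in (0, 1) or j in (0, 1):
            return False
        # Merge cluster j into cluster i. Under complete linkage the distance
        # from the merged cluster to any other cluster is the larger of the two.
        active.remove(j)
        for k in active:
            if k != i:
                key_i = (min(i, k), max(i, k))
                key_j = (min(j, k), max(j, k))
                d[key_i] = max(d[key_i], d[key_j])
        d = {key: value for key, value in d.items() if j not in key}


def sensitivity(n_other, n_markers=5, noise_sd=0.5, n_reps=200, seed=1):
    """Fraction of simulated cohorts in which the one clonal pair is detected."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_reps):
        first = [rng.gauss(0, 1) for _ in range(n_markers)]
        second = [x + rng.gauss(0, noise_sd) for x in first]  # clonal partner
        others = [[rng.gauss(0, 1) for _ in range(n_markers)]
                  for _ in range(n_other)]
        hits += pair_forms_terminal_branch([first, second] + others)
    return hits / n_reps


for n_other in (2, 10, 30):
    print(n_other, sensitivity(n_other))
```

Under these assumptions the estimated sensitivity falls as more unrelated tumors are added, although the clonal pair itself is generated in exactly the same way each time.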