Breast cancers can be classified by hierarchical clustering using an “intrinsic” gene list into one of at least five molecular subtypes: basal-like, HER2, luminal A, luminal B, and normal breast-like. Five different intrinsic gene lists composed of varying numbers of genes have been used for molecular subtype identification and classification of breast cancers. The aim of this study was to determine the objectivity and interobserver reproducibility of the assignment of molecular subtype classes by hierarchical cluster analysis.
Three publicly available breast cancer datasets (n = 779) were subjected to two-way average-linkage hierarchical cluster analysis using five distinct intrinsic gene lists. We used free-marginal Kappa statistics to analyze interobserver agreement among five breast cancer researchers for the whole classification and for each molecular subtype separately according to each intrinsic gene list for each breast cancer dataset.
None of the classification systems tested produced almost perfect agreement (Kappa ≥ 0.81) among observers. However, substantial interobserver agreement (70.8% to 76.1% of the samples and free-marginal Kappa scores from 0.635 to 0.701) was consistently observed in all datasets for four molecular subtypes (luminal, basal-like, HER2, and normal breast-like). When luminal cancers were subdivided (luminal A, B, and C), none of the classification systems produced substantial agreement (Kappa ≥ 0.61) in all the datasets analyzed. Analysis of each subtype separately revealed that only two (basal-like and HER2) could be reproducibly identified by independent observers (Kappa ≥ 0.81).
Assignment of molecular subtype classes of breast cancer based on the analysis of dendrograms obtained with hierarchical cluster analysis is subjective and shows modest interobserver reproducibility. For the development of a molecular taxonomy, objective definitions for each molecular subtype and standardized methods for their identification are required.
Hierarchical cluster analysis is used to classify tumors into subtypes identified through microarray-based gene expression profiling. These approaches are considered more objective and reproducible than histopathologic and immunohistochemical methods, but the subjectivity involved in assigning the molecular subtypes through dendrogram analysis has not been systematically analyzed.
The interobserver reproducibility among five breast cancer researchers experienced in molecular subtype assignment was determined using dendrograms and heatmaps generated by hierarchical clustering methods of three breast cancer datasets and five intrinsic gene lists.
The identification of subgroups of luminal cancers and normal breast-like cancers by visual inspection of dendrograms obtained from hierarchical cluster analysis shows suboptimal levels of interobserver agreement, even when the molecular subtypes are known a priori and guidelines for the identification of these subtypes are provided.
The assignment of molecular subtypes of breast cancer based on the visual inspection of dendrograms obtained with hierarchical cluster analysis is subjective and shows only modest interobserver reproducibility, particularly when subclassification of luminal cancers into two or more groups is required. Class discovery studies need to take into account both the stability of the clusters and the reproducibility of the classification system.
The datasets used were retrospectively accrued; hence, they may not have a balanced distribution of the different molecular subtypes and do not include samples of normal breast tissue. Publicly available microarray results were used for the hierarchical cluster analyses, and the data were not renormalized. A small proportion of genes from the intrinsic gene lists could not be reannotated in all datasets owing to different microarray platforms and changes in gene annotation.
From the Editors
The use of high-throughput methods for the analysis of cancers has provided new opportunities for understanding the diversity and heterogeneity of cancers and for devising classification systems that better recapitulate the biology and clinical behavior of human tumors. Class discovery studies have led to the identification of molecular subgroups of prognostic significance in multiple types of cancer, including lymphomas (1), sarcomas (2), pediatric malignancies (3), melanomas (4), and carcinomas (5,6). There is a perception that these approaches may be more objective and reproducible than histopathologic and immunohistochemical methods (7,8).
Microarray-based gene expression profiling has highlighted the existence of breast cancer subtypes with distinct biology and clinical behavior (9,10). Expression profiling class discovery studies have led to a working model for a breast cancer molecular taxonomy (5,11–14), which has become widely used and recently adopted for the design of clinical trials (eg, NCT00546156).
Breast cancers can be classified by hierarchical cluster analysis using an “intrinsic” gene list [ie, list of “genes with significantly greater variation in expression between different tumours than between paired samples from the same tumour” (5)] into one of at least five molecular subtype classes: luminal A, luminal B, basal-like, HER2, and normal breast-like (5,10–14). Hierarchical clustering algorithms aggregate samples based on the similarity of their gene expression patterns and produce dendrograms, which are two-dimensional representations of the similarity between the samples and genes analyzed (ie, for any two samples, the shorter the dendrogram branches that join them, the more similar their expression profiles). Five different intrinsic gene lists composed of varying numbers of genes have been reported (5,11–14). It has been assumed that molecular subtypes identified in different studies using different intrinsic gene lists are equivalent and reproducible with regard to their clinical, biological, and prognostic characteristics (ie, luminal A cancers in study “A” identified by the intrinsic gene list “a” are synonymous with luminal A cancers in study “B” identified by the intrinsic gene list “b”) (7,12,13).
Although hierarchical clustering has been widely used to identify molecular subtypes of breast cancer, this approach can only be applied retrospectively to sufficiently sized cohorts of patients (10,15) but not prospectively to individual samples. Therefore, three microarray-based “Single Sample Predictors” (SSPs) based on centroids (ie, the mean expression profile of each subtype) were developed (12–14). To define the SSPs, each molecular subtype was initially identified by hierarchical clustering based on the intrinsic gene list, and then the centroids of each molecular subtype (ie, luminal A, luminal B, HER2, basal-like, and normal breast-like) were derived. These SSPs, which can be applied to individual samples based on the correlations between the expression profile of a given sample and each of the centroids, recapitulate the classification obtained with hierarchical cluster analysis. We (16) and others (17) have recently demonstrated that the agreement between these SSPs is modest and that they can only reliably identify basal-like breast cancers.
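The centroid-based logic of an SSP can be illustrated with a minimal sketch. The expression values, the function names, and the use of Pearson correlation below are assumptions made for illustration only; they do not reproduce the published SSPs (12–14), whose gene lists, centroids, and correlation measures are defined in the original studies.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation between two equally long expression profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a flat profile carries no correlation information
    return cov / sqrt(var_a * var_b)


def build_centroids(profiles, labels):
    """Centroid = mean expression of each gene over the samples of a subtype."""
    grouped = {}
    for profile, label in zip(profiles, labels):
        grouped.setdefault(label, []).append(profile)
    return {
        label: [sum(values) / len(values) for values in zip(*members)]
        for label, members in grouped.items()
    }


def assign_subtype(sample, centroids):
    """Return the subtype whose centroid correlates best with the sample."""
    best = max(centroids, key=lambda label: pearson(sample, centroids[label]))
    return best, pearson(sample, centroids[best])


# Hypothetical training profiles (four genes) already labeled by clustering.
training = [
    [2.0, 1.5, -1.0, -2.0],
    [1.8, 1.7, -1.2, -1.6],
    [-2.0, -1.0, 1.5, 2.0],
    [-1.5, -1.4, 1.0, 2.2],
]
training_labels = ["basal-like", "basal-like", "luminal", "luminal"]

centroids = build_centroids(training, training_labels)
subtype, correlation = assign_subtype([1.0, 0.8, -0.5, -1.1], centroids)
print(subtype, round(correlation, 3))
```

Because the centroids are derived from subtypes that were first assigned by hierarchical clustering, any subjectivity in that initial assignment is carried over into the predictor.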
Previous studies have highlighted the biostatistical limitations of hierarchical cluster analysis of microarray expression profiles for the identification of molecular subtypes of breast cancer and the relative instability of some of the molecular subtypes identified by this type of approach (15,18–21). One fundamental aspect of microarray-based class discovery studies, which has not been systematically analyzed, is the subjectivity involved in assigning the molecular subtypes through the analysis of dendrograms generated with hierarchical clustering methods.
The aim of this study was to determine the objectivity and interobserver reproducibility of the assignment of molecular subtype classes by hierarchical clustering (ie, do different observers assign the same patients to the same molecular subtype when they analyze the same dendrogram?). To address this question, we subjected three breast cancer datasets in the public domain [NKI-295 (22), TransBig (23), and Wang (24)] to hierarchical cluster analysis using five intrinsic gene lists from Perou et al. (5), Sorlie et al. (11), Sorlie et al. (12), Hu et al. (13), and Parker et al. (14). Subsequently, we determined the interobserver reproducibility among five breast cancer researchers who are experienced in molecular subtype assignment using the dendrograms and heatmaps generated by hierarchical clustering methods and the five intrinsic gene lists.
Microarray data from the publicly available breast cancer datasets, NKI-295 (22) (n = 295), Wang (24) (n = 286), and TransBig (23) (n = 198), were used for hierarchical cluster analysis. The normalized microarray-based gene expression data were retrieved from the internet or public repositories (NKI-295: http://microarray-pubs.stanford.edu/wound_NKI/explore.html; Wang: GEO:GSE2034; TransBig: GEO:GSE7390). Further details about the datasets and data acquisition are provided in Supplementary Table 1 (available online).
The assignment of molecular subtypes of breast cancer based on hierarchical cluster analysis was essentially performed as previously described (5,11–14,25). The distinct intrinsic gene lists reported by Perou et al. (5) (496 probes corresponding to 349 unique Human Genome Organization [HUGO] [http://www.genenames.org/] gene symbols), Sorlie et al. (11) (456 probes corresponding to 395 unique HUGO gene symbols), Sorlie et al. (12) (552 probes corresponding to 492 unique HUGO gene symbols), Hu et al. (13) (1400 probes corresponding to 1176 unique HUGO gene symbols), and Parker et al. (14) (1918 probes and 1918 unique HUGO gene symbols) were retrieved (Supplementary Tables 2–6 and Supplementary Methods, available online).
A substantial proportion of the gene identifiers reported in the original publications have changed in more recent genome builds; therefore, annotations of intrinsic gene lists and breast cancer datasets were comprehensively updated and mapped to build 36 of the human genome (Ensembl assembly 54 [http://www.ensembl.org/index.html]) as described previously (16) (Supplementary Tables 2–6, available online). Of the identifiers tested (ie, HUGO gene symbols, Ensembl, and Unigene [http://www.ncbi.nlm.nih.gov/unigene]), the annotation with HUGO gene symbols allowed for the retrieval of the highest proportion of genes in the majority of intrinsic gene lists and datasets (Supplementary Table 7, available online). As observed in the original dendrograms and descriptions of the intrinsic gene lists (5,11–13), when multiple probes mapped to the same gene, all were included in the hierarchical cluster analysis. Analyses were performed in R version 2.9.0 (http://cran.rproject.org/).
Two-way average-linkage hierarchical clustering (median centered by feature and gene and Pearson correlation as the gene similarity metric) was applied to each dataset using Cluster 3.0 [http://bonsai.hgc.jp/~mdehoon/software/cluster/software.htm#ctv] as previously described (11–14,25), and results were visualized with Java Treeview [http://jtreeview.sourceforge.net/]. Additional details of the microarray data handling and hierarchical clustering are available in the Supplementary Methods (available online). Annotated datasets are available at http://rock.icr.ac.uk/collaborations/Mackay/centroid.correlations.Eset; annotated intrinsic gene lists used for hierarchical clustering, clustered data and Java Treeview files for each of the datasets presented are available at http://rock.icr.ac.uk/collaborations/Mackay/observer.clustering.
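The clustering step, and the effect of the level at which a dendrogram is cut, can be sketched in a few lines. This is a naive illustration under stated assumptions, not a reimplementation of Cluster 3.0: the expression matrix is invented, the function names are hypothetical, distance is taken as 1 minus the Pearson correlation, and only the samples are clustered (the same function could be applied to the rows to cluster genes).

```python
from math import sqrt
from statistics import median


def median_center_rows(matrix):
    """Subtract each row's median (here: each gene's median across samples)."""
    return [[v - median(row) for v in row] for row in matrix]


def pearson_distance(a, b):
    """1 - Pearson correlation; 0 for identical shapes, 2 for opposite ones."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 1.0
    return 1.0 - cov / sqrt(var_a * var_b)


def average_linkage(vectors):
    """Naive average-linkage clustering; returns [(left, right, height), ...]."""
    n = len(vectors)
    pair = {(i, j): pearson_distance(vectors[i], vectors[j])
            for i in range(n) for j in range(i + 1, n)}
    clusters = {i: [i] for i in range(n)}

    def cluster_distance(a, b):
        # Mean of all pairwise distances between members of the two clusters.
        values = [pair[(min(i, j), max(i, j))]
                  for i in clusters[a] for j in clusters[b]]
        return sum(values) / len(values)

    merges, next_id = [], n
    while len(clusters) > 1:
        ids = sorted(clusters)
        height, a, b = min((cluster_distance(a, b), a, b)
                           for pos, a in enumerate(ids) for b in ids[pos + 1:])
        merges.append((sorted(clusters[a]), sorted(clusters[b]), height))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges


def cut_dendrogram(merges, n_items, height):
    """Groups obtained by keeping only merges at or below the given height."""
    owner = list(range(n_items))
    groups = {i: [i] for i in range(n_items)}
    for left, right, merge_height in merges:
        if merge_height > height:
            break  # average-linkage merge heights do not decrease
        keep, drop = owner[left[0]], owner[right[0]]
        for item in groups.pop(drop):
            owner[item] = keep
            groups[keep].append(item)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical matrix: rows are genes, columns are four samples.
expression = [
    [5.0, 4.8, 1.0, 1.2],
    [4.0, 4.4, 0.5, 0.9],
    [1.0, 0.7, 3.9, 4.2],
    [0.5, 1.1, 5.0, 4.6],
]
centered = median_center_rows(expression)
samples = [list(column) for column in zip(*centered)]
merges = average_linkage(samples)
for cut_height in (0.0, 0.5, 2.0):
    print(cut_height, cut_dendrogram(merges, len(samples), cut_height))
```

The same dendrogram yields four, two, or one group depending solely on the cut height, which is the decision that, in the absence of explicit rules, is left to the observer.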
To determine the reproducibility of microarray-based classification of breast cancers by hierarchical cluster analysis, the study curator (J. S. Reis-Filho) selected five researchers on the basis of experience in microarray-based expression profiling analysis, previous publications on the use of microarray-based molecular taxonomy of breast cancer, and having a first or senior author publication on microarrays in a journal with a 2008 Thomson ISI impact factor greater than 5.
The study curator created guidelines that described in detail how each molecular subtype should be identified by the visual analysis of the dendrograms obtained with hierarchical cluster analysis for each intrinsic gene list. These guidelines were based on extracts from the studies by Perou et al. (5), Sorlie et al. (11), Sorlie et al. (12), Hu et al. (13), and Parker et al. (14), and they summarized the characteristics of each molecular subtype according to each intrinsic gene list, provided graphical representations of the dendrograms and heatmaps obtained by applying each intrinsic gene list to a separate dataset of breast cancers, and additional details extracted from their respective original studies (for details see Supplementary Methods, available online). It should be emphasized that the original publications (5,11–14) did not provide clear guidelines of the levels at which the dendrogram branches should be cut to define the molecular subtypes.
The guidelines, the dendrograms, and color heatmaps obtained from hierarchical cluster analysis of the three breast cancer datasets (22–24) using the five intrinsic gene lists (5,11–14) (Supplementary Figures 1–15, available online) were sent to five of the authors (A. Mackay, B. Weigelt, A. Grigoriadis, B. Kreike, R. Natrajan) via email, together with a copy of the original studies (5,11–14) describing the intrinsic gene lists. Observers were requested to classify each dataset according to the methods described by Perou et al. (5), Sorlie et al. (11), Sorlie et al. (12), Hu et al. (13), and Parker et al. (14), identifying all molecular subtypes described in each publication. If samples in a dendrogram could not be assigned to a molecular subtype with confidence, the observers could opt for considering the sample as unclassifiable, as done in Sorlie et al. (12) and Parker et al. (14). A request to keep the correspondence strictly confidential was made, and no discussions with other researchers were permitted. The identity of each observer was kept confidential from the other study participants. Molecular subtype assignments were made by each observer blinded to the results reported by the other observers and sent directly to the study curator.
Statistical analysis of the molecular subtype assignments made by the five observers was performed by two of the authors (RA’H and JSR-F), without providing any feedback to the observers. The percentage of overall agreement was calculated and the multirater analysis of agreement was performed as previously described (26). We used the free-marginal Kappa statistic of Brennan and Prediger (26), which is optimal for the assessment of agreement among more than two observers (ie, raters) when categorical variables are used and observers are not forced to assign a certain number of samples to each category. The choice of the free-marginal Kappa score was based on the fact that this method minimizes prevalence-related biases and is compatible with observers being allowed to consider samples that could not be assigned to a molecular subtype as unclassifiable. Using this statistical method, Kappa values can range from −1.0 to 1.0, with −1.0 indicating perfect disagreement below chance, 0.0 agreement equal to chance, and 1.0 perfect agreement above chance. Kappa values can be interpreted as follows: 0.01–0.2 as slight agreement, 0.21–0.4 as fair agreement, 0.41–0.6 as moderate agreement, 0.61–0.8 as substantial agreement, and 0.81–1.0 as almost perfect agreement (26–28).
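As an illustration of the statistic, a commonly used multirater form of the free-marginal Kappa takes the observed agreement as the mean proportion of agreeing observer pairs per sample and the chance agreement as 1/k for k permitted categories. The sketch below assumes that form; the ratings are invented, and the function names are hypothetical rather than taken from the software used in this study.

```python
from collections import Counter


def free_marginal_kappa(ratings, categories):
    """Return (overall agreement, free-marginal Kappa) for multirater data.

    ratings: one list per sample, holding one label per observer.
    categories: every label an observer was allowed to use.
    """
    k = len(categories)
    n_raters = len(ratings[0])
    if k < 2 or n_raters < 2:
        raise ValueError("need at least two categories and two raters")
    agreement_sum = 0.0
    for sample in ratings:
        if len(sample) != n_raters:
            raise ValueError("every sample needs the same number of raters")
        counts = Counter(sample)
        # Ordered pairs of observers that chose the same label for this sample.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        agreement_sum += agreeing_pairs / (n_raters * (n_raters - 1))
    p_observed = agreement_sum / len(ratings)
    p_chance = 1.0 / k  # free marginals: no fixed quota per category
    return p_observed, (p_observed - p_chance) / (1.0 - p_chance)


def interpret_kappa(kappa):
    """Map a Kappa value to the agreement bands quoted in the text."""
    if kappa >= 0.81:
        return "almost perfect"
    if kappa >= 0.61:
        return "substantial"
    if kappa >= 0.41:
        return "moderate"
    if kappa >= 0.21:
        return "fair"
    if kappa >= 0.01:
        return "slight"
    return "no agreement beyond chance"


# Hypothetical assignments by five observers for four samples.
categories = ["luminal", "basal-like", "HER2", "normal breast-like", "unclassifiable"]
ratings = [
    ["basal-like"] * 5,
    ["luminal", "luminal", "luminal", "normal breast-like", "normal breast-like"],
    ["HER2"] * 5,
    ["luminal", "luminal", "luminal", "luminal", "normal breast-like"],
]
overall, kappa = free_marginal_kappa(ratings, categories)
print(f"overall agreement {overall:.2f}, Kappa {kappa:.2f}, {interpret_kappa(kappa)}")
```

Because the chance term depends only on the number of permitted categories, the score is unaffected by how unevenly the subtypes are represented in a dataset, which is the property that motivated its use here.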
We determined the interobserver agreement for the whole classification obtained from the analysis of the dendrogram produced with each intrinsic gene list for each breast cancer dataset [NKI-295 (22), Wang (24) and TransBig (23)]. In addition, we analyzed interobserver agreement for each molecular subtype according to each intrinsic gene list for each breast cancer dataset.
To test whether the free-marginal Kappa scores, when the luminal group was subdivided into the A, B, and C subgroups [Sorlie et al. (11)], were statistically significantly lower than the free-marginal Kappa scores when the luminal cancers were considered as a single group [Perou et al. (5)], we used a nonparametric test (Mann–Whitney U test).
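The following sketch computes an exact one-tailed Mann–Whitney U test by enumerating all relabelings, assuming that the comparison involves one Kappa score per dataset in each group (three against three). The lowest and highest values in each group are the range limits reported in the Results; the middle values (0.500 and 0.670) are hypothetical placeholders, because only the ranges are given in the text.

```python
from itertools import combinations


def mann_whitney_u(x, y):
    """U statistic for x: pairs with x > y count 1, ties count 0.5."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)


def exact_one_tailed_p(x, y):
    """Exact P that U is as small as observed, ie, that x tends to be lower."""
    pooled = list(x) + list(y)
    observed = mann_whitney_u(x, y)
    indices = range(len(pooled))
    hits = total = 0
    for chosen in combinations(indices, len(x)):
        chosen_set = set(chosen)
        group_x = [pooled[i] for i in chosen]
        group_y = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        if mann_whitney_u(group_x, group_y) <= observed:
            hits += 1
    return hits / total


# Range limits from the text; middle values are hypothetical placeholders.
subdivided_luminal = [0.435, 0.500, 0.582]  # Sorlie et al. (11)
single_luminal = [0.635, 0.670, 0.701]      # Perou et al. (5)

print(mann_whitney_u(subdivided_luminal, single_luminal))
print(exact_one_tailed_p(subdivided_luminal, single_luminal))
```

Because the two ranges do not overlap, U is 0 whatever the middle values are, and only 1 of the 20 possible ways of splitting six scores into two groups of three gives a U that small. With three scores per group, 1/20 = .05 is therefore the smallest one-tailed P value this test can return.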
Samples from these datasets were not previously classified by the proponents (5,11–14) of the molecular classification into the molecular subtypes by means of hierarchical clustering using all five intrinsic gene sets tested in this work. Thus, given that there is no available “gold standard” for the classification of samples from the three breast cancer datasets analyzed here into molecular subtypes by hierarchical clustering for all intrinsic gene lists, we also determined the percentage of samples with perfect agreement (samples for which all raters/observers agreed on the classification of the sample) and the percentage of samples with a “majority consensus” (samples for which three or more raters/observers agreed on the classification of a given sample into one of the molecular subtypes).
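These two summary measures can be computed directly from the observers' assignments, as in the minimal sketch below. The ratings are invented, the function name is hypothetical, and the sketch counts any shared label toward a majority; whether a shared "unclassifiable" call should count is a choice not specified here.

```python
from collections import Counter


def consensus_summary(ratings, majority=3):
    """Return (% samples with perfect agreement, % with a majority consensus).

    ratings: one list per sample, holding one label per observer.
    majority: minimum number of observers that must share a label.
    """
    perfect = with_majority = 0
    for sample in ratings:
        # Size of the largest group of observers that chose the same label.
        top_count = Counter(sample).most_common(1)[0][1]
        if top_count == len(sample):
            perfect += 1
        if top_count >= majority:
            with_majority += 1
    n_samples = len(ratings)
    return 100.0 * perfect / n_samples, 100.0 * with_majority / n_samples


# Hypothetical assignments by five observers for four samples.
ratings = [
    ["basal-like"] * 5,
    ["luminal A", "luminal A", "luminal A", "luminal B", "luminal B"],
    ["luminal A", "luminal A", "luminal B", "luminal B", "normal breast-like"],
    ["HER2", "HER2", "HER2", "HER2", "unclassifiable"],
]
perfect_pct, majority_pct = consensus_summary(ratings)
print(perfect_pct, majority_pct)
```

In this toy example one of four samples shows perfect agreement and three of four reach a majority consensus; the third sample, split two, two, and one, has neither.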
We first sought to define the reproducibility by different observers of the whole classification system according to Perou et al. (5), Sorlie et al. (11), Sorlie et al. (12), Hu et al. (13), and Parker et al. (14). None of the classification systems tested produced almost perfect agreement (free-marginal Kappa scores ≥ 0.81) among observers (Table 1, Supplementary Tables 8–10 available online).
For Perou et al. (5), four molecular subtypes were described: luminal, basal-like, HER2, and normal breast-like. The interobserver overall agreement for this classification system ranged from 70.8% to 76.1% of the samples according to the dataset analyzed, and the free-marginal Kappa scores ranged from 0.635 to 0.701 (ie, substantial agreement [Kappa scores ≥ 0.61] among observers in all datasets; Table 1; Figures 1, A, 2, A, and 3, A; Supplementary Figures 1–3, available online). Perfect interobserver agreement (five out of five observers) was found in 42.4% to 63.6% of samples. Importantly, interobserver disagreement in the classification system proposed by Perou et al. (5) was restricted to the assignment of luminal and normal breast-like subtypes (Table 2).
With the introduction of subdivisions of the luminal molecular subtype into luminal A, luminal B, and luminal C in Sorlie et al. (11), the overall agreement rates (51.5% to 64.1%) and the free-marginal Kappa scores were substantially reduced (0.435–0.582—only moderate agreement among observers; Mann–Whitney U test one-tailed P = .05) (Table 1, Figures 1, B, 2, B, and 3, B, and Supplementary Figures 4–6, available online). Perfect interobserver agreement was 17.5% to 46.1%, and a majority consensus (three or more observers agreeing on the classification of a given sample) was found in 79% (NKI-295 dataset) to 97.5% (TransBig dataset) of the samples.
Of the remaining classification systems, including subdivisions of luminal cancers into luminal A and luminal B (12–14) (Supplementary Figures 7–15, available online), none produced free-marginal Kappa scores of at least 0.61 in all datasets (Table 1). It should be noted, however, that the Hu et al. (13) classification system had better overall agreement and free-marginal Kappa scores than the other classification systems with five or more subtypes in the NKI-295 and TransBig datasets (11,12,14) (Table 1, Figures 1, D, 2, D, and 3, D, Supplementary Figures 10–12, available online). Importantly, a majority consensus for more than 95% of samples was reached in two or more datasets only when the Perou et al. (5) and Hu et al. (13) classifications were used, and not with Sorlie et al. (11), Sorlie et al. (12), or Parker et al. (14) (Table 1).
Analysis of the interobserver agreement for the identification of each molecular subtype separately revealed that basal-like cancers could be reproducibly identified by independent observers in all datasets regardless of the classification system used, with overall agreement rates consistently greater than 95%, free-marginal Kappa scores of at least 0.81, and a majority consensus greater than 90% in all datasets (Table 2 and Supplementary Table 11, available online).
HER2-positive cancers also consistently displayed free-marginal Kappa scores of at least 0.81 and overall agreement rates greater than 90%. A majority consensus was found in more than 95% of samples in the NKI-295 (22) and TransBig (23) datasets, regardless of the classification used. In the Wang dataset, however, a majority consensus was found in more than 90% of samples only when the Perou et al. (5) or Hu et al. (13) intrinsic gene lists were used (Table 2 and Supplementary Table 11, available online); with the other intrinsic gene lists, discordances were observed in the assignment of HER2-positive tumors as unclassifiable (Figure 2 and Supplementary Figures 2 and 14, available online).
The identification of luminal cancers and their subgroups and normal breast-like cancers failed to show acceptable levels of overall agreement or to consistently display free-marginal Kappa scores of at least 0.81 in all datasets (Table 2 and Supplementary Table 11, available online). When the Sorlie et al. (11) classification system comprising three categories of luminal cancers (ie, luminal A, B, and C) was used, a majority consensus for the classification of samples was found in 50% or fewer of luminal cancers [46.5% in NKI-295 (22), 50% in TransBig (23), and 48.8% in the Wang (24) dataset; Supplementary Table 11]. With the use of two categories of luminal cancers (ie, luminal A and luminal B) (12–14), a majority consensus for more than 50% of luminal cancers in all datasets was only found when hierarchical clustering using the intrinsic gene list of Hu et al. (13) was used. Notably, the interferon-rich subtype, only identified in the Hu et al. (13) intrinsic gene list, displayed almost perfect agreement levels among observers (overall agreement >97% and free-marginal Kappa scores ≥ 0.81; Table 2).
The results presented in this study provide direct evidence that the identification of subgroups of luminal cancers and normal breast-like cancers by visual inspection of dendrograms obtained from hierarchical cluster analysis shows suboptimal levels of interobserver agreement, even when the molecular subtypes are known a priori, and guidelines for the identification of these subtypes are provided. The identification of basal-like and HER2 cancers showed almost perfect interobserver agreement rates regardless of the intrinsic gene list used.
Microarray-based expression profiling analysis has led to a paradigm shift in the way breast cancer is perceived (9,10). Class discovery studies have demonstrated the existence of five main molecular subtypes, namely basal-like, HER2, luminal A, luminal B, and normal breast-like, but luminal C and interferon-regulated subtypes have also been described (5,9–14). These five main subtypes have been reported to have distinct clinical presentations (29), sites of relapse (30), histological features (31), responses to chemotherapy (14,32), and outcomes (10,11,13,15). Despite being derived from unsupervised approaches for class discovery, this molecular classification to some extent recapitulates the clinical subgroups of breast cancer identified in clinical practice. In fact, there is evidence to suggest a strong association between the molecular subtype classes (luminal A, luminal B, HER2, and basal-like) and the clinical categories of breast cancer (tamoxifen-sensitive estrogen receptor positive [ER+], tamoxifen-resistant ER+, trastuzumab-sensitive, and other) (9,10).
It has been argued that microarray expression profiling is the gold standard for breast cancer classification (7); however, several lines of evidence suggest that there are major limitations in our ability to assign samples consistently to specific molecular subtypes (10,15,33). We and others have recently demonstrated that SSPs fail to assign individual samples reproducibly into molecular subtypes (16,17). Here, we demonstrate that apart from basal-like and HER2 breast cancer subtypes, the interobserver reproducibility of breast cancer molecular subtype assignment using the methods and approaches originally used for this purpose is modest, in particular for the identification of luminal A, luminal B, and normal breast-like subtypes. None of the intrinsic gene lists concurrently provided almost perfect agreement (ie, free-marginal Kappa scores ≥0.81) for the luminal A, luminal B, and normal breast-like subtypes in any of the datasets. In comparison, for example, the interobserver agreement of ER and HER2 immunohistochemical staining of breast cancers using tissue microarrays has been reported to be high (Kappa scores ≥ 0.81) (34–37). Furthermore, the Kappa scores observed in this study overlap with those observed in analyses of interobserver agreement of histological grade (Kappa scores ranging from 0.43 to 0.83) (38). It should be noted that similar Kappa scores have been considered by many as inadequate and as evidence for the subjective nature of histological grade (39,40).
It is plausible that the limited interobserver agreement for the subclassification of luminal cancers may stem from attempting to identify distinct groups within a continuum (41). The distinction between luminal A and luminal B tumors is reported to be principally driven by the expression of proliferation-related genes (7,13); however, several studies have recently demonstrated that proliferation in ER+ breast cancers is a continuum rather than a bimodal distribution (10,41,42). Therefore, allocation of specific subgroups (eg, luminal A and B) by hierarchical cluster analysis is likely to be arbitrary and to depend on the population of samples subjected to the analysis (15,43), which may explain why the luminal B cluster was identified in the ER+ arm of the cluster dendrogram in three studies (11,13,14) and in the ER− arm in one study (12).
The lack of agreement in the identification of normal breast-like tumors should perhaps not come as a surprise, given that these tumors may constitute an artifact of gene expression profiling analysis [ie, analysis of tumor specimens with a disproportionately high percentage of normal tissue “contamination” (7,13,14)]. The normal breast-like gene cluster in the heatmaps has been either represented by only a few genes [ie, Sorlie et al. (12)] or not even specified [ie, Hu et al. (13)]. Moreover, this gene cluster was composed of different genes in studies in which it was reported (5,11,12). Notably, normal breast-like tumors have been reported in both the ER− (5,11,12) and the ER+ branch of the cluster dendrogram (13,14), which may have contributed to the poor reproducibility of the normal breast-like subtype assignment among different observers in this study.
Hierarchical cluster analysis was the method of choice for the development of the current working model of microarray-based breast cancer taxonomy (5,11–14) and of the SSPs, which can be applied prospectively to single samples for molecular subtype assignment. However, previous studies (15,19) (and references therein) have demonstrated that hierarchical cluster analysis has several limitations for the identification of subtypes of breast cancer, that is, relevant features and distance measures have to be selected a priori, the actual number of clusters is unknown, and clusters are always generated even when random data are used (21). Therefore, the emergence of “clusters” does not necessarily equate with biological significance. The interpretation of dendrograms resulting from the analysis of breast cancers is by no means trivial, as illustrated here (Figures 1–3; Supplementary Figures 1–15, available online); however, it becomes even more complex when different dendrograms are cut at different levels and different methods and approaches are used. In fact, in different studies, dendrograms obtained from hierarchical clustering using different intrinsic gene lists were cut at different levels (5,11,12,14), and sometimes molecular subtypes were defined in the same dendrogram by cutting the branches at different levels [eg, molecular subtype assignments described in Sorlie et al. (11)]. In the most recent publication by Parker et al. (14), the cluster dendrogram was analyzed using “SigClust” (44), a tool for assessing statistical significance of a cluster. This method was used to identify “prototypic tumor samples” from each of the molecular subtypes, which were used to derive a minimized gene set for the development of a 50-gene set for quantitative reverse transcription polymerase chain reaction for sample subtype prediction (PAM50). 
Conceivably, this method might lead to a more consistent assignment of clusters; it should be noted, however, that three different SSPs, two of which were generated with subtypes initially identified without the use of SigClust, showed equivalent associations with outcome in three distinct datasets (16).
The interpretation of the clusters identified by Perou et al. (5) was based on the relationship between the genes over- or underexpressed in samples classified into each cluster, and clinical and biological characteristics of breast cancers that were already known. Surprisingly, some of the genes that defined the initial subtypes were not present in subsequent versions of the intrinsic gene lists [eg, keratin 8/18 (KRT8/18), one of the defining features of luminal cancers, is not present in Sorlie et al. (12); keratin 17 (KRT17), which together with keratin 5 (KRT5) is a defining feature of basal-like cancers, is not present in Hu et al. (13), although KRT5 is; and aquaporin 7 (AQP7), integrin alpha 7 (ITGA7), and thrombospondin receptor (CD36), as well as aldo-keto reductase family 1, member C1 (AKR1C1) and phosphoinositide-3-kinase, regulatory subunit 1(α) (PIK3R1), genes pertaining to the normal breast-like cluster in Sorlie et al. (11) and Sorlie et al. (12), respectively, are not present in Hu et al. (13)]. Another important limitation of hierarchical cluster analysis, as elegantly illustrated by Pusztai et al. (15), is the lack of stability of some of the subgroups identified. Although there are algorithms to determine the stability of clusters generated by hierarchical clustering (21), they do not provide an assessment of interobserver variability in molecular subtype assignment via inspection of the dendrograms. In this study, we systematically analyzed the ability of experienced observers to identify the molecular subtypes through the analysis of dendrograms and demonstrated that even when clear guidelines are provided, the assignment of samples is subjective and not entirely reproducible.
Several groups, including ours, have previously attempted to define the molecular subtypes in breast cancer datasets using hierarchical cluster analysis (30,45–49). Given the lack of interobserver agreement and stability of some of the molecular subtypes, as discussed above, our findings indicate that breast cancer molecular subtype classifications performed by other investigators may not have necessarily reproduced those originally described and that molecular subtypes identified by the same intrinsic gene list in different cohorts analyzed by different observers are not necessarily equivalent.
This study had several limitations. First, the datasets used were retrospectively accrued; hence, they may not have a balanced distribution of the different molecular subtypes and do not include samples of normal breast tissue. Second, we used the publicly available processed microarray data for the hierarchical cluster analyses; no attempts to renormalize the data were made. Third, although we endeavored to reannotate all genes from each gene list, a small proportion of genes from the intrinsic gene lists could not be annotated in all datasets owing to different microarray platforms and changes in gene annotation (see Supplementary Methods, available online). Fourth, the datasets included in this study derive from different microarray platforms. Fifth, one cannot rule out that if five observers from the groups of the proponents of the breast cancer taxonomy (5,11–13,36) were asked to assign the molecular subtypes based on the visual inspection of dendrograms and gene clusters, a better interobserver agreement would be found. Finally, this study focused on the interobserver reproducibility of the assignment of molecular subtypes by inspection of dendrograms and gene clusters; we have not investigated whether bioinformatic methods to define the statistical robustness of clusters [eg, SigClust (44), Pvclust (50), or R and D indices (21)] would increase the reproducibility.
It is beyond the scope of this work to evaluate algorithms for cluster analysis or the choices of distance metrics and linkage. Instead, we have focused on the human-dependent component of class discovery analysis and demonstrated that this subjective component leads to substantial variability in cluster assignment. Moreover, a direct comparison between the molecular subtypes identified by distinct intrinsic gene lists applied to the same datasets [eg, Perou et al. (5) vs Parker et al. (14), Hu et al. (13) vs Sorlie et al. (11)] would provide an inflated rate of disagreement, because different numbers of subclasses were reported in each classification, and there is no gold standard for each of the intrinsic gene lists. In fact, the reported agreement between the Sorlie et al. (12) and Hu et al. (13) intrinsic gene lists, when applied to the same dataset, was 78% when the samples classified as the interferon-rich subtype were excluded (13).
The subjectivity and modest reproducibility of the interpretation of histopathologic features and immunohistochemical stainings have been heavily criticized, and the need for more objective methods to guide the breast cancer patient in decision making is clearly justified. Hierarchical cluster analysis is undoubtedly a powerful tool for class discovery and a useful first step for the development of a molecular classification. However, hierarchical clustering may not be an ideal choice as a method for breast cancer classification because it is not entirely objective, nor are its results entirely reproducible. In fact, current molecular classification systems for breast cancer are, similarly to histopathology, descriptive and prognostic (10,16). Based on the available data and the limitations of our knowledge on the heterogeneity of breast cancers, it is still not possible to determine with absolute certainty how many molecular subtype classes exist (15). Hence, with the increasingly coherent information about the genetic (51) and transcriptomic features of breast cancer (9,10,52), mechanisms of action of chemotherapy agents, and availability of humanized monoclonal antibodies and small molecule inhibitors that target specific molecular pathways and networks, perhaps molecular classification should be designed to provide more direct functional information for clinicians to facilitate the treatment of breast cancer patients. For example, a certain breast cancer subtype could be classified based on the presence or absence of overexpressed or mutated genes that may serve as predictive biomarkers for specific therapeutic agents (9,10,53).
The development of such a taxonomy is likely to require integrative approaches combining descriptive analysis of genomic, transcriptomic, and proteomic data from sufficiently statistically powered cohorts (54) with multidimensional data from global functional analyses of large panels of cancer models (eg, genome-wide RNA interference screens and chemical screens) (9,10,51,53,55,56).
In conclusion, we demonstrate that assignment of molecular subtypes of breast cancer based on the visual inspection of dendrograms obtained with hierarchical cluster analysis is subjective and shows modest interobserver reproducibility, in particular when subclassification of luminal cancers into two or more groups is required. These results suggest that class discovery studies, in which subtypes are identified by inspection of dendrograms [eg, (5,11,12)], need to take into account both the stability of the clusters and the reproducibility of the classification system (16,41). This is of paramount importance, as SSPs used for the prospective classification of breast cancer patients into specific molecular subtypes have been derived from the analysis of subtypes originally identified by hierarchical clustering methods (12–14). For the incorporation of the molecular taxonomy into clinical trials, routine clinical practice, and treatment decision making, stringent standardization of methodologies for the identification of breast cancer molecular subtypes and objective definitions for each molecular subtype are of utmost importance.
This work was supported by Breakthrough Breast Cancer. B.W. is funded by a Cancer Research UK postdoctoral fellowship. We also acknowledge NHS funding to the NIHR Biomedical Research Centre. Jorge S. Reis-Filho is the recipient of the 2010 Cancer Research UK Future Leaders Prize.
A. Mackay and B. Weigelt contributed equally to this study. The authors have no conflict of interest to declare. The funders did not have any involvement in the design of the study; the collection, analysis, and interpretation of the data; the writing of the manuscript; or the decision to submit the manuscript for publication.