Single sample predictors (SSPs) and Subtype classification models (SCMs) are gene expression–based classifiers used to identify the four primary molecular subtypes of breast cancer (basal-like, HER2-enriched, luminal A, and luminal B). SSPs use hierarchical clustering, followed by nearest centroid classification, based on large sets of tumor-intrinsic genes. SCMs use a mixture of Gaussian distributions based on sets of genes with expression specifically correlated with three key breast cancer genes (estrogen receptor [ER], HER2, and aurora kinase A [AURKA]). The aim of this study was to compare the robustness, classification concordance, and prognostic value of these classifiers with those of a simplified three-gene SCM in a large compendium of microarray datasets.
Thirty-six publicly available breast cancer datasets (n = 5715) were subjected to molecular subtyping using five published classifiers (three SSPs and two SCMs) and SCMGENE, the new three-gene (ER, HER2, and AURKA) SCM. We used the prediction strength statistic to estimate robustness of the classification models, defined as the capacity of a classifier to assign the same tumors to the same subtypes independently of the dataset used to fit it. We used Cohen κ and Cramer V coefficients to assess concordance between the subtype classifiers and association with clinical variables, respectively. We used Kaplan–Meier survival curves and cross-validated partial likelihood to compare prognostic value of the resulting classifications. All statistical tests were two-sided.
SCMs were statistically significantly more robust than SSPs, with SCMGENE being the most robust because of its simplicity. SCMGENE was statistically significantly concordant with published SCMs (κ = 0.65–0.70) and SSPs (κ = 0.34–0.59), statistically significantly associated with ER (V = 0.64), HER2 (V = 0.52) status, and histological grade (V = 0.55), and yielded similar strong prognostic value.
Our results suggest that adequate classification of the major and clinically relevant molecular subtypes of breast cancer can be robustly achieved with quantitative measurements of three key genes.
Single sample predictors (SSPs) are molecular classification models that use large sets of genes expressed in different tumors to classify different subtypes of breast cancer. Subtype classification models (SCMs) are based on groups of genes specifically correlated with three key breast cancer genes, estrogen receptor (ER), HER2, and aurora kinase A (AURKA). Both types of models use large numbers of genes. However, the robustness and prognostic value of these classifiers have not been compared with simplified models containing fewer genes.
A simplified SCM (SCMGENE) containing only ER, HER2, and AURKA was compared with three SSPs and two SCMs using data from 36 gene expression datasets in public databases. The models were compared with respect to concordance among themselves as well as association with clinical variables and disease-free survival.
SCMGENE, with only three genes, was statistically significantly more robust than the SSPs. Compared with the published SCMs that use large numbers of genes, it was as robust and yielded similar prognostic value.
Adding more genes to a classification model may not improve the ability to discriminate among breast cancer subtypes. In addition, the complexity of multiple-gene classification models may limit their usefulness and translation into clinic.
The datasets used were retrospectively accrued; therefore, the selection of patients may have resulted in unbalanced distribution of the different molecular subtypes. The gene expression datasets taken from public databases and websites were not renormalized. Software limitations precluded checking or correction for departure from proportional hazards assumptions.
From the Editors
Microarray-based expression studies have demonstrated that breast cancer is both a clinically diverse and molecularly heterogeneous disease comprising subtypes with distinct gene expression patterns that are associated with outcome (1–8). The relevance of these subtypes for basic and translational research has led to their use in prognostic assessments (3,9), prediction of therapeutic efficacy (10), and retrospective analysis of clinical trials (11). Independent of subtype analysis, several research groups have developed prognostic gene signatures by analyzing gene expression together with clinical outcome data [see (12) for a review]; Mammaprint (13,14), Oncotype Dx (15), and the Gene expression Grade Index (GGI) (16) are currently commercially available. Although the data and statistical methods used to develop these prognostic gene signatures differ from those used for breast cancer molecular subtyping, we and others reported a high rate of concordance between the predicted risk classifications and subtypes (1,8,17), shedding new light on the common biological processes relevant for predicting outcome in breast cancer.
In their seminal work, Perou et al. (4) identified four breast tumor subtypes: the basal-like, the HER2-enriched, the luminal (often differentiated into two or three subgroups), and the normal-like tumors. These molecular subtypes were identified by selecting a large set of “intrinsic” genes (those showing little variance within repeated samplings of the same tumor but with high variance across tumors) and then using hierarchical clustering to separate patients into transcriptionally distinct groups (4). However, because only samples in large retrospective studies could be classified by this method, the authors developed the Single Sample Predictor (SSP; Figure 1, A), which identifies the subtype of a single tumor using a nearest centroid classifier (2,3,6). This initial SSP [SSP2003; Figure 1, C; (6)] was further refined through iterations of the intrinsic gene list, resulting in two later SSPs [SSP2006 and PAM50; Figure 1, C; (2,3)]. These SSPs have been applied to gene expression data generated from different cohorts of breast cancer patients and with different microarray technologies (2,17).
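The nearest centroid step described above can be illustrated with a short Python sketch. It assigns one expression profile to the subtype whose centroid is most correlated with it. The centroid values and the four-gene layout are hypothetical toy data, and Pearson correlation is assumed as the similarity measure; the published SSPs define their own centroids and may use a different measure.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def nearest_centroid(profile, centroids):
    """Return (subtype, correlation) for the centroid most similar to profile.

    centroids maps a subtype name to its centroid, with genes in the same
    order as in profile.
    """
    scores = {name: pearson(profile, c) for name, c in centroids.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical centroids over four genes (illustration only).
toy_centroids = {
    "basal-like": [1.0, -1.0, -1.0, 1.0],
    "luminal A": [-1.0, 1.0, 1.0, -1.0],
}
subtype, score = nearest_centroid([0.9, -0.8, -1.1, 0.7], toy_centroids)
print(subtype, round(score, 3))
```

Because the rule needs only the stored centroids, a single new tumor can be classified without re-running the hierarchical clustering.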
However, all SSPs have severe limitations. Pusztai et al. (18) showed that small changes in the initial set of breast tumors can have a dramatic impact on the hierarchical clustering used in defining the initial subgroups for the SSPs, raising questions about the stability of the method (18,19). Kapp et al. (20) challenged the use of hundreds of intrinsic genes and showed that only genes related to estrogen receptor (ER) and HER2 phenotypes lead to a stable identification of three main subtypes: ER−/HER2− (basal-like tumors), HER2+ (HER2-enriched), and ER+/HER2− (combined luminal A and B tumors) (20). Weigelt et al. (21) reported that the SSPs were only moderately concordant, suggesting that subtype classifications depend on the list of intrinsic genes. Recently, Mackay et al. (22) highlighted the lack of interobserver agreement for manually identifying subtypes from dendrograms estimated by hierarchical clustering.
To address these issues, we developed an alternative classification approach, the Subtype Classification Model (SCM; Figure 1, B). In contrast to SSPs, SCMs are based on a mixture of three Gaussian distributions in a two-dimensional space defined by the ER and HER2 gene modules, with a proliferation (aurora kinase A [AURKA]) module providing discrimination between low and high proliferative tumors (1,8). These modules are composed of genes whose expression is specifically correlated with their prototype gene—ER, HER2, or AURKA (1,8). Two versions of gene lists representing these modules have been published [Figure 1, D; (1,8)]. SCMs have been applied in datasets using different microarray platforms and normalization methods (1,8,9).
Although establishing breast cancer molecular subtypes has had a substantial impact on the way clinicians perceive the disease, we know surprisingly little about the reproducibility of the various classification algorithms (19) because of the intrinsic nature of subtype identification where the true classification remains unknown, rendering the validation of the corresponding classifiers difficult. Weigelt et al. (21) recently estimated the agreement of the three published SSPs in four public datasets and showed that these classifiers are only moderately concordant.
Here, we address the issue of reproducibility, comparing SSPs and SCMs to assess their robustness (defined as the capacity of a classifier to assign the same tumors to the same subtypes independently of the dataset used to fit it), their classification concordance, and their prognostic value. To do this, we first developed SCMGENE, a simplified version of the SCM using only the three key genes (ER, HER2, and AURKA) known to be the main discriminators of clinical and molecular breast cancer subtypes (1,8,20). We then used a large compendium of gene expression and clinical data from 5715 breast cancer patients to assess the relative performance of six subtype classifiers: the three published SSPs, the two published SCMs, and SCMGENE.
All analyses have been performed using R version 2.13.1 (http://www.r-project.org/). To ensure full reproducibility of our results, software and data are available at http://compbio.dfci.harvard.edu/pubs/sbtpaper/. Differences with P values less than .05 were considered statistically significant. All statistical tests were two-sided.
Thirty-six datasets of gene expression profiles from 5715 tumors were retrieved from public databases or authors’ websites (Table 1); these include 676 ER-positive (ER+ as defined by immunohistochemistry [IHC]) breast tumors from tamoxifen-treated patients. We used normalized log2(intensity) for single-channel platforms and log2(ratio) for dual-channel platforms. Hybridization probes were mapped to Entrez GeneID as described in Shi et al. (58) using RefSeq and Entrez whenever possible; otherwise mapping was performed using IDconverter [http://idconverter.bioinfo.cnio.es/; (59)]. When multiple probes mapped to the same GeneID, we used the one with the highest variance in the dataset under study. To facilitate comparison between datasets, we applied a robust linear scaling to each gene or module score, in which the 2.5% and 97.5% expression quantiles were set to −1 and +1, respectively. This procedure was particularly efficient in datasets with skewed populations of patients (such as those with different proportions of ER−/+ or HER2−/+ tumors), because only a few extreme cases (5%) were needed to perform the robust scaling, without relying on outliers. This scaling improved the consistency between classifiers in datasets using different microarray technologies and normalization procedures (60,61).
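The robust linear scaling can be sketched in a few lines of Python. This is a minimal illustration, not the genefu implementation: the function names are hypothetical, and the linearly interpolated quantile is an assumption, since the text does not state which quantile definition was used.

```python
def quantile(sorted_values, q):
    """Linearly interpolated quantile of an already sorted list (0 <= q <= 1)."""
    if not sorted_values:
        raise ValueError("need at least one value")
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction


def robust_rescale(values, low_q=0.025, high_q=0.975):
    """Linearly map the low_q quantile to -1 and the high_q quantile to +1.

    Values beyond the two quantiles are not clipped; they simply fall
    outside the interval [-1, +1].
    """
    ordered = sorted(values)
    q_low = quantile(ordered, low_q)
    q_high = quantile(ordered, high_q)
    if q_high == q_low:
        raise ValueError("quantiles coincide; cannot rescale")
    return [2 * (v - q_low) / (q_high - q_low) - 1 for v in values]


# Toy expression values for one gene across 41 samples (illustration only).
toy_expression = [float(v) for v in range(41)]
scaled = robust_rescale(toy_expression)
print(min(scaled), max(scaled))
```

Anchoring the scale on the 2.5% and 97.5% quantiles rather than on the minimum and maximum keeps a single extreme sample from compressing the range of all the others.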
We also collected the publicly available clinical and demographic information for our compendium of datasets (Table 2).
Classification gene lists were manually transcribed from original publications (Figure 1, C and D) and submitted to GeneSigDB (62). The SSP models were SSP2003 with 500 genes (6), SSP2006 with 306 genes (2), and PAM50 with 50 genes (3). The SCM models were SCMOD1 with 726 genes (1), SCMOD2 with 663 genes (8), and SCMGENE from this study. The SSP and SCM algorithms were implemented as described in the original publications and adapted for scaled data; the source code and documentation are available in the genefu R/Bioconductor package version 1.3.4 [http://www.bioconductor.org/packages/release/bioc/html/genefu.html; (63)]; the methodology underlying the SCMs and the corresponding R code are further detailed in Supplementary Methods, Parts 1 and 2, respectively (available online). Percent of genes used in the classifiers that are actually mapped in each dataset is reported in Supplementary Table 1 (available online). Throughout, we use the Perou et al. (4–6) subtype nomenclature (Figure 1, A): basal-like, HER2-enriched, and luminal A/B, which correspond, respectively, to the ER−/HER2−, HER2+, and ER+/HER2− low/high proliferation tumors of Sotiriou et al. (1,8) (Figure 1, B).
In this study, we developed SCMGENE by selecting the genes ER, HER2, and AURKA, which have been used as prototypes in the ER, HER2 signaling, and proliferation gene modules published in Desmedt et al. (1) and Wirapati et al. (8). We used the Affymetrix probesets published in Desmedt et al. (1), that is, 205225_at, 216836_s_at, and 208079_s_at, representing ER, HER2, and AURKA, respectively (see “scmgene” object in the genefu package).
To assess robustness, we used the “prediction strength” statistic (64) (Supplementary Methods Part 1, available online), as implemented in the genefu package. First, using each classifier, all samples were assigned “true” subtype labels for that classifier in each dataset separately. The data were then split into training and test sets; the classification model fitted on the training set was applied to the test sets (“predicted” labels) and compared with the true classifications. The prediction strength quantifies the similarity between the true and predicted classifications in each dataset. Values range from 0 (low similarity) to 1 (high similarity), and a prediction strength of at least 0.8 indicates a robust classification (64). Statistical comparison of classifier robustness was performed using the two-sided Wilcoxon signed rank test (65); P values were two-sided and Holm corrected for multiple testing (66).
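As an illustration of the comparison between "true" and "predicted" labels, the Python sketch below computes a prediction strength in the pairwise form of Tibshirani and Walther: for each "true" subtype, the fraction of sample pairs that the training-set model also places in the same group, with the minimum taken over subtypes. The function name and the handling of subtypes with a single sample are assumptions of this sketch; the genefu implementation may differ in such details.

```python
from collections import defaultdict
from itertools import combinations


def prediction_strength(true_labels, predicted_labels):
    """Minimum over 'true' subtypes of the co-membership rate under the
    'predicted' labels.

    true_labels      -- labels from the model fitted on the dataset itself
    predicted_labels -- labels from the model fitted on the training set
    Label names need not match between the two vectors; only co-membership
    of sample pairs is compared.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label vectors must have the same length")
    members = defaultdict(list)
    for index, label in enumerate(true_labels):
        members[label].append(index)
    strengths = []
    for indices in members.values():
        if len(indices) < 2:
            continue  # assumption: a singleton subtype has no pairs and is skipped
        pairs = list(combinations(indices, 2))
        together = sum(predicted_labels[i] == predicted_labels[j] for i, j in pairs)
        strengths.append(together / len(pairs))
    if not strengths:
        raise ValueError("no subtype contains at least two samples")
    return min(strengths)


# Toy labels (illustration only): one basal-like tumor is reassigned.
true = ["basal", "basal", "basal", "lum", "lum", "lum", "lum"]
pred = ["basal", "basal", "lum", "lum", "lum", "lum", "lum"]
print(prediction_strength(true, pred))
```

Taking the minimum rather than the mean means that one poorly reproduced subtype is enough to lower the score for the whole classification.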
Because there is no clear consensus about the number of breast cancer molecular subtypes, we analyzed robustness of classifiers for assignment to either three or four subtype groups (20,21). Robustness of SSPs was computed for three to five subtypes by selecting the main clusters, which contain at least five tumors, as defined by the dendrogram built using hierarchical clustering (correlation distance and average linkage). Robustness of SCMs was computed for three subtypes by fitting a mixture of three Gaussian distributions (equal variance and shape, see Supplementary Methods Part 1, available online) and for four subtypes by further estimating a cutoff for proliferation to discriminate between low and high proliferative tumors. Note that, in contrast to SSPs, SCMs are limited to the identification of three and four subtypes by construction [(1,8); see Supplementary Methods Part 1, available online].
To assess the concordance of subtype classifications with published prognostic gene signatures, we implemented the original algorithms of the MammaPrint (MAMMAPRINT) (14) and Oncotype DX (ONCOTYPE) (15) gene signatures and of the Gene expression Grade Index (GGI) (16) in our compendium of microarray datasets, similarly to Fan et al. (17). The corresponding source code and documentation are available in the genefu package. The resulting risk predictions were labeled low, intermediate, and high risk to reflect the prognosis of the patients. The percentage of genes in the signatures that are mapped in each dataset is reported in Supplementary Table 1 (available online).
We used color bars to represent the concordance of subtype classifications and prognostic gene signatures; in this representation, subtypes and risk groups are represented by unique colors. Tumors were ordered according to the probabilities estimated by a subtype classifier, such as SCMGENE or PAM50. To quantitatively assess concordance of subtype classifications and prognostic gene signatures, we used Cohen Kappa coefficient (κ) (67), as implemented in the R package epibasix version 1.1 (http://cran.r-project.org/web/packages/epibasix/); κ ranges from 0 to 1, with 0 indicating no relation and 1 indicating a perfect concordance. Qualitative descriptions are typically associated with intervals of κ [κ ≤ 0.20, slight agreement; 0.20 < κ ≤ 0.40, fair agreement; 0.40 < κ ≤ 0.60, moderate agreement; 0.60 < κ ≤ 0.80, substantial agreement; and 0.80 < κ ≤ 0.99, almost perfect agreement, as described in Mackay et al. (22)]. To assess the association between subtype classifications and clinical parameters, we used Cramer V statistic (68) as implemented in the R package vcd version 1.2-11 (http://cran.r-project.org/web/packages/vcd/). For comparison, we used the same intervals and descriptions used with κ. Statistical significance of the concordance and association was calculated using the χ2 test.
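For readers unfamiliar with the two coefficients, the following Python sketch computes the unweighted Cohen κ and the Cramer V from two label vectors. It is illustrative only and does not reproduce the epibasix or vcd code; the function names and toy labels are hypothetical, and V is computed without any bias correction.

```python
from collections import Counter
from math import sqrt


def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen kappa: agreement beyond what chance would produce."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("need two non-empty label vectors of equal length")
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


def cramer_v(labels_a, labels_b):
    """Cramer V derived from the chi-square statistic of the cross-table."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("need two non-empty label vectors of equal length")
    joint = Counter(zip(labels_a, labels_b))
    rows = Counter(labels_a)
    cols = Counter(labels_b)
    chi2 = 0.0
    for r in rows:
        for c in cols:
            expected = rows[r] * cols[c] / n
            chi2 += (joint[(r, c)] - expected) ** 2 / expected
    k = min(len(rows), len(cols)) - 1
    if k == 0:
        raise ValueError("V is undefined when a variable has a single level")
    return sqrt(chi2 / (n * k))


# Toy example (illustration only): two classifiers and a clinical marker.
classifier_1 = ["basal", "basal", "lumA", "lumA", "her2", "her2"]
classifier_2 = ["basal", "basal", "lumA", "her2", "her2", "her2"]
er_status = ["neg", "neg", "pos", "pos", "neg", "neg"]
print(cohen_kappa(classifier_1, classifier_2))
print(cramer_v(classifier_1, er_status))
```

κ requires both vectors to share the same label set, which suits classifier-versus-classifier comparisons, whereas V only needs a cross-table and so also applies to subtype-versus-clinical-variable comparisons.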
Subtypes were considered a categorical variable, with no assumption made on order across subtypes. Risk predictions as computed by the prognostic gene signatures were considered ordered categorical variables [low, intermediate, and high-risk groups as defined in the original publications, which are (14) for MAMMAPRINT, (15) for ONCOTYPE, and (16) for GGI, respectively]. Disease-free survival curves (distant metastasis–free survival whenever available, relapse-free survival otherwise) were estimated using the Kaplan–Meier estimator and compared using the two-sided log-rank test as implemented in the R package survival version 2.36-9 (http://cran.r-project.org/web/packages/survival/). To statistically compare the prognostic value of competing risk prediction models, such as subtype classifiers or published gene signatures, we used a two-sided Wilcoxon signed rank test comparing 10-fold cross-validated partial likelihood (CVPL) (69) as implemented in the R/Bioconductor survcomp package, version 1.3.6 [http://www.bioconductor.org/packages/release/bioc/html/survcomp.html; (70)]; the lower the estimate, the better the prognostic value.
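The Kaplan–Meier estimator behind the survival curves and the 5-year survival probabilities can be sketched as follows. This Python version is a minimal illustration with hypothetical function names and toy follow-up data; it is not the code of the R survival package and it omits confidence intervals and the log-rank test.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    times  -- follow-up time of each patient
    events -- 1 if the event (for example, relapse) was observed, 0 if censored
    Patients censored at time t are counted as at risk at t.
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        leaving = sum(1 for ti in times if ti == t)
        observed = sum(1 for ti, ei in zip(times, events) if ti == t and ei)
        if observed:
            survival *= 1 - observed / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


def survival_at(curve, horizon):
    """Step-function value of the curve at horizon (1.0 before the first event)."""
    probability = 1.0
    for t, s in curve:
        if t > horizon:
            break
        probability = s
    return probability


# Toy follow-up in years (illustration only).
toy_times = [1, 2, 3, 4, 5]
toy_events = [1, 0, 1, 0, 1]
toy_curve = kaplan_meier(toy_times, toy_events)
print(survival_at(toy_curve, 4))
```

Estimating one such curve per subtype and reading it at 5 years gives the kind of survival probability quoted for each group.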
In breast cancer classification, there have been two general methodological approaches to developing subtype classifiers (Figure 1). SSPs use hierarchical clustering to identify the main breast cancer subtypes from gene expression data, and a nearest centroid classifier is subsequently built to enable subtyping of a single tumor sample (Figure 1, A). Three versions of the SSPs and their corresponding centroids have been published so far and implemented in our study, SSP2003 (6), SSP2006 (2), and PAM50 [(3); Figure 1, C]. SCMs represent an alternative approach based on a mixture of three Gaussians to represent the main breast cancer molecular subtypes, which are the basal-like, HER2-enriched, and luminal tumors (Figure 1, B); the median of the AURKA module score within the luminal tumors was used to discriminate between the low and high proliferative luminal A and B tumors (31). Two versions of the SCMs have been published recently, SCMOD1 (1) and SCMOD2 (8) (Figure 1, D). Given the statistical and clinical challenges in implementing multigene classifiers, we wanted to explore whether we could simplify SCM-based classification to the smallest possible number of genes. Therefore, we developed SCMGENE, an SCM-based classifier reduced to its simplest form, which uses only the expression of the three key and most representative genes of breast cancer biology, ER, HER2, and AURKA. SCM-based classifiers were trained in the largest gene expression dataset, EXPO [EXpression Project for Oncology; dataset consisting of 353 primary breast tumors collected by the International Genomic Consortium, http://www.intgen.org/expo/; see Table 1].
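The median-based split of luminal tumors into low and high proliferative groups can be written as a short Python sketch. It assumes that a three-group assignment (basal-like, HER2-enriched, luminal) has already been made and that an AURKA module score is available for each tumor; the function name, the label strings, and the assignment of ties at the median to luminal A are assumptions of this illustration.

```python
from statistics import median


def split_luminal(subtypes, aurka_scores, luminal_label="luminal"):
    """Relabel luminal tumors as luminal A (low) or luminal B (high
    proliferation) using the median AURKA score of the luminal tumors."""
    if len(subtypes) != len(aurka_scores):
        raise ValueError("subtypes and scores must have the same length")
    luminal_scores = [
        score for label, score in zip(subtypes, aurka_scores)
        if label == luminal_label
    ]
    if not luminal_scores:
        return list(subtypes)  # nothing to split
    cutoff = median(luminal_scores)
    refined = []
    for label, score in zip(subtypes, aurka_scores):
        if label != luminal_label:
            refined.append(label)
        elif score > cutoff:
            refined.append("luminal B")
        else:
            refined.append("luminal A")  # assumption: ties go to luminal A
    return refined


# Toy data (illustration only).
toy_subtypes = ["basal-like", "luminal", "luminal", "luminal", "luminal", "HER2-enriched"]
toy_aurka = [0.9, -0.5, 0.2, 0.6, -0.1, 0.4]
print(split_luminal(toy_subtypes, toy_aurka))
```

Because the cutoff is the median of the luminal tumors in the dataset at hand, roughly half of them are labeled luminal B by construction.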
Validating subtype classification is difficult because the true subtypes are unknown. Tibshirani and Walther (64) developed a statistic, called the prediction strength, to assess the robustness of a classifier, defined as the capacity of a classification model to assign the same tumors to the same subtypes independently of the dataset used to fit the model. The rationale is that if a classification model is strongly dependent on the training dataset, then it is likely to be unreliable (see “Methods and Statistical Analysis” and Supplementary Methods Part 3, available online). Prediction strength ranges from 0 to 1, and a value of at least 0.8 is characteristic of a robust classifier. We compared the robustness of the classification algorithms underlying the three SSPs and the three SCMs (Figure 1, C and D, respectively) in our large compendium of breast cancer datasets (Table 1). To do this, we fitted each model on a training set (EXPO, Table 1) and estimated their prediction strength on the remaining 32 independent (test) datasets.
Because there is no clear consensus as to the number of breast cancer subtypes (20,21), we analyzed robustness for all six classifiers for identifying both three and four subtypes. For the SSPs, we also considered five subtypes. SCMs are limited to three or four by construction (see Supplementary Methods, Part 1, available online). SCMs yielded a median prediction strength of at least 0.8 for three and four subtypes (Figure 2, A and B, respectively, Supplementary Tables 2 and 3, available online). SSPs yielded lower prediction strength for three subtypes (median prediction strength, 0.45–0.59), and their robustness dramatically decreased with increasing number of subtypes (Figure 2, Supplementary Tables 2–4, available online). SCMs were statistically significantly more robust than SSPs for the identification of three and four subtypes (two-sided Wilcoxon signed rank tests, Holm's corrected P < .005; median differences, confidence intervals, and P values, Supplementary Table 5, available online), although we observed only a trend to significance for the higher robustness of SCMOD2 compared with PAM50 for three subtypes [median prediction strength of 0.82 vs 0.59 for SCMOD2 and PAM50, respectively; two-sided Wilcoxon signed rank test, median difference = 0.13, 95% confidence interval = −0.03 to 0.26; Holm's corrected P = .078]. SCMGENE yielded the best median prediction strength among SCMs, although its superiority over SCMOD2 and SCMOD1 was not statistically significant.
We used the published SSPs and SCMs (Figure 1, C and D) to assign molecular subtypes to each of the 5715 breast tumor samples in our compendium of datasets (Tables 1 and 2). As reported previously (71–73), luminal A/B tumors were the most frequently observed subtype (56%–63%), followed by the basal-like (19%–27%) and HER2-enriched (13%–15%) subtypes. SSPs identified a small percentage of normal-like tumors (11%, 8%, and 5% for SSP2003, SSP2006, and PAM50, respectively). Note that the two oldest SSPs, SSP2003 and SSP2006, identified substantially different percentages of luminal A (43% and 41%, respectively) and luminal B (15%) tumors compared with PAM50 and the SCMs (26%–31% and 31%–36% of luminal A and B tumors, respectively).
We then assessed the concordance between classifiers, quantitatively (using Cohen’s Kappa coefficient, κ, and calculating agreement between classifiers, as measured by the proportion of identical classifications), qualitatively [slight, fair, moderate, substantial, and almost perfect concordance based on ranges of κ (22,74), see “Methods and Statistical Analysis”; Table 3], and graphically using color bars (Figure 3). All models proved to be statistically significantly concordant (fair concordance, κ > 0.34, Holm's corrected P < .001 with 49%–86% agreement). SSPs were globally less concordant with each other than SCMs (fair to moderate concordance for SSPs: κ = 0.45–0.58 vs substantial to almost perfect concordance for SCMs: κ = 0.65–0.81; agreement, SSPs: 58%–68% vs SCMs: 75%–86%). However, the most recently developed SSP, that is PAM50, had moderate to substantial concordance with the SCMs (κ = 0.59–0.68, 70%–76% agreement). SCMGENE was statistically significantly concordant with published SCMs (κ = 0.65–0.70) and SSPs (κ = 0.34–0.59), statistically significantly associated with ER (V = 0.64), HER2 (V = 0.52) status, and histological grade (V = 0.55).
For overall concordance (Figure 3), SCMGENE was used as the reference classifier (Supplementary Figure 1, available online, shows similar results for PAM50 as the reference). The basal-like subtype was the most consistently assigned subtype by all classifiers (basal-like vs the rest, median κ = 0.78). The HER2-enriched and luminal A subtypes were moderately concordant between different classifiers (median κ = 0.55). In contrast, the majority of the luminal B and normal-like tumors were classified differently depending on the classifier (median κ = 0.38 and 0.41, respectively).
We then computed risk predictions using the published algorithms of three prognostic gene signatures, MammaPrint (14) (MAMMAPRINT), Oncotype Dx (15) (ONCOTYPE), and Gene expression Grade Index (16) (GGI), and assessed the concordance between these risk classifications and the subtype classifications (Table 3, Figure 3, and Supplementary Figure 1, available online). Note that, similarly to Fan et al. (17), we did not use the commercially available assay, but we relied instead on the published microarray data to compute the risk classifications. Consistent with Fan et al. (17), basal-like, HER2-enriched, and luminal B tumors were almost all classified as high risk by the prognostic gene signatures, whereas luminal A and normal-like tumors were mostly classified as low risk (moderate to substantial concordance; Table 3) except for MAMMAPRINT, which may yield only fair concordance because of a small proportion of low-risk patients (approximately half of luminal A tumors are still predicted to be high risk; see Figure 3 and Supplementary Figure 1, available online).
We assessed the association between the subtype classifications and clinical parameters (Table 3, Figure 3, Supplementary Figure 1, available online). As expected, the majority of basal-like and luminal tumors were ER− and ER+, respectively (moderate to substantial concordance, Cramer’s V = 0.64–0.71). In contrast, the concordance with progesterone receptor was only moderate (V = 0.46–0.54). Most tumors defined as HER2 enriched are HER2 overexpressed by IHC or amplified by fluorescent in situ hybridization (fair to moderate concordance, V = 0.34–0.52), with the strongest concordance provided by the SCMs (V = 0.48–0.52). Tumors from the basal-like, HER2-enriched, and luminal B subtypes were mostly histological grade 3 tumors (moderate concordance, V = 0.51–0.58).
It is worth noting that no association was observed between the subtype classifications and the tumor size, nodal status, and age at diagnosis, suggesting that these features are independent of the molecular subtype (Table 3 and Figure 3, C).
Using data from the 1260 untreated patients with node-negative tumors (Tables 1 and 2), we analyzed the prognostic value of the six subtype classifiers and the published gene signatures. The survival curves were statistically significantly different between subtypes for all six classifiers (Figure 4, A; log-rank test, P < .001) and between the risk groups defined by prognostic gene signatures (Figure 4, B; log-rank test, P < .001). These results confirm the substantial prognostic value of molecular subtyping on early breast cancers without potentially confounding treatment effects. Although the survival curves from the SCMs, including SCMGENE, were virtually identical, SSPs exhibited some discrepancies for luminal B and normal-like tumors. We observed that a small group of 148 patients with luminal B tumors have the worst survival according to SSP2006, a result inconsistent with the other five classifiers (Figure 4, A). The survival curves for patients with tumors classified as normal-like vary depending on which SSP is used; the prognosis of patients with normal-like and luminal A tumors was similar according to SSP2006 (probability of survival at 5 years for patients with normal-like vs luminal A tumors, P = .84 and .87, respectively); normal-like was better for SSP2003 (P = .90 and .80 for normal-like vs luminal A, respectively) but slightly worse for PAM50 (P = .80 and .89 for normal-like vs luminal A, respectively).
Given the good survival of the intermediate-risk group identified by ONCOTYPE (probability of survival at 5 years for patients predicted as intermediate and low risk: P = .93 and .90, respectively; Figure 4, B), we decided to merge it with the low-risk group as in Fan et al. (17). The luminal A subtype defined by the classifiers exhibited a survival similar to the low-risk group defined by the prognostic gene signatures (probability of disease-free survival at 5 years, SCMs: P = .90–.91; SSPs: P = .81–.90; and gene signatures: P = .89–.92; Figure 4). To confirm the similarity of the survival curves, we statistically compared the prognostic value of the subtype classifiers and gene signatures to test whether one classification was better than another. We used a 10-fold CVPL (see “Methods and Statistical Analysis”) to assess the prognostic value of each classification (Supplementary Table 6, available online), and we observed that all subtype classifiers yielded statistically similar prognostic value (CVPL = 1.651–1.667, two-sided Wilcoxon signed rank test, Holm's corrected P ≥ .05; Supplementary Table 7, available online). The gene signatures yielded better prognostic value (CVPL = 1.648, 1.648, and 1.651 for MAMMAPRINT, ONCOTYPE, and GGI, respectively), but they did not statistically significantly outperform the subtype classifiers (two-sided Wilcoxon signed rank test, Holm's corrected P ≥ .05; Supplementary Table 7, available online), except for SSP2003, which appeared to yield statistically significantly worse prognostic value than GGI (two-sided Wilcoxon signed rank test, Holm's corrected P = .035; Supplementary Table 7, available online). These results suggest that none of the subtype classifiers statistically significantly outperform the others and that we lack evidence to claim superiority of the published gene signatures for prognosis. 
We also showed in a series of 676 tamoxifen-treated patients with ER+ tumors as defined by locally reviewed IHC that tumors identified by SCMGENE and the other subtype classifiers as discordant (either basal-like or HER2-enriched subtype) had a poorer survival (probability of survival at 5 years for patients with either luminal A or B tumors compared with those with discordant ER status: P = .86–.88 and P = .63–.76, respectively) suggesting that these patients did not benefit from tamoxifen therapy (Figure 5).
Despite the widespread recognition of the value of molecular subtyping, the complexity of the classification models, which use dozens to hundreds of genes, and uncertainty about their robustness and clinical relevance have been impediments to their general clinical use (18–21). Furthermore, quality assessment of molecular subtyping is complex because the truth is unknown. Using a collection of data from 5715 breast tumors, we analyzed five previously described classifiers (three SSPs and two SCMs) and compared these to SCMGENE, a simplified SCM-based classifier that uses only three genes that capture key biological processes in breast cancer, namely ER signaling, HER2 signaling, and proliferation. We used the prediction strength statistic (64) to quantify robustness of subtype classifications, defined as the capacity of an algorithm to assign the same tumors to the same subtypes regardless of the gene expression data used to build the classifier. We found SCMs to be statistically significantly more robust than SSPs. Moreover, among the SCMs, SCMGENE, our simple three-gene model, was statistically as robust as the published SCMs, which use hundreds of genes.
Each classifier demonstrated fair to substantial concordance, underscoring the validity of the subtypes. Among the molecular subtypes, the basal-like subtype was consistently identified independently of the classifier used. In contrast, the luminal A, luminal B, and normal-like tumors were more difficult to classify, consistent with the recent study of Weigelt et al. (21); the separation of the luminal group into A and B was not well supported by our analysis, probably because these subtypes are defined by expression of proliferation-related genes, which exhibit a continuum of expression levels (1,8,20,22). Like others (20,22), we did not find support for the normal-like subtype. It may be that this subtype is an artifact resulting from stromal contamination (22).
In the survival analysis of a large set of untreated node-negative breast cancer patients, we confirmed that all six classifiers had a statistically significant prognostic value (9,22). When assessing concordance with published prognostic gene signatures, we found that the vast majority of basal-like, HER2-enriched, and luminal B tumors were classified as high risk (8,17). Again, all the subtype classifiers and gene signatures yielded statistically similar prognostic value. Notably, we also showed that for a cohort of patients with ER+ tumors defined initially by IHC who were treated with adjuvant tamoxifen monotherapy, those patients with tumors identified by SCMGENE and the other subtype classifiers as basal-like or HER2-enriched had poorer survival, suggesting that these patients may not benefit from tamoxifen therapy. However, the clinical relevance in terms of response to therapy (for example, endocrine or anti-HER2) of those patients classified differently using IHC and gene expression remains unknown.
All subtype classifiers were statistically significantly associated with clinical variables widely used in management of breast cancer patients; the ER+ (IHC) tumors were particularly well identified by SCMOD2 and PAM50, whereas the HER2 amplified/overexpressed (FISH/IHC) tumors were highly concordant with the SCMGENE classification. However, we found no association between the subtype classifiers and tumor size, nodal status, or age at diagnosis. A large study involving central pathology measurement of traditional clinical parameters and gene expression profiling is needed to definitively draw conclusions about the complementarity or superiority of one technology over another; in addition, this would help determine the clinical relevance of the above concordance issues, that is, which method of subtype classification or central pathology using IHC would yield better predictive value for prescription of anti-HER2 or endocrine therapies. Ongoing prospective trials such as the MINDACT may facilitate such comparisons (75). Our data also suggest that accurate and reproducible measurements of ER, HER2, and proliferation can be used for molecular subtyping in breast cancer. This holds true for currently used methods of centrally reviewed IHC for ER, HER2, and Ki67, particularly for large clinical studies. Although IHC has well-known limitations in terms of intra-laboratory reproducibility and subjective and semiquantitative assessment of protein expression, IHC performed in a central laboratory undoubtedly provides significant additional prognostic value compared with local pathology. However, the good technical reproducibility and the quantitative nature of gene expression profiling (58) make expression-based classification models promising candidates to complement the current IHC markers widely used in breast cancer.
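The association between subtype calls and a categorical clinical variable, reported in this study with the Cramer V coefficient, can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used here: the function name `cramers_v` and the toy data are hypothetical, and the sketch applies no bias or continuity correction.

```python
from collections import Counter
from math import sqrt


def cramers_v(x, y):
    """Cramer V between two categorical sequences (no bias correction)."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(x)
    joint = Counter(zip(x, y))  # observed counts in the contingency table
    row = Counter(x)            # marginal counts for x (e.g. subtype)
    col = Counter(y)            # marginal counts for y (e.g. ER status)
    k = min(len(row), len(col)) - 1
    if k == 0:
        raise ValueError("each variable needs at least two observed categories")
    # Pearson chi-square statistic over all cells, including empty ones.
    chi2 = 0.0
    for r in row:
        for c in col:
            expected = row[r] * col[c] / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected
    return sqrt(chi2 / (n * k))


# Hypothetical data: subtype calls and IHC-based ER status for four tumors.
subtype = ["basal", "lumA", "basal", "lumA"]
er_status = ["neg", "pos", "neg", "pos"]
v = cramers_v(subtype, er_status)
```

The coefficient ranges from 0 (no association) to 1 (each category of one variable maps to a single category of the other), which is the case in this toy example.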
Our results also support the use of SCMGENE to provide molecular subtype classification for samples in large meta-analysis studies of gene expression profiling that involve data generated by different laboratories using diverse microarray technologies.
This study has several potential limitations. First, because our collection of breast cancer microarray data is composed of datasets that were retrospectively accrued, the selection of these patients may result in unbalanced distribution of the different molecular subtypes. Second, we used the normalized gene expression data as provided in public databases and authors’ websites; no attempts to renormalize the microarray data were made, although a robust scaling procedure ensured that the gene expressions were similarly distributed across datasets. Third, depending on the dataset, we did not annotate and map some probes used in the subtype classifiers because of the diversity of microarray platforms used in our compendium of datasets (Supplementary Table 3, available online). Fourth, the current implementation of the CVPL does not allow checking and correction for departure from the proportional hazards assumption. Finally, in contrast to SCMs, SSPs rely on hierarchical clustering, which makes automated identification of the main subtypes present in a specific dataset challenging (21); this may have affected their robustness estimations but also highlights the difficulties of using this type of classification method.
In conclusion, our study demonstrated that for breast cancer molecular subtyping, the simplest classification model, SCMGENE, which is based on the expression levels of three key genes and a simple Gaussian probabilistic model, was surprisingly concordant with the more complex published classifiers and yielded similar prognostic value. It also proved to be one of the most robust classifiers because it uses only ER, HER2, and AURKA gene expression, whereas the other classifiers rely on many more genes. The simplicity and robustness of the SCMGENE model provide an opportunity for wide application using a variety of expression data types. Moreover, our results suggest that, at present, for molecular subtyping of breast cancer, three genes provide adequate discrimination for clinical implementation; the clinical and biological relevance of the value of adding more genes remains to be demonstrated.
B.H.-K. was supported by a grant in aid from the Fulbright Commission for Educational Exchange for postdoctoral research. B.H.-K. and J.Q. were supported by a grant from the National Library of Medicine of the US National Institutes of Health (R01 LM010129-01). A.C.C. and J.Q. were supported by a grant from the Claudia Adams Barr Program in Innovative Basic Cancer Research. C.S. was supported by the Belgian National Foundation for Research (FNRS), the MEDIC Foundation, and the Breast Cancer Research Foundation (BCRF). C.D. was supported by the Belgian National Foundation for Research (FNRS), Belgium. S.L. was supported by the National Health and Medical Research Council of Australia (NHMRC) and the European Society of Medical Oncology (ESMO).
Supplementary File 1 (demo sbt.csv; available online) is a CSV file containing all the data necessary to easily reproduce the results of the article. The file contains the clinical information and the subtype classifications for all the datasets described in Table 1, with duplicated samples removed.
B. Haibe-Kains was responsible for the design and execution of the study and statistical analysis; B. Haibe-Kains, C. Desmedt, A. C. Culhane, S. Loi, G. Bontempi, and J. Quackenbush were responsible for data interpretation and article writing; J. Quackenbush and C. Sotiriou co-supervised the study. All authors read and approved the final article. J. Quackenbush and C. Sotiriou are co-last authors.
The funders did not have any involvement in the design of the study; the collection, analysis, and interpretation of the data; the writing of the article; or the decision to submit the article for publication. C. Sotiriou is named inventor on a patent application for the Gene expression Grade Index (GGI) used in this study. There are no other conflicts of interest.
We thank Mauro Delorenzi and Pratyaksha “Asa” Wirapati for their fruitful collaboration on the development of the Subtype Classification Model and Stefan Bentink for helpful discussion and advice. We also thank Sonal Jhaveri for her editorial assistance.