The need for an objective, biologically based classification of gliomas is exemplified by the high rate of divergent diagnoses, the inexact prognostic capabilities, and the poor therapeutic predictive properties of the current histopathologic classification schemes (10). Here, we report the development of an unbiased, gene expression–based, histology-independent glioma classification system by applying unsupervised machine learning algorithms to the gene expression profiles of the largest collection of gliomas published to date, covering the main WHO histologic types and grades. Our analyses show that gliomas can be separated into two main types, O and G. The O main type is further divided into OA and OB subtypes, and the G main type can be divided into four subtypes, designated GA1, GA2, GB1, and GB2, that show hierarchically nested relationships (Supplementary Fig. S2B).
Our classification system differs from previously reported ones in several important respects. First, the classes were discovered based on the expression profiles of all glioma histologic types and grades, without any a priori exclusion of samples based on a clinical variable. Second, we did not select the genes to be analyzed from a manually curated list of genes thought to be relevant but rather included all informative genes to represent the entire transcriptomic profiles of individual glioma patients. Third, we selected the subtype-defining classifiers based solely on their statistical ability to separate the classes rather than on our preconceived notion of their functionality. That is, we elected to allow the computational analyses to determine the appropriate classification scheme based entirely on the biology of a wide spectrum of gliomas, as manifested by their genomic-scale transcriptomic profiles, in the hope that the biology would translate to a clinically relevant classification. By respecting the concordant molecular signature behavior of the tumors in the analyses, we have achieved a truly unsupervised solution. The difference in how this analysis was performed compared with prior studies is not trivial because the classification scheme we have constructed is based purely on observed gene expression profiles, without predefined genetic, pathologic, or clinical assumptions, and thereby represents a glioma classification scheme based solely on unbiased biological data. The power and reliability of the classification system described here rest not only on the purely statistical and objective way in which the analyses were performed but also on the fact that all six glioma subtypes and their corresponding classifiers were derived from a training set, confirmed and validated using an in-house, independently generated test set larger than the training set, and further validated using two independent, externally generated data sets.
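To make the distinction between label-free and label-driven analysis concrete, the following is a minimal sketch of an unsupervised workflow on synthetic data. It is not the pipeline used in this study: the variance filter, the two-cluster k-means routine, and all function names (`select_informative_genes`, `two_means`) are assumptions chosen only for illustration. The point is that neither gene selection nor grouping consults histology, grade, or survival.

```python
import random
import statistics

def select_informative_genes(profiles, n_keep):
    """Keep the n_keep genes with the highest variance across samples.

    No class labels or clinical variables are used (illustrative filter only).
    """
    n_genes = len(profiles[0])
    variances = [statistics.pvariance([p[g] for p in profiles])
                 for g in range(n_genes)]
    ranked = sorted(range(n_genes), key=lambda g: variances[g], reverse=True)
    return sorted(ranked[:n_keep])

def sq_dist(a, b):
    """Squared Euclidean distance between two expression vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def two_means(points, n_iter=20):
    """Plain k-means with k = 2 and a deterministic initialization:
    the first sample and the sample farthest from it."""
    c0 = points[0]
    c1 = max(points, key=lambda p: sq_dist(p, c0))
    centroids = [list(c0), list(c1)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [0 if sq_dist(p, centroids[0]) <= sq_dist(p, centroids[1]) else 1
                  for p in points]
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster empties
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Synthetic data (hypothetical): 40 samples x 50 genes, with two hidden
# groups that differ only in the first 5 genes.
rng = random.Random(0)
profiles = []
for i in range(40):
    shift = 0.0 if i < 20 else 4.0
    informative = [rng.gauss(shift, 1.0) for _ in range(5)]
    noise = [rng.gauss(0.0, 0.5) for _ in range(45)]
    profiles.append(informative + noise)

genes = select_informative_genes(profiles, n_keep=5)
reduced = [[p[g] for g in genes] for p in profiles]
labels = two_means(reduced)
print("selected genes:", genes)
print("cluster labels:", labels)
```

In this toy setting the hidden grouping is recovered from expression alone; a supervised analysis would instead have used the group identity (or an outcome such as survival) to choose the genes in the first place.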
Despite the first-time use of a purely unsupervised methodology, our results are generally consistent with, and extend, the findings from previously published smaller glioma classification studies (11). For example, our O main type is very similar to the GSE4271 proneural class, and our G main type extensively overlaps with the GSE4271 mesenchymal and GSE4271 proliferative classes. More specifically, the GSE4271 proliferative class largely overlaps with our GA subtype, both being represented by up-regulated cell proliferation–associated genes, whereas the mesenchymal class largely overlaps with our GB subtype, both being characterized by invasive and mesenchymal tissue–associated genes. Our subtypes, however, further refine this classification system, as exemplified by the fact that the OA and OB classification separates the GSE4271 proneural tumors into two different subtypes with significantly different overall survival. Furthermore, our analyses show that the GSE4271 proliferative and GSE4271 mesenchymal classes can be broken into four different subtypes based on the differences in their classifier signatures.
It should be noted that the lack of survival difference between the most aggressive tumors that fell into the GBM-rich subtypes should not be seen as a weakness in or failure of our classification system. In contrast to diseases such as breast cancer and lymphoma, which tend to have much greater variability in the natural history of the disease (i.e., survival), the vast majority of glioblastoma patients have a rather uniform and poor survival. Our classification system is consistent with those by Phillips and others and with data from large clinical trials conducted over the last two decades that consistently show a small subset of GBM patients with better-than-expected survival (as represented by the GBMs in our O main type) but a majority of GBM patients with uniformly short survival. Thus, although survival can be a correlate of biology, biology is not necessarily a correlate of survival, which is why we intentionally chose to perform an unsupervised analysis rather than one based on survival, as has so often been done in the past. Certainly, patients with aggressive tumors with vastly different biologies can have roughly equivalent survivals (e.g., GBM, pancreatic cancer, metastatic lung cancer). In the case of tumors within the brain, survival may, in particular, be more a surrogate of tumor growth rate and lack of therapeutic responsiveness than of biology, given the physiologic constraints of a mass growing within the closed compartment of the cranium and the vital and sensitive nature of the underlying brain tissue. This is likely why nearly all malignant tumors within the brain, whether they be primary tumors (gliomas, medulloblastomas) or secondary tumors (metastatic lung cancer, melanoma), tend to have a uniformly poor and homogeneous survival (6–12 months) if the specific tumor is not inherently sensitive to therapy.
Nevertheless, the accumulating molecular data clearly show that, despite survival homogeneity, GBMs are heterogeneous at the molecular and genetic level (11). Our classification scheme defines and categorizes this heterogeneous biology, which will be crucial for designing clinical trials of molecularly targeted agents that are enriched for patients most likely to respond to therapy and ultimately for the practice of patient-specific or “personalized medicine.” Thus, the clinical utility of these GBM-rich subtypes will only be realized with the acquisition and analysis of much more corollary clinical data.
We entered into this project with the bias that the standard WHO classification system would be very poorly representative of the underlying tumor biology. For the most part, we found this to be true. Although the WHO classification of GBM versus non-GBM pathology was largely upheld in our two major expression groups (96% of all GBMs in the GBM-rich group versus 75% of non-GBMs in the oligo-rich group), tumors designated as WHO grade 2 or grade 3 astrocytomas, oligodendrogliomas, or mixed gliomas were randomly distributed between the expression subgroups, whether that designation was based on the original pathologic diagnosis (home institution) or on our central pathology review (data not shown). The fact that there was a discrepancy of nearly 30% to 40% in the designation of grade and glioma subtype between the original and central pathology review of non-GBM tumors (data not shown) reinforces the subjective nature of the currently used classification system and testifies to the potential problem of using the WHO system to group specific patients into particular biological strata. By contrast, our classification system, derived from and based purely on computational analyses of gene expression, consistently groups tumors into one of six groups across four very divergent data sets constituting over 700 gliomas.
The classifier gene sets described here showed >92% prediction accuracy in a 10-fold cross-validation (Supplementary Table S4). Additionally, the classifiers were successfully used to predict and assign the derived subtypes in a large independent test set and further to stratify two external data sets into six subtypes (Supplementary Fig. S4). The robustness of our gene classifiers suggests their potential as a useful clinical tool for glioma diagnosis once more extensive validation is undertaken. Indeed, our classification system allowed us to identify a subgroup of GBMs in the TCGA database that fell into the oligo-rich group and had markedly better survival than the remainder of the group, demonstrating the potential power of this biologically based classification system.
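For readers less familiar with how a cross-validated prediction accuracy is obtained, the sketch below illustrates 10-fold cross-validation on synthetic data. The text does not specify the classifier, so a nearest-centroid rule is assumed here purely for illustration; the function names, the three placeholder subtype labels ("A", "B", "C"), and the simulated expression values are likewise hypothetical and are not the classifier gene sets or data of this study.

```python
import random

def nearest_centroid_fit(X, y):
    """Compute the mean expression vector (centroid) of each class."""
    centroids = {}
    for label in set(y):
        members = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def nearest_centroid_predict(centroids, x):
    """Assign a sample to the class whose centroid is closest (squared Euclidean)."""
    return min(
        centroids,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])),
    )

def cross_val_accuracy(X, y, n_folds=10, seed=0):
    """k-fold cross-validation: each sample is predicted exactly once by a
    model fitted on the other folds; returns the fraction predicted correctly."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    correct = 0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        model = nearest_centroid_fit([X[i] for i in train], [y[i] for i in train])
        correct += sum(nearest_centroid_predict(model, X[i]) == y[i]
                       for i in held_out)
    return correct / len(X)

# Synthetic data (hypothetical): 3 placeholder subtypes, 30 samples each,
# 6 "classifier genes"; each subtype over-expresses one gene.
rng = random.Random(1)
X, y = [], []
for k, name in enumerate(["A", "B", "C"]):
    for _ in range(30):
        X.append([rng.gauss(3.0 if g == k else 0.0, 1.0) for g in range(6)])
        y.append(name)

accuracy = cross_val_accuracy(X, y)
print(f"10-fold cross-validated accuracy: {accuracy:.2f}")
```

Because every prediction is made on a sample withheld from model fitting, the resulting accuracy estimates performance on unseen tumors, although it does not replace validation on independently generated data sets.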
Finally, our GSEA analysis points to potential functional properties of the different subtypes identified in this analysis. For example, the survival advantage of the OA subtype might in part be explained by an intact p53 regulatory pathway, as represented by the activity of the ARF pathway, and by a tendency toward genomic stability through the maintenance of chromosome ends, as suggested by the activity of the Tel pathway. In contrast, the proto-oncogene signaling of the PDGF pathway (known to be aberrantly regulated in a significant percentage of astrocytomas) may confer higher proliferative properties on tumors in the OB subtype. Needless to say, the true functional significance of pathway activation within tumors in each of these subtypes remains to be elucidated through biological studies. Nevertheless, this analysis, as well as others like it, begins to build a new framework by which basic and clinical scientists can investigate the biological, functional, and clinical significance of these novel molecular classes, with the hope of ultimately deriving a tumor classification system that will have both biological and therapeutic predictive value.