Tumors are currently diagnosed by histology and immunohistochemistry based on their morphology and protein expression, respectively. However, poorly differentiated cancers can be difficult to diagnose by routine histopathology. In addition, the histological appearance of a tumor cannot reveal the underlying genetic aberrations or biological processes that contribute to the malignant process. Here we developed a method of diagnostic classification of cancers from their gene-expression signatures and identified the genes that contributed to this classification.
We used the SRBCTs of childhood as a model because these cancers occasionally present diagnostic difficulties. For example, Ewing sarcoma is diagnosed by immunohistochemical evidence of MIC2 expression (ref. 18) and lack of expression of the leukocyte common antigen CD45 (excluding lymphoma), muscle-specific actin or myogenin (excluding RMS) (ref. 19). However, reliance on detection of MIC2 alone can lead to incorrect diagnosis as MIC2 expression occurs occasionally in other tumor types including RMS and NHL (ref. 1
Monitoring global gene-expression levels by cDNA microarrays provides an additional tool for elucidating tumor biology as well as the potential for molecular diagnostic classification of cancer (refs. 5–8,20–22). To date, classification and clustering tools using gene-expression data have not been rigorously tested for diagnostic classification of more than two categories. Other approaches that share the parametric nature of ANNs and have been used to classify gene-expression profiles include Support Vector Machines (ref. 23). Thus far, these other methods have not been fully explored to extract the genes or features that are most important for the classification performance and that will also be of interest to cancer biologists (ref. 24).
Here we have approached this problem using ANN-based models. We calibrated ANN models on the expression profiles of 63 SRBCTs of 4 diagnostic categories. Because of the limited amount of training data and the high performance achieved, we limited our analysis to linear (that is, no hidden layers) ANN models. Although other linear methods may perform as well, our method can easily accommodate nonlinear features of expression data if required. To compensate for heterogeneity within the tumor samples (which contain both malignant and stromal cells) and for possible artifacts due to growth of cell lines in tissue culture, we used both tumor samples (n = 23) and cell lines (n = 40). Data from these samples are complementary: tumor tissue, though complex, provides a gene-expression pattern representative of tumor growth in vivo, whereas cell lines contain a uniform malignant population without stromal contamination. Although NB was represented only by cell lines when calibrating the ANN models, all four NB tumors among the test samples were correctly diagnosed with high confidence. This not only demonstrates the high similarity of NB cell lines to the tumors of origin, but also validates the use of cell lines for ANN calibration. The calibrated ANN models accurately classified all 63 training SRBCTs and showed no evidence of over-training, demonstrating the robustness of this technique.
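To make the notion of a linear ANN concrete, the following is a minimal sketch of a classifier with no hidden layer: each of four output units is a weighted sum of the input expression values, trained by gradient descent. It is an illustration only, not the calibration procedure used in this study. The softmax output, the learning rate, the number of epochs, and the synthetic data set (40 samples, 8 hypothetical genes) are all assumptions introduced for the example.

```python
import math
import random


def softmax(z):
    """Convert raw output values into probabilities that sum to 1."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def predict_proba(W, b, x):
    """Linear model: one weighted sum per diagnostic category, no hidden layer."""
    z = [sum(w * v for w, v in zip(W[k], x)) + b[k] for k in range(len(W))]
    return softmax(z)


def classify(W, b, x):
    """Return the index of the category with the highest output."""
    p = predict_proba(W, b, x)
    return p.index(max(p))


def train_linear_ann(X, y, n_classes, lr=0.1, epochs=200):
    """Fit weights by per-sample gradient descent on the cross-entropy loss."""
    n_features = len(X[0])
    W = [[0.0] * n_features for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        for x, label in zip(X, y):
            p = predict_proba(W, b, x)
            for k in range(n_classes):
                g = p[k] - (1.0 if k == label else 0.0)
                for j in range(n_features):
                    W[k][j] -= lr * g * x[j]
                b[k] -= lr * g
    return W, b


def make_toy_data(n_per_class=10, n_classes=4, n_genes=8, seed=0):
    """Synthetic data (assumption): gene c is elevated in samples of class c."""
    rng = random.Random(seed)
    X, y = [], []
    for c in range(n_classes):
        for _ in range(n_per_class):
            x = [rng.gauss(0.0, 0.3) for _ in range(n_genes)]
            x[c] += 2.0
            X.append(x)
            y.append(c)
    return X, y


X, y = make_toy_data()
W, b = train_linear_ann(X, y, n_classes=4)
predictions = [classify(W, b, x) for x in X]
train_accuracy = sum(p == t for p, t in zip(predictions, y)) / len(y)
print("training accuracy:", train_accuracy)
```

Because the model has no hidden layer, each gene contributes to each output through a single weight, which keeps the number of free parameters small when few training samples are available.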
A potential difficulty with ANN-based pattern recognition models is elucidating causal links from the output to the original input data. To address this problem and to identify the most significant genes, we calculated the sensitivity of the classification to a change in the expression level of each gene and produced a list of genes ranked by their significance to the classification. Using this list, we established that the top 96 genes reduced the misclassifications to zero, which opens the potential for cost-effective fabrication of SRBCT subarrays for diagnostic use. When we tested the ANN models calibrated using the 96 genes on 25 blinded samples, we correctly classified all 20 SRBCT samples and rejected the 5 non-SRBCTs. This supports the potential use of these methods as an adjunct to routine histological diagnosis.
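The sensitivity ranking described above can be illustrated with a short sketch. Here the sensitivity of a gene is taken, as an assumption, to be the absolute change in each output per unit change in that gene's expression level, averaged over samples and outputs and estimated by central finite differences. The weights, the sample values, and the gene names (gene_a, gene_b, gene_c) are hypothetical and chosen only to make the example self-contained.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def model_outputs(W, b, x):
    """Outputs of a linear model with one unit per category."""
    z = [sum(w * v for w, v in zip(W[k], x)) + b[k] for k in range(len(W))]
    return softmax(z)


def gene_sensitivities(W, b, samples, eps=1e-4):
    """Mean absolute derivative of the outputs with respect to each gene."""
    n_genes = len(samples[0])
    n_outputs = len(W)
    totals = [0.0] * n_genes
    for x in samples:
        for j in range(n_genes):
            hi = list(x)
            lo = list(x)
            hi[j] += eps
            lo[j] -= eps
            o_hi = model_outputs(W, b, hi)
            o_lo = model_outputs(W, b, lo)
            totals[j] += sum(abs(u - v) for u, v in zip(o_hi, o_lo)) / (2 * eps)
    return [t / (len(samples) * n_outputs) for t in totals]


def rank_genes(names, sensitivities):
    """Sort genes from most to least influential on the classification."""
    return sorted(zip(names, sensitivities), key=lambda t: t[1], reverse=True)


# Hypothetical two-category model: gene_a has large weights, gene_b has none.
names = ["gene_a", "gene_b", "gene_c"]
W = [[3.0, 0.0, 0.5], [-3.0, 0.0, -0.5]]
b = [0.0, 0.0]
samples = [[0.1, 1.0, -0.2], [-0.3, 0.5, 0.4]]

sens = gene_sensitivities(W, b, samples)
ranked = rank_genes(names, sens)
for name, value in ranked:
    print(name, value)
```

A gene whose expression level does not affect any output receives a sensitivity of zero and falls to the bottom of the list, so truncating the ranked list yields a reduced gene set on which the models can be recalibrated.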
Although ANN analysis leads to identification of genes specific for a cancer, with implications for biology and therapy, a strength of this method is that it does not require genes to be exclusively associated with a single cancer type. This allows for classification based on complex gene-expression patterns. For example, the top 96 discriminating genes included not only those that had high (61) or low (12 BL and 1 EWS) levels of expression in one particular cancer, but also genes that were differentially expressed in two diagnostic categories as compared to the remaining two. Of the 16 genes highly expressed only in EWS, two (MIC2
) have been previously described (refs. 18,25). MIC2 immunostaining is currently used to diagnose EWS; however, we find that although MIC2 detects EWS with high sensitivity, it cannot by itself be used to discriminate EWS, as it was also expressed in several RMSs.
Our method identifies genes related to tumor histogenesis, but includes genes that may not normally be expressed in the corresponding mature tissue. Of the 14 genes that have not yet been reported to be highly expressed in EWS, 4 (TUBB5, ANXA1, NOE1
were neural-specific genes, lending more credence to the proposed neural histogenesis of EWS (ref. 30). Twenty genes were highly expressed only in RMS, including eight specific for muscle tissue and five (FGFR4
related to myogenesis. Among the latter, IGF2
expression has been reported in RMS (refs. 35,36), and only ITGA7
were found to be expressed in our two normal muscle samples. Of the genes specifically expressed in a cancer type, 41 have not been previously reported, including 7 ESTs with no currently known function. All of these warrant further study and might provide new insights into the biology of these cancers. For example, FGFR4, a tyrosine kinase receptor that is expressed during myogenesis and prevents terminal differentiation in myocytes (refs. 15,17), was found to be highly expressed only in RMS and not in normal muscle. The relatively strong differential expression of FGFR4 in RMS was confirmed by immunostaining of tissue microarrays (data not shown). Although the high expression of FGFR4 in most cases of RMS indicates that it may be relevant to the biology of this tumor, it is also expressed in some other cancers (ref. 37) and normal tissues (ref. 38). This indicates that although FGFR4 expression in RMS may be of biological and therapeutic interest, it is unlikely to be applicable as a sole differential diagnostic marker for these tumors.
As the main purpose of this study was to optimize the classification of these cancers, we used a stringent quality filter to include only the genes for which there were good measurements for all samples. This filter may remove genes that are highly expressed in some cancers but not expressed in others, as well as genes that appear not to be expressed because of an artifact in a particular cDNA spot. However, we found that this quality filtration produced more robust prediction models and led to the identification of a set of 96 genes highly relevant to these cancers. Nonetheless, we expect that this list can be expanded by the use of more comprehensive arrays and larger sample sets for training.
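As an illustration of such a filter, the sketch below keeps a gene only if every sample has a measurement that meets a quality threshold. The data layout, the per-spot quality score, and the threshold value are assumptions made for the example and do not reproduce the criterion used in this study.

```python
def filter_genes(expression, quality, min_quality):
    """Keep genes whose measurements are present and acceptable in all samples.

    expression: dict mapping gene name to a list of per-sample values
                (None marks a missing measurement).
    quality:    dict mapping gene name to a list of per-sample quality scores.
    """
    kept = {}
    for gene, values in expression.items():
        scores = quality[gene]
        has_all_values = all(v is not None for v in values)
        passes_quality = all(s >= min_quality for s in scores)
        if has_all_values and passes_quality:
            kept[gene] = values
    return kept


# Hypothetical measurements for three genes across three samples.
expression = {
    "gene_a": [1.2, 0.8, 1.5],
    "gene_b": [0.1, None, 0.3],   # missing in one sample
    "gene_c": [2.0, 2.2, 1.9],    # one low-quality spot
}
quality = {
    "gene_a": [30, 45, 28],
    "gene_b": [25, 0, 40],
    "gene_c": [50, 10, 33],
}

filtered = filter_genes(expression, quality, min_quality=20)
print(sorted(filtered))
```

The example also shows the trade-off noted above: gene_c is discarded because of a single poor spot, even though its other measurements are usable.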
Here we developed a method of diagnostic classification of cancers from their gene-expression signatures using ANNs. We also identified, in ranked order, the genes that contributed to this classification, and we were able to define a minimal set that can correctly classify our samples into their diagnostic categories. Although we achieved high sensitivity and specificity for diagnostic classification, we believe that with larger arrays and more samples it will be possible to improve on the sensitivity of these models for purposes of diagnosis in clinical practice. To our knowledge, this is the first application of ANNs for diagnostic classification of cancer using gene-expression data derived from cDNA microarrays. Future applications of these methods will include studies to classify cancers according to stage and biological behavior in order to predict prognosis and thereby direct therapy. We believe this offers an alternative and powerful technique for the detection of gene-expression signatures, and the discovery of novel genes that characterize a diagnostic subgroup may also identify new targets for therapy.