J Natl Cancer Inst. 2012 February 22; 104(4): 262–263.
Published online 2012 January 18. doi:  10.1093/jnci/djr557
PMCID: PMC3283539

Gene Signatures Revisited

The three-gene model of Haibe-Kains et al. (1) in this issue of the Journal continues a recent trend (2–4) in which a gene signature with only a few genes classifies cancer specimens as well as or better than signatures with many more genes. Given that cancer cells involve large numbers of genetic mutations (5), why should a signature with only a few genes perform about as well in the classification of tumors as a much longer signature?

One possible explanation comes from theories of cancer biology. It has been postulated that cancer development involves 5–7 rate-limiting steps (5). The few genes selected for classification could be among those most closely associated with the key steps that occur in the development of a subtype, regardless of the role of the mutations in carcinogenesis (5,6).

There is also a mathematical explanation, in addition to or instead of the biological explanation. With most classification rules, additional classifiers yield diminishing returns (7). The following example illustrates this phenomenon. Let red and green denote two classes analogous to different tumor subtypes, and let A and B denote markers analogous to genes. Suppose there are 16 red specimens and 16 green specimens. The goal is to use markers A and B to split the specimens into a “classify-as-green” set that is predominantly green and a “classify-as-red” set that is predominantly red. A random classification would, on average, yield 16 misclassifications. Classifying all specimens as red, or all as green, would also yield 16 misclassifications.

Now consider marker A alone for classification. The distributions of red and green specimens by level of marker A are summarized in the histograms of points (Figure 1, left). Biologically, it is sensible to consider a single split of the levels of marker A, as opposed to multiple splits yielding disjoint sets. In this example, the single split creates a classify-as-green set on the right side and a classify-as-red set on the left side, with seven misclassified points (four red and three green).

Now consider both markers A and B for classification. The distribution of red and green points (for the same data used with marker A alone) is shown as a scatter plot (Figure 1, right). Some shapes of classification regions are not biologically plausible. For example, a checkerboard pattern of red and green regions would be extremely unlikely, and multiple islands of red points in a sea of green points would not be plausible either. Generally, a line, a set of lines, or a smooth curve would likely separate the classify-as-red region from the classify-as-green region, and some commonly used methods of classification, such as discriminant analysis, look for the optimum separation of points in this scenario. [Another biologically plausible scenario is for the red set to be entirely surrounded by the green set, and classification models have been developed for this situation (3).] Here, a simple rectangular region is considered, which corresponds to an AND/OR rule: classify as green if marker A and marker B are each greater than the corresponding cut point, and classify as red if marker A or marker B is less than its corresponding cut point. In the example, use of both markers A and B for classification yields four misclassified points (one red and three green).

Thus, as the classification rule grew from no marker to one marker to two markers, the number misclassified went from 16 to 7 to 4. With each added marker, there is less room for improvement in classification, and some misclassifications will likely remain because of random noise no matter how many markers are included.
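The diminishing returns in this example can be sketched in a few lines of Python. The specimens, cut points, and function names below are made up for illustration and do not reproduce the data in Figure 1; only the AND/OR rule itself comes from the text. Specimens that fall exactly on a cut point are assigned to red here, which is an assumption.

```python
# Hypothetical specimens: (marker_a, marker_b, true_class). Not the Figure 1 data.
SPECIMENS = [
    (1, 1, "red"), (2, 5, "red"), (3, 2, "red"), (6, 2, "red"),
    (5, 5, "green"), (6, 7, "green"), (8, 8, "green"), (3, 6, "green"),
]
CUT_A = 4  # assumed cut point for marker A
CUT_B = 4  # assumed cut point for marker B


def no_marker(a, b):
    """Baseline: classify every specimen as red."""
    return "red"


def marker_a_only(a, b):
    """Single split on marker A: green to the right of the cut point."""
    return "green" if a > CUT_A else "red"


def markers_a_and_b(a, b):
    """AND/OR rule: green if A AND B exceed their cut points, otherwise red."""
    return "green" if (a > CUT_A and b > CUT_B) else "red"


def count_misclassified(specimens, rule):
    """Count specimens whose assigned class differs from the true class."""
    return sum(1 for a, b, label in specimens if rule(a, b) != label)


errors = [count_misclassified(SPECIMENS, rule)
          for rule in (no_marker, marker_a_only, markers_a_and_b)]
print(errors)  # [4, 2, 1] for these made-up specimens
```

With these made-up values the error count falls from 4 to 2 to 1, so the second marker removes fewer errors than the first, the same pattern as the 16, 7, 4 sequence in the text.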

Figure 1
The left panel depicts classification by only marker A with histograms showing seven misclassifications (four red and three green). The right panel depicts classification of the same data by markers A and B with a scatter plot showing four misclassifications ...

The clinical goal of using gene signatures for classifying cancer specimens is to improve treatment decisions. For this goal, the most relevant evaluation in Haibe-Kains et al. (1) is the prediction of survival in untreated patients with node-negative tumor based on subtype. However, are the subtypes really needed for predicting survival? Why not use the three genes identified by Haibe-Kains et al. (1) as the starting point for developing a new rule to classify patients based on survival? With this strategy, investigators can augment the classifiers under consideration (before a final few are likely selected in accordance with diminishing returns) to include clinical variables (such as tumor stage) and expression of genes in specimens collected from the tumor microenvironment. In some studies, gene expression levels have not improved classification performance substantially over clinical variables (8), suggesting that clinical variables should also be considered as classifiers. Given the important role of the microenvironment in carcinogenesis (6,9,10), it is not surprising that gene expression levels from stromal tissue (11) or fibroblast serum (12) have been used in cancer classification. In one cancer classification study (3), the few genes identified as classifiers were thought to be related to the disruption of cell signaling between tumors and the microenvironment. Also, by investigating survival as a direct function of classifiers, investigators can evaluate the gain of including additional classifiers in a medical decision-making framework involving the benefits of correct classification and costs of incorrect classification (13,14).
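One way to make this weighing of benefits and costs concrete is the net benefit measure from decision curve analysis (13), in which each false positive is weighted by the odds of the risk threshold. The sketch below assumes that definition; the function name and all counts are hypothetical and are not taken from Haibe-Kains et al. (1).

```python
def net_benefit(true_pos, false_pos, n, threshold):
    """Net benefit of a classification rule at a given risk threshold.

    threshold / (1 - threshold) is the weight given to a false positive
    relative to a true positive.
    """
    if not 0 < threshold < 1:
        raise ValueError("threshold must be strictly between 0 and 1")
    weight = threshold / (1 - threshold)
    return true_pos / n - weight * false_pos / n


# Made-up counts for 100 patients at a 20% risk threshold.
nb_three_genes = net_benefit(true_pos=30, false_pos=20, n=100, threshold=0.2)
nb_with_clinical = net_benefit(true_pos=32, false_pos=20, n=100, threshold=0.2)

# Gain from adding a hypothetical clinical variable to the rule.
gain = nb_with_clinical - nb_three_genes
print(round(nb_three_genes, 3), round(nb_with_clinical, 3), round(gain, 3))
```

A small gain in net benefit from an added classifier, as in these invented numbers, is the kind of result that diminishing returns would lead one to expect.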

Of course, focusing only on untreated patients provides limited information for making treatment decisions because there is no comparison of outcomes with treated patients. [Haibe-Kains et al. (1) investigated subtypes in a separate series of tamoxifen-treated patients but did not perform a comparison.] Markers used for making treatment decisions are sometimes called predictive markers, as opposed to prognostic markers, which predict survival in an untreated group (15). To investigate predictive markers in a randomized trial of breast cancer patients, data on the three genes identified by Haibe-Kains et al. (1) and other candidate classifiers could be collected in each randomization group. If the number of additional classifiers under consideration is large, the data should be split into a training sample for selection of classifiers and formulation of the classification rule and a test sample for evaluation (16). One analytic strategy is to fit a risk prediction model for the candidate classifiers in each randomization group and compute a risk difference as the classification rule (17,18). Plotting the estimated difference in survival between randomization groups vs the interval of risk difference provides useful information for identifying subgroups that would most benefit from treatment (18,19). Thus, the three genes identified by Haibe-Kains et al. (1) can be a good starting point for more clinically relevant investigations related to predictive markers.
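The risk-difference strategy can be illustrated with a deliberately simplified sketch. It assumes a single binary marker and estimates risk as the event proportion within each marker level, rather than with a fitted risk prediction model as in (17,18). The records, group labels, function names, and the minimum-benefit threshold are all hypothetical.

```python
from collections import defaultdict

# Made-up trial records: (randomization_group, marker_high, event).
RECORDS = (
    [("control", True, 1)] * 6 + [("control", True, 0)] * 4
    + [("control", False, 1)] * 2 + [("control", False, 0)] * 8
    + [("treated", True, 1)] * 2 + [("treated", True, 0)] * 8
    + [("treated", False, 1)] * 2 + [("treated", False, 0)] * 8
)


def risk_by_marker(records, group):
    """Event proportion at each marker level within one randomization group."""
    events, totals = defaultdict(int), defaultdict(int)
    for rec_group, marker_high, event in records:
        if rec_group == group:
            totals[marker_high] += 1
            events[marker_high] += event
    return {level: events[level] / totals[level] for level in totals}


def risk_difference(records):
    """Estimated risk without treatment minus risk with treatment, by marker level."""
    control = risk_by_marker(records, "control")
    treated = risk_by_marker(records, "treated")
    return {level: control[level] - treated[level] for level in control}


def recommend_treatment(records, min_benefit):
    """Classify a marker level as 'treat' if its risk difference exceeds min_benefit."""
    return {level: diff > min_benefit
            for level, diff in risk_difference(records).items()}


differences = risk_difference(RECORDS)
decisions = recommend_treatment(RECORDS, min_benefit=0.1)
print(differences, decisions)
```

In these invented data only the marker-high subgroup shows a reduction in risk with treatment, so only that subgroup is classified for treatment. In a real analysis the rule would be formulated on a training sample and evaluated on a test sample, as noted above (16).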


National Cancer Institute.


1. Haibe-Kains B, Desmedt C, Loi S. A three-gene model to robustly identify breast cancer molecular subtypes. J Natl Cancer Inst. 2012;104(4):311–325. [PMC free article] [PubMed]
2. Geman D, d’Avignon C, Naiman DQ, Winslow RL. Classifying gene expression profiles from pairwise mRNA comparisons. Stat Appl Genet Mol Biol. 2004;3:19. [PMC free article] [PubMed]
3. Baker SG. Simple and flexible classification via Swirls-and-Ripples. BMC Bioinformatics. 2010;11:452. [PMC free article] [PubMed]
4. Wang X, Simon R. Microarray-based cancer prediction using single genes. BMC Bioinformatics. 2011;12:391. [PMC free article] [PubMed]
5. Stratton MR, Campbell PJ, Futreal PA. The cancer genome. Nature. 2009;458(7239):719–724. [PMC free article] [PubMed]
6. Soto AM, Sonnenschein C. The tissue organization field theory of cancer: a testable replacement for the somatic mutation theory. Bioessays. 2011;33(5):332–340. [PubMed]
7. Hand DJ. Classifier technology and the illusion of progress. Statist Sci. 2006;21(1):1–14.
8. Dunkler D, Michiels S, Schemper M. Gene expression profiling. Does it add predictive accuracy to clinical characteristics in cancer progression? Eur J Cancer. 2007;43(4):745–751. [PubMed]
9. Bissell MJ, Hines WC. Why don’t we get more cancer? A proposed role of the microenvironment in restraining cancer progression. Nat Med. 2011;17(3):320–329. [PMC free article] [PubMed]
10. Baker SG. TOFT better explains experimental results in cancer research than SMT. Bioessays. 2011;33(12):919–921. [PubMed]
11. Finak G, Sadakova S, Pepin F, et al. Gene expression signatures of morphologically normal breast tissue identify basal-like tumors. Breast Cancer Res. 2006;8:R58. [PMC free article] [PubMed]
12. Chang HY, Sneddon JB, Alizadeh AA. Gene expression signature of fibroblast serum response predicts human cancer progression: similarities between tumors and wounds. PLoS Biol. 2004;4(2):206–214. [PMC free article] [PubMed]
13. Vickers AJ, Elkin EB. Decision curve analysis: a novel method for evaluating prediction models. Med Decis Making. 2006;26(6):565–574. [PMC free article] [PubMed]
14. Baker SG. Putting risk prediction in perspective: relative utility curves. J Natl Cancer Inst. 2009;101(22):1538–1542. [PMC free article] [PubMed]
15. Freidlin B, McShane LM, Korn EL. Randomized clinical trials with biomarkers: design issues. J Natl Cancer Inst. 2010;102(3):152–160. [PMC free article] [PubMed]
16. Freidlin B, Simon R. Adaptive signature design: an adaptive clinical trial design for generating and prospectively testing a gene expression signature for sensitive patients. Clin Cancer Res. 2005;11(21):7872–7878. [PubMed]
17. Vickers AJ, Kattan MW, Sargent DJ. Method for evaluating prediction models that apply the results of randomized trials to individual patients. Trials. 2007;9:14. [PMC free article] [PubMed]
18. Cai T, Tian L, Wong PH, Wei LJ. Analysis of randomized comparative clinical trial data for personalized treatment selections. Biostatistics. 2011;12(2):270–282. [PMC free article] [PubMed]
19. Bonetti M, Gelber RD. A graphical method to assess treatment-covariate interactions using the Cox model on subsets of the data. Stat Med. 2000;19(19):2595–2609. [PubMed]

Articles from JNCI Journal of the National Cancer Institute are provided here courtesy of Oxford University Press