The three-gene model of Haibe-Kains et al. (1) in this issue of the Journal continues a recent trend (2–4) in which a gene signature with only a few genes classifies cancer specimens as well as or better than signatures with many more genes. Given that cancer cells involve large numbers of genetic mutations (5), why should a signature with only a few genes perform about as well in the classification of tumors as a much longer signature?
One possible explanation comes from theories of cancer biology. It has been postulated that cancer development involves 5–7 rate-limiting steps (5). The few genes selected for classification could be among those most closely associated with the key steps that occur in the development of a subtype, regardless of the role of the mutations in carcinogenesis (5,6).
There is also a mathematical explanation, which may apply in addition to or instead of the biological explanation. With most classification rules, there are diminishing returns with additional classifiers (7). The following example illustrates this phenomenon. Let red and green denote two classes analogous to different tumor subtypes. Let A and B denote markers that could be analogous to genes. Suppose there are 16 red specimens and 16 green specimens. The goal is to use markers A and B to split the specimens into a “classify-as-green” set that is predominantly green and a “classify-as-red” set that is predominantly red. A random classification would, on average, yield 16 misclassifications. Classifying all specimens as red, or all as green, would also yield 16 misclassifications.
Now consider marker A for classification. The distributions of red and green specimens by level of marker A are summarized in the histograms of points (Figure 1, left). Biologically, it is sensible to consider a single split of the levels of marker A, as opposed to multiple splits yielding disjoint sets. In this example, the single split creates a classify-as-green set on the right side and a classify-as-red set on the left side with seven misclassified points (four red and three green).
Now consider both markers A and B for classification. The distribution of red and green points (for the same data used with marker A alone) is shown as a scatter plot (Figure 1, right). Some shapes of classification regions are not biologically plausible. For example, a checkerboard pattern of red and green regions would be extremely unlikely. Likewise, multiple islands of red points in a sea of green points would not be plausible. Generally, a line, a set of lines, or a smooth curve would likely separate the classify-as-red region from the classify-as-green region, and some commonly used methods of classification, such as discriminant analysis, look for the optimum separation of points in this scenario. [Another biologically plausible scenario is for the red set to be entirely surrounded by the green set, and classification models have been developed for this situation (3).] Here, a simple rectangular region is considered, which corresponds to an AND/OR rule: Classify as green if marker A and marker B are each greater than the corresponding cut point, and classify as red if marker A or marker B is less than the corresponding cut point. In the example, use of both markers A and B for classification yields four misclassified points (one red and three green). Thus, as the classification rule was expanded from no marker to one marker to two markers, the number misclassified fell from 16 to 7 to 4. With each addition of a marker, there is less room for improvement in classification, and some misclassifications will likely remain as random noise no matter how many markers are included.
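The single-split and rectangular AND/OR rules described above can be sketched in a few lines of Python. This is only an illustrative sketch: the eight points in `TOY_POINTS` are invented for the purpose and are not the data in Figure 1, the function names are hypothetical, and a marker value exactly equal to the cut point is arbitrarily assigned to red.

```python
from itertools import product

# Hypothetical toy data (NOT the Figure 1 data): (marker A, marker B, true class).
TOY_POINTS = [
    (1, 1, "red"), (2, 5, "red"), (3, 2, "red"), (6, 1, "red"),
    (4, 4, "green"), (5, 5, "green"), (6, 6, "green"), (2, 6, "green"),
]

def misclassified_one_marker(points, cut_a):
    """Single split on marker A: classify as green if A > cut_a, else red."""
    errors = 0
    for a, _b, label in points:
        predicted = "green" if a > cut_a else "red"
        errors += predicted != label
    return errors

def misclassified_two_markers(points, cut_a, cut_b):
    """Rectangular AND/OR rule: green if A > cut_a AND B > cut_b, else red."""
    errors = 0
    for a, b, label in points:
        predicted = "green" if (a > cut_a and b > cut_b) else "red"
        errors += predicted != label
    return errors

def best_error_counts(points):
    """Return the smallest error counts for the one- and two-marker rules."""
    a_values = sorted({a for a, _b, _label in points})
    b_values = sorted({b for _a, b, _label in points})
    # Candidate cut points: one below the minimum, plus every observed value.
    a_cuts = [a_values[0] - 1] + a_values
    b_cuts = [b_values[0] - 1] + b_values
    best_one = min(misclassified_one_marker(points, c) for c in a_cuts)
    best_two = min(
        misclassified_two_markers(points, ca, cb)
        for ca, cb in product(a_cuts, b_cuts)
    )
    return best_one, best_two

if __name__ == "__main__":
    n_red = sum(label == "red" for _a, _b, label in TOY_POINTS)
    no_marker = min(n_red, len(TOY_POINTS) - n_red)  # classify all as one class
    one_marker, two_markers = best_error_counts(TOY_POINTS)
    print(no_marker, one_marker, two_markers)
```

Because the candidate cut points for marker B include a value below every observation, the one-marker rule is a special case of the two-marker rule in this sketch, so adding marker B can never increase the minimum error count on the same data; the question is only how much it decreases.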
The clinical goal of using gene signatures for classifying cancer specimens is to improve treatment decisions. For this goal, the most relevant evaluation in Haibe-Kains et al. (1) is the subtype-based prediction of survival in untreated patients with node-negative tumors. However, are the subtypes really needed for predicting survival? Why not use the three genes identified by Haibe-Kains et al. (1) as the starting point for developing a new rule to classify patients based on survival? With this strategy, investigators can augment the classifiers under consideration (before a final few are likely selected in accordance with diminishing returns) to include clinical variables (such as tumor stage) and expression of genes in specimens collected from the tumor microenvironment. In some studies, gene expression levels have not improved classification performance substantially over clinical variables (8), suggesting that clinical variables should also be considered as classifiers. Given the important role of the microenvironment in carcinogenesis (6,9,10), it is not surprising that gene expression levels from stromal tissue (11) or fibroblast serum (12) have been used in cancer classification. In one cancer classification study (3), the few genes identified as classifiers were thought to be related to the disruption of cell signaling between tumors and the microenvironment. Also, by investigating survival as a direct function of classifiers, investigators can evaluate the gain of including additional classifiers in a medical decision-making framework involving the benefits of correct classification and costs of incorrect classification (13,14).
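As a rough illustration of weighing benefits against costs, the following Python sketch computes a simple net utility for rules with zero, one, and two markers. It reuses the misclassification counts from the red and green example (16, 7, and 4 of 32 specimens) purely for illustration; the benefit and cost values and the function name `net_utility` are assumptions made here, not quantities taken from the cited decision-making frameworks (13,14).

```python
def net_utility(n_total, n_misclassified, benefit_correct, cost_incorrect):
    """Total benefit of correct classifications minus total cost of incorrect ones."""
    n_correct = n_total - n_misclassified
    return benefit_correct * n_correct - cost_incorrect * n_misclassified

N_SPECIMENS = 32
# Misclassification counts from the red/green example, keyed by number of markers.
MISCLASSIFIED_BY_MARKERS = {0: 16, 1: 7, 2: 4}

# Hypothetical weights: an incorrect classification is assumed to cost twice
# as much as a correct classification gains.
BENEFIT_CORRECT = 1.0
COST_INCORRECT = 2.0

utilities = {
    k: net_utility(N_SPECIMENS, m, BENEFIT_CORRECT, COST_INCORRECT)
    for k, m in MISCLASSIFIED_BY_MARKERS.items()
}
# Gain in net utility from adding the k-th marker.
gains = {k: utilities[k] - utilities[k - 1] for k in (1, 2)}

if __name__ == "__main__":
    print(utilities)
    print(gains)
```

Under these assumed weights, the second marker adds a smaller gain than the first, which is the kind of comparison such a framework makes explicit.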
Of course, focusing only on untreated patients provides limited information for making treatment decisions because there is no comparison of outcomes with treated patients. [Haibe-Kains et al. (1) investigated subtypes in a separate series of tamoxifen-treated patients but did not perform a comparison.] Markers used for making treatment decisions are sometimes called predictive markers, as opposed to prognostic markers, which predict survival in an untreated group (15). To investigate predictive markers in a randomized trial of breast cancer patients, data on the three genes identified by Haibe-Kains et al. (1) and other candidate classifiers could be collected in each randomization group. If the number of additional classifiers under consideration is large, the data should be split into a training sample for selection of classifiers and formulation of the classification rule and a test sample for evaluation (16). One analytic strategy is to fit a risk prediction model for the candidate classifiers in each randomization group and compute a risk difference as the classification rule (17,18). Plotting the estimated difference in survival between randomization groups vs the interval of risk difference provides useful information for identifying subgroups that would most benefit from treatment (18,19). Thus, the three genes identified by Haibe-Kains et al. (1) can be a good starting point for more clinically relevant investigations related to predictive markers.
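The risk-difference strategy can be sketched as follows. This is a deliberately simplified sketch, not the method of the cited references (17,18): the "risk prediction model" here is just the observed event proportion within each marker profile, the profile labels and records are invented, and the function names are hypothetical. A real analysis would fit a regression-type model on the training sample and evaluate the rule on the test sample.

```python
from collections import defaultdict

def fit_stratum_risks(records):
    """Estimate risk per marker profile as the observed event proportion.

    records: iterable of (marker_profile, event) pairs, with event equal to 1 or 0.
    """
    counts = defaultdict(lambda: [0, 0])  # profile -> [events, patients]
    for profile, event in records:
        counts[profile][0] += event
        counts[profile][1] += 1
    return {profile: events / n for profile, (events, n) in counts.items()}

def risk_difference_rule(control_records, treated_records):
    """Risk in the control group minus risk in the treated group, per profile."""
    risk_control = fit_stratum_risks(control_records)
    risk_treated = fit_stratum_risks(treated_records)
    shared_profiles = risk_control.keys() & risk_treated.keys()
    return {p: risk_control[p] - risk_treated[p] for p in shared_profiles}

# Hypothetical records for two randomization groups (invented for illustration).
CONTROL = [("high", 1), ("high", 1), ("high", 1), ("high", 0),
           ("low", 1), ("low", 0), ("low", 0), ("low", 0)]
TREATED = [("high", 1), ("high", 0), ("high", 0), ("high", 0),
           ("low", 1), ("low", 0), ("low", 0), ("low", 0)]

if __name__ == "__main__":
    differences = risk_difference_rule(CONTROL, TREATED)
    # Profiles with the largest risk difference are the candidates for
    # the greatest benefit from treatment.
    for profile, diff in sorted(differences.items(), key=lambda kv: -kv[1]):
        print(profile, diff)
```

In this invented example, the "high" profile shows a larger estimated risk reduction with treatment than the "low" profile, so a rule based on the risk difference would single out the "high" subgroup for treatment.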