A class predictor, or predictive classifier, is a computable function which can be used to predict a class from an expression profile. Predictive classifiers can be of considerable importance for guiding treatment selection in medicine, although several levels of validation are needed before such classifiers are “ready for prime time” [35]. In developing a predictive classifier, the emphasis should be on predictive accuracy, sensitivity, specificity, and positive and negative predictive values, not on controlling the false discovery rate, goodness of fit to the data, or the statistical significance of regression coefficients.
Development of a predictive classifier requires specification of (i) the mathematical/computational function used to relate an expression profile to a class; (ii) the genes whose expression levels are utilized in the prediction; and (iii) the parameters, e.g., the weights placed on the expression levels of individual genes and the cut-points used in the prediction [7]. It is often not recognized that a predictive classifier is not just a set of genes; it is a completely specified computable function that can be used to classify individual patients whose expression profiles are determined.
The development of a predictive classifier is similar to the development of a statistical regression function, except that the former predicts a class identifier rather than a continuous value. Statistical regression models are generally built using data in which the number of cases (n) is large relative to the number of candidate variables (p). In the development of class predictors using gene expression data, however, the number of candidate predictors is generally orders of magnitude greater than the number of cases. This has two important implications. One is that only simple class prediction functions should be considered, because functions with too many degrees of freedom will over-fit the data and predict poorly for independent samples [37]. The second important implication is that the data used for evaluating the class predictor must be distinct from the data used for developing it. It is almost always possible to develop a class predictor, even on completely random data, which will fit the training data almost perfectly but be completely useless for prediction with independent data [6].
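This pitfall is easy to demonstrate by simulation. The following sketch (a hypothetical illustration, not taken from the studies cited above) fits a linear classifier to pure noise with far more genes than samples; it separates the training data essentially perfectly, yet classifies independent data at chance level:

```python
# With p >> n, a classifier can separate even random training data
# perfectly, yet it is useless on independent data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n, p = 20, 5000                        # far more candidate genes than samples
X_train = rng.normal(size=(n, p))      # pure noise "expression profiles"
y_train = rng.integers(0, 2, size=n)   # random class labels

clf = SVC(kernel="linear").fit(X_train, y_train)
print("training accuracy:", clf.score(X_train, y_train))   # typically 1.0

X_test = rng.normal(size=(1000, p))    # independent data from the same noise
y_test = rng.integers(0, 2, size=1000)
print("independent accuracy:", clf.score(X_test, y_test))  # ~0.5, chance level
```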
A wide variety of algorithms have been used effectively with DNA microarray data for class prediction. Many predictive classifiers are based on linear discriminants of the form

$$ l(x) = \sum_{i \in G} w_i x_i \qquad (1) $$

where $x_i$ denotes the log-ratio or log-intensity for the i'th gene, $w_i$ is the weight given to that gene, and the summation is over the set G of genes selected for inclusion in the class predictor. For a two-class problem, there is a threshold value d, and a sample with expression profile defined by a vector x of values is predicted to be in class 1 or class 2 depending on whether l(x), as computed from equation (1), is less than the threshold d or greater than d, respectively.
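As a concrete sketch, equation (1) and its thresholding rule amount to only a few lines of code (the gene indices, weights, and threshold below are hypothetical illustration values):

```python
import numpy as np

def linear_discriminant(x, weights, genes, d):
    """Classify one expression profile x by the rule of equation (1):
    class 2 if l(x) = sum of w_i * x_i over the selected gene set G
    exceeds the threshold d, otherwise class 1."""
    l_x = sum(weights[i] * x[i] for i in genes)
    return 2 if l_x > d else 1

# Hypothetical example: three selected genes with fixed weights.
x = np.array([1.2, -0.3, 0.8, 2.1, -1.0])   # log-ratios for 5 genes
genes = [0, 2, 3]                            # the selected set G
weights = {0: 0.5, 2: -1.1, 3: 0.7}          # per-gene weights w_i
print(linear_discriminant(x, weights, genes, d=0.0))
```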
Linear discriminant classifiers differ with regard to how the weights are determined. The oldest form of linear discriminant is Fisher's linear discriminant. The weights are selected so that the mean value of l(x) in class 1 is maximally different from the mean value of l(x) in class 2; the squared difference in means divided by the pooled estimate of the within-class variance of l(x) was the specific measure used by Fisher. To compute these weights, one must estimate the correlation between all pairs of genes that were selected in the feature selection step. The study by Dudoit et al. [38] indicated that Fisher's linear discriminant analysis did not perform well unless the number of selected genes was small relative to the number of samples. The reason is that in other cases there are too many correlations to estimate, and the method tends to be unstable and to over-fit the data.
Diagonal linear discriminant analysis is a special form of Fisher linear discriminant analysis in which the correlation among genes is ignored. By ignoring such correlations, one avoids having to estimate many parameters and obtains a method which generally performs better when the number of samples is small. Golub's weighted voting method [39] and the Compound Covariate Predictor of Radmacher et al. [36] are similar to diagonal linear discriminant analysis and tend to perform well when the number of samples is small. They compute the weights based on the univariate prediction strength of individual genes and ignore correlations among the genes.
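A minimal sketch of diagonal linear discriminant analysis (the data and function names below are hypothetical) makes the simplification explicit: only per-gene means and a pooled per-gene variance are estimated, and gene-gene covariances never enter the computation:

```python
import numpy as np

def dlda_fit(X1, X2):
    """X1, X2: samples-by-genes matrices for class 1 and class 2."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # pooled within-class variance, one value per gene (no covariances)
    s2 = (((X1 - m1) ** 2).sum(axis=0) +
          ((X2 - m2) ** 2).sum(axis=0)) / (len(X1) + len(X2) - 2)
    w = (m1 - m2) / s2            # per-gene weights w_i of equation (1)
    d = w @ (m1 + m2) / 2         # threshold midway between the class means of l(x)
    return w, d

def dlda_predict(x, w, d):
    return 1 if w @ x > d else 2

# Hypothetical usage on synthetic two-class data with 5 genes.
rng = np.random.default_rng(0)
w, d = dlda_fit(rng.normal(1.0, 1, (10, 5)), rng.normal(0.0, 1, (12, 5)))
print(dlda_predict(rng.normal(1.0, 1, 5), w, d))   # usually prints 1
```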
Support vector machines are very popular in the machine learning literature. Although they sound very exotic, linear kernel support vector machines do class prediction using a predictor of the form of equation (1). The weights are determined by optimizing a mis-classification rate criterion with a penalty for large weights, which tends to prevent over-fitting [40]. Although there are more complex forms of support vector machines, they appear to be inferior to linear kernel SVMs for class prediction with large numbers of genes [41].
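In scikit-learn, for example, a linear kernel SVM of this kind takes only a few lines (the data below are synthetic placeholders, and C=1.0 is an arbitrary illustrative value for the penalty parameter trading off mis-classification against large weights):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X_train = rng.normal(size=(30, 50))       # 30 samples, 50 selected genes
y_train = rng.integers(1, 3, size=30)     # class labels 1 or 2

# linear kernel: the fitted model is exactly of the form of equation (1)
svm = SVC(kernel="linear", C=1.0).fit(X_train, y_train)
print(svm.coef_[0][:5])                   # the first few weights w_i
print(svm.intercept_)                     # minus the threshold d
print(svm.predict(X_train[:3]))
```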
In the study of Dudoit et al. [37], the simplest methods, diagonal linear discriminant analysis and nearest neighbor classification, performed as well as or better than the more complex methods. Nearest neighbor classification is based on a set G of genes selected to be informative for discriminating the classes and a distance function d(x, y) which measures the distance between the expression profiles x and y of two samples. The distance function utilizes only the genes in the selected set G. To classify a sample with expression profile y, compute d(x, y) for each sample x in the training set. The predicted class of y is the class of the sample in the training set which is closest to y with regard to the distance function d. The distance function is usually either the standard Euclidean distance or one minus the correlation between the expression profiles x and y.
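A sketch of one-nearest-neighbor classification with the "one minus the correlation" distance (the expression data below are random placeholders):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(2)
X_train = rng.normal(size=(30, 40))     # training profiles, selected genes only
y_train = rng.integers(1, 3, size=30)

# n_neighbors=1 gives nearest neighbor classification;
# metric="correlation" is the one-minus-correlation distance
knn = KNeighborsClassifier(n_neighbors=1, metric="correlation")
knn.fit(X_train, y_train)
print(knn.predict(rng.normal(size=(1, 40))))   # class of the closest training sample
```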
A variant of nearest neighbor classification is nearest centroid classification, in which the new expression profile y is compared to the mean expression profile of the training samples of each class. Those mean expression profiles are called centroids. The shrunken centroid method of Tibshirani et al. is a popular and effective form of nearest centroid classification which incorporates both automatic selection of the gene set G and adjustment of class-specific centroids to account for the bias of gene selection [42].
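scikit-learn's nearest centroid classifier offers a shrink_threshold option that implements a shrunken-centroid rule in the spirit of Tibshirani et al.'s method (the data below are synthetic, and 0.5 is an arbitrary illustrative threshold):

```python
import numpy as np
from sklearn.neighbors import NearestCentroid

rng = np.random.default_rng(3)
X_train = rng.normal(size=(30, 40))
y_train = rng.integers(1, 3, size=30)

# shrink_threshold > 0 shrinks each per-gene centroid component toward
# the overall centroid, which also drops uninformative genes from the rule
nc = NearestCentroid(shrink_threshold=0.5).fit(X_train, y_train)
print(nc.predict(X_train[:3]))   # class of the nearest shrunken centroid
```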
Dudoit et al. also studied some more complex methods such as classification trees and aggregated classification trees. These methods did not appear to perform any better than the simpler methods. Ben-Dor et al. [41] also compared several methods on several public datasets and found that nearest neighbor classification generally performed as well as or better than more complex methods.
The models described above are generally applied with the gene set G specified by identifying the genes that are differentially expressed among the classes when considered individually. For example, if there are two classes, one can compute a regularized t-test for each gene. The log-ratios or log-intensities are generally used as the basis of the statistical significance tests. The genes that are significantly differentially expressed at a specified significance level are selected for inclusion in the class predictor. The stringency of the significance level used controls the number of genes that are included in the model. If one wants a class predictor based on a small number of genes, the threshold significance level is made very small. Issues of multiple testing or false positives are not really relevant, however, because the objective is just to select features for inclusion in the model and the threshold significance level is just a “tuning parameter”. Some methods do not use p values at all but merely select the k most differentially expressed genes, and specify k arbitrarily or by optimizing using cross-validation.
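A sketch of this univariate selection step (using ordinary rather than regularized t-tests, on random placeholder data; the significance threshold alpha and the count k are arbitrary tuning values):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 5000))          # log-ratios: 30 samples, 5000 genes
y = rng.integers(1, 3, size=30)          # two class labels

# two-sample t-test per gene; alpha is just a tuning parameter that
# controls how many genes enter the selected set G
t, p = ttest_ind(X[y == 1], X[y == 2], axis=0)
alpha = 0.001
G = np.flatnonzero(p < alpha)
print(len(G), "genes selected")

# alternative: simply take the k most differentially expressed genes
k = 20
G_topk = np.argsort(p)[:k]
```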
Several authors have developed methods to identify optimal sets of genes which together provide good discrimination of the classes [43]. Some of these algorithms are very computationally intensive. Several independent evaluations have, however, indicated that the increased computational effort of these methods is not warranted [47].