We evaluated the calibration and refinement of the high-dimensional classifiers described in the previous section and explored the factors that affect these measures using simulation.

We generated our training and test data sets using multivariate normal gene expression with class-specific mean vectors and common covariance matrices. Let *x* denote a *p*-dimensional gene expression vector and *c* a 0-1 valued class variable. For the *i*th case, *c*_{i} is generated from the Bernoulli distribution with probability 0.5 and vector *x*_{i} in class *k* is generated from multivariate normal distribution with mean *μ*_{k} for *k* = 0,1 and common correlation matrix Σ. The simulations were performed with *p* = 1000 genes, the first *p*_{1} = 50 of which were differentially expressed between classes with mean difference 1. The remaining 950 genes had the same mean for each class.

For the common correlation matrix Σ, we used the following 3 structures:

where Σ

_{11} and Σ

_{22} are

*p*_{1}×

*p*_{1} and (

*p* −

*p*_{1})×(

*p* −

*p*_{1}) intra-class correlation matrices for

*p*_{1} informative genes and for

*p* −

*p*_{1} noninformative genes, respectively. Σ

_{12} is the

*p*_{1}×(

*p* −

*p*_{1}) correlation matrix between informative and noninformative genes and

*I*_{p − p1} is the (

*p* −

*p*_{1})×(

*p* −

*p*_{1}) identity matrix. For each structure, we compared the classifiers by imposing common pairwise correlations of either 0.25,0.50, or 0.75. The case of full independence was also studied. The number of training samples was either 30, 60, or 120. We generated 1000 simulated training samples for each combination of conditions.

PAM and the penalized logistic regression methods incorporate a tuning parameter that determines the amount of shrinkage. For the former, the class means of each variable are shrunken; for the latter, the regression coefficients are shrunken. For the BCC methods, the tuning parameter is the number of variables selected for inclusion in the linear projection. In classification problems, such tuning parameters are usually optimized to minimize a cross-validated estimate of prediction error. Since we are interested in probabilistic prediction, we maximized the cross-validated predictive likelihood. That is, the cross-validated predictive likelihood is computed for each of a grid of values of the tuning parameter and the value with the largest cross-validated predictive likelihood is selected. We used LOOCV (Molinaro

*and others*, 2005). The predictive likelihood of the training data can be written as

where

is computed in a training set omitting the

*i*th case and using tuning parameter

*λ*.

Before we summarize our results for the 1000 simulations for each condition, we first examine some results of a single simulation. Figure 1 shows results for a single test set with training sets of size *n* = 30,60, and 120 when correlations among informative genes are 0.25 (“Structure.1”). A probabilistic classifier of each type has been developed for each of the 3 training sets. A single true model was used to generate these 3 training sets, and we sampled 5000 test sample points (*x*_{i},*c*_{i}) for this model. For each test sample *x*_{i} vector, we compute the true probability that a sample with that *x*_{i} vector is from class 1 by employing Bayes theorem using the true class-specific densities used to generate the data. This true model is the Bayesian probabilistic classifier and these true posterior probability values are used for the vertical axes in . For each test sample *x*_{i}, we also compute the predicted probability of the case being from class 1 for each of the 6 predictive classifiers developed in the training set. These values are used for the horizontal axes.

Figure 1 shows that evaluating probabilistic classification is more complex than for deterministic classification. For example, 2 classifiers having similar misclassification error rate can produce completely different graphical display. In the figure, the misclassification rates of BCC when *n* = 30 and PLR L2 when *n* = 60 are both close to 0.19 but the corresponding scatter plots are quite different.

Figure 1 also shows that the rate of convergence to the true Bayesian classifier varies dramatically among classification methods. It is desirable that as the sample size increases, the predictions of a classifier converge to the predictions of the true Bayesian probabilistic classifier and thus lie along the 45-degree line. PAM and BCC type classifiers show clear convergence, while PLR L1 appears to converge slowly and PLR L2 does not converge to the Bayesian probabilistic classifier. The dependency condition for Figure 1 is “Structure.1” with common correlation 0.25 among informative genes. The penalized logistic models do not perform well in this rather weak dependency condition.

The calibration curves in (2.8) corresponding to are shown in . The calibration curves are smoothed with local regression with degree 1 and span 0.75 using R software loess package as in

Venables and Ripley (2002) (Chapter 12, p. 350). CS and RS are based equal-sized bins of the unit interval as indicated in Section 2.2. At the right bottom of each figure, CS, RS and misclassification rates are reported.

In Figure 2, we see that evaluating probabilistic classifiers based only on misclassification rate can be misleading. When *n* = 120, PLR L2 has better misclassification rate 0.17 than the 0.20 for PLR L1. However, the probabilistic predictions of PLR L2 are hardly informative because most of them are centered around 0.5 regardless of true probabilities; PLR L2 is poorly calibrated and too conservative. In this case, PLR L1 is preferred.

Although the smoothed versions of many calibration curves in Figure 2 lie close to the diagonal line of the Bayesian probabilistic classifier, Figure 1 shows that their predictions are not necessarily in agreement with the true Bayesian probabilistic classifier in this high-dimensional setting. Approximation to the Bayesian probabilistic classifier is a more stringent requirement than well-calibratedness or refinement.

Figure 2 also shows conservativeness and anticonservativeness of probabilistic classifiers. The PLR L2 is conservative as mentioned above. On the other hand, BCC for *n* = 30 shows anticonservativeness. Its predictions tend to be closer to either 0 or 1 than the true class probabilities. This anticonservativeness is somewhat reduced in BCCi.

Based on the 1000 replications, we plotted average CS, RS and misclassification rates in . For each of three structures and four correlations 0.0,0.25,0.5,0.75 for each structure, we compared the six classifiers for training sample size

*n* = 30. The table corresponding to Figure 3 is included in the

Supplementary Materials.

Overall, PLR L2 appears poorly calibrated, whereas PAM, BCCm, and BCCi appear the best calibrated among the probabilistic classifiers evaluated. PAM appears most refined. PLR L1 and PLR L2 appear less refined than the others except on “Structure.3” with strong correlations where their RS are substantially better than the others. With regard to misclassification rates, BCC type classifiers and PAM are similar. The superiority of PLR L1 on “Structure.3” with strong correlation 0.75 condition is considerable.

Misclassification rates of PLR L1 are considerably lower than the other classifiers on “Structure.3”. It seems that the stage-wise variable selection process (Efron *and others*, 2004) of PLR L1 enables the method to more efficiently exclude noise variables strongly correlated with informative genes than the other classifiers. However, it does not perform as well under other conditions with no correlations between informative and noninformative genes.

The original BCC generally has poorer CS and RS than BCCm and BCCi. PAM seems well calibrated and refined in most common correlation structures. However, the misclassification rate of PAM appears large for

*n* = 30 when there are strong correlations among genes. The poor performance of PAM with strong correlations is not surprising because PAM assumes diagonal covariance structure. This misclassification rate decreases as the training sample size increases. Tables and Figures for

*n* = 60,120 are included in the

supplementary material available at

*Biostatistics* online.