Constructing a confidence interval for the actual, conditional error rate of a prediction rule from multivariate data is problematic because this error rate is not a population parameter in the traditional sense—it is a functional of the training set. When the training set changes, so does this “parameter.” A valid method for constructing confidence intervals for the actual error rate was previously developed by McLachlan. However, McLachlan's method cannot be applied in many cancer research settings because it requires the number of samples to be much larger than the number of dimensions (n >> p), and it assumes that no dimension-reducing feature selection step is performed. Here, an alternative to McLachlan's method is presented that can be applied when p >> n, with an additional adjustment in the presence of feature selection. Coverage probabilities of the new method are shown to be nominal or conservative over a wide range of scenarios. The new method is relatively simple to implement and not computationally burdensome.
Recent advances in multiplex and high-throughput assays in cancer research have sparked growing interest in the development of classifiers from multivariate data. Such classifiers could lead to clinical tests that would provide doctors and patients with valuable information on prognosis or likely therapeutic response. Although there are many methods for obtaining valid point estimates of a classifier's accuracy, the number of methods for constructing a confidence bound for a predictor's accuracy (or error rate) remains limited. In this paper, a new method is presented for constructing a confidence bound for the actual (true) accuracy of a prediction rule, which is the accuracy a predictor developed on a data set will have when applied to future samples from the population. The new method overcomes the n >> p requirement of previous methods and can be applied in p >> n situations. A further adaptation of the method is developed for the setting where p >> n and there is also a dimension-reduction step.
First, a point of clarification seems needed. Confidence intervals or bounds are typically constructed for population parameters. In the classical formulation, a (1−α)100% confidence interval for a parameter θ based on data x satisfies P_θ(θ ∈ I(x)) ≥ 1 − α for every θ, where I(x) is an interval constructed from the observed data x (see, e.g. Rao, 1973). Here, the I(x) notation indicates that the constructed interval is a function of the data x. In words, one typically says that intervals constructed in this way will, in the long run, contain θ (1−α)100% of the time. But the actual, conditional error rate is not a population parameter like θ in this formulation because the value of the actual error rate depends on the samples selected for the training set. Therefore, the classical formulation is modified slightly, becoming P(ϵ(x) ∈ I(x)) ≥ 1 − α, where ϵ(x) is the actual error rate of the prediction rule developed from the training data x and the probability is taken over the random training set. For further discussion of the actual (true) error rate, the reader is referred to McLachlan (1992) or Efron and Tibshirani (1997).
Methods for constructing confidence intervals for population parameters are usually divided into approximate versus exact or parametric versus nonparametric. Approximate methods for interval construction utilize the approximate distribution of a test statistic (usually based on the central limit theorem) and are generally based on normal quantiles, whereas exact methods are based on quantiles of the exact distribution of the test statistic for specific parameter settings. Parametric methods assume that the data come from a specified parametric probability model. Confidence intervals can be based either on mathematical computations or on Monte Carlo; by generating many simulated data sets from the population distribution, quantiles of the test statistic distribution associated with particular parameter settings can be estimated, facilitating interval construction. Nonparametric methods assume only that the sample is random and representative of the population. The most common nonparametric method is the nonparametric bootstrap (Efron, 1985, 1987; Efron and Tibshirani, 1993), which estimates quantiles of the test statistic distribution by simulating data sets based on resampling of the original data rather than on an assumed probability model.
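As a concrete illustration of the nonparametric bootstrap mentioned above, the following sketch (function names are illustrative, not from the paper) computes a percentile-bootstrap interval for a genuine population parameter, the mean. The point of the contrast developed in the next paragraph is that this recipe applies to fixed parameters, not to training-set-dependent quantities like the actual error rate.

```python
import random

def bootstrap_percentile_ci(data, statistic, alpha=0.10, n_boot=2000, seed=0):
    """Percentile-bootstrap confidence interval for a population parameter.

    Resamples the data with replacement and takes empirical quantiles of the
    statistic's bootstrap distribution (Efron and Tibshirani, 1993).
    """
    rng = random.Random(seed)
    n = len(data)
    stats = sorted(
        statistic([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Example: 90% interval for the population mean of a small sample.
sample = [2.1, 1.9, 2.4, 2.2, 1.8, 2.0, 2.3, 2.5, 1.7, 2.2]
mean = lambda xs: sum(xs) / len(xs)
low, high = bootstrap_percentile_ci(sample, mean)
```

Here the target of inference (the population mean) does not change from resample to resample, which is exactly the property the actual error rate lacks.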
When the quantity of interest is not a population parameter, such as the actual error rate, then many of the methods described in the last paragraph cannot be utilized. For example, suppose one assumes a parametric model and one wants an interval for the actual error rate. Then, a pure Monte Carlo approach would fix settings of the parameters and generate many Monte Carlo simulation data sets to obtain the corresponding distribution of the test statistic given those parameter values. But note that this approach “will not work.” The problem is that, because the actual error rate is a function of the training data, every time a new data set is generated, even though the parameters are fixed, “the quantity of interest changes.” Therefore, simple Monte Carlo results will not produce the distribution of the test statistic conditional on the “quantity of interest” (the actual error rate), because simple Monte Carlo conditions on the wrong thing—namely, some underlying population parameters that are most likely nuisance parameters. Thus, even in the relatively simple case of an assumed parametric probability model, it is not straightforward to construct an interval. Not surprisingly, the situation gets even more complex in the nonparametric setting. Although obtaining a point estimate for the actual error rate is not problematic (e.g. Efron and Tibshirani, 1997), using resampling methods to obtain a valid nonparametric confidence interval for the actual error rate is.
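The difficulty just described can be made concrete with a small simulation. The sketch below assumes the two-class homoscedastic normal model of Section 3 with class means ±μ, identity covariance, and a mean-difference predictor with cut point zero (all settings illustrative): even with the population parameters held fixed, each generated training set yields a different actual error rate, so the "quantity of interest" changes from data set to data set.

```python
import math, random

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def actual_error(L, mu):
    # For classes N(mu, I) and N(-mu, I) and the rule sign(L . x),
    # the actual (conditional) error rate is Phi(-L.mu / |L|).
    dot = sum(a * b for a, b in zip(L, mu))
    norm = math.sqrt(sum(a * a for a in L))
    return phi(-dot / norm)

rng = random.Random(1)
p, n_per_class = 10, 15
mu = [0.5] + [0.0] * (p - 1)           # fixed population parameters

errors = []
for _ in range(5):                      # five training sets, same parameters
    x1 = [[m + rng.gauss(0, 1) for m in mu] for _ in range(n_per_class)]
    x2 = [[-m + rng.gauss(0, 1) for m in mu] for _ in range(n_per_class)]
    col_mean = lambda rows: [sum(c) / len(c) for c in zip(*rows)]
    L = [a - b for a, b in zip(col_mean(x1), col_mean(x2))]
    errors.append(actual_error(L, mu))
# errors differ across training sets even though mu and Sigma never changed
```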
McLachlan (1975, 1992) produced a method for obtaining a confidence interval for the actual error rate. The method is parametric and approximate as described above, being based on the multivariate normal model and estimating test statistic quantiles using normal quantiles and an estimate of the standard deviation. The estimation is based on the empirical Mahalanobis distance between the class means, D² = (x̄₁ − x̄₂)′S⁻¹(x̄₁ − x̄₂) (e.g. Devroye and others, 1991, p. 30), where x̄ᵢ is the vector of feature means for samples from class i and S⁻¹ is the inverse sample covariance matrix. The Mahalanobis distance can only be estimated well when n >> p, so McLachlan's method is restricted to this setting. The method presented in this paper replaces McLachlan's statistic, the sample Mahalanobis distance, with the cross-validated accuracy and then uses exact methods to construct the interval. The use of exact methods is made possible by first identifying a subspace associated with a particular error rate, obtaining the conditional distribution of a linear predictor over that subspace, and then the conditional distribution of sample sets of size n given that point on the subspace; then Monte Carlo methods can be applied to estimate required quantiles. The new method therefore does not require inversion of the sample covariance matrix and can be used in the setting of high-dimensional data. Also importantly, the methodology is extended to the setting of high-dimensional data with a dimension-reducing feature selection step. Feature selection can have important effects on cross validation (see, e.g. Ambroise and McLachlan, 2002; Simon and others, 2003), and ignoring these effects will lead to inadequate statistical procedures, so this extension is critical.
There are many types of bounds for classifiers in the pattern recognition literature. The Vapnik–Chervonenkis theory (e.g. Devroye and others, 1991) addresses minimization of the empirical risk of a classifier. The Chernoff and Bhattacharyya bounds (e.g. Duda and others, 2001; Fukunaga, 1990) provide probability bounds for the expected error rate. Rogers and Wagner (Devroye and others, 1991) provide an upper bound on the difference between the actual error rate and the cross-validated error rate. Although these bounds can give important insights into the classification problem, there is no straightforward way to extend them to produce statistical confidence intervals (other than 100% confidence level intervals).
In the particular applied setting of microarray data, Michiels and others (2005) used a resampling procedure to construct error rate confidence intervals on a number of microarray data sets. A comparison of bootstrap methods, together with a new bootstrap approach, on microarray data was presented in Jiang and others (2008). Xu and others (2006) examined the joint distribution of the estimated and true error rates for a variety of feature-label distributions and classification and estimation rules. Confidence interval construction is closely related to sample size determination, and sample size methods for classifier development in microarray studies were presented in Fu and others (2005), Mukherjee and others (2003), and Dobbin and Simon (2007).
This paper develops a novel approach to using cross-validated estimates to construct a confidence interval for prediction error. The general approach is described in Section 2. Section 3 presents the mathematical development for the normal homoscedastic model, the extension to the case of gene selection, and related discussion. Section 4 presents the results of Monte Carlo simulations assessing coverage probabilities and of applications to several high-dimensional microarray data sets. Section 5 briefly summarizes the conclusions.
Suppose a population can be partitioned into 2 classes, C1 and C2. A collection of samples from the population will be used to develop a classification function that produces predictions based on a vector of observations x on an individual. The parametric model stipulates the distribution in each class. Also suppose that a set of n samples has been studied and a specific predictor developed. The resulting predictive function classifies any sample into either C1 or C2 based on the observation x.
Suppose a k-fold cross validation is used to estimate the actual error rate of the predictor. Call this cross-validated error rate estimate ϵ̂. A (1 − α)100% upper confidence bound on the actual error rate of the predictor is

ϵ_Upperbound = sup{ ϵ : Ĝ_ϵ(ϵ̂) ≥ α },
where Ĝ_ϵ is the estimated cumulative distribution function of the cross-validated error rate for the fixed true error rate ϵ. This is the usual exact approach to confidence bound or interval construction (see, e.g. Stuart and others, 1999). If one knew the true distribution function G_ϵ of the cross-validated error rates for a fixed actual error rate, then the confidence interval would be truly “exact.” In practical implementation with an estimated Ĝ_ϵ, the value of ϵ_Upperbound is computationally burdensome to estimate precisely, but in real-world settings a rough 2-digit approximation is usually adequate, and this level of precision can be obtained easily. To simplify presentation, discussion will focus on leave-one-out cross validation (LOOCV) for the rest of this paper.
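The inversion idea behind this exact construction can be sketched as follows. Purely for illustration, a binomial CDF stands in for the estimated distribution function of the cross-validated error rate (the paper's method replaces it with a Monte Carlo EDF of LOOCV error rates); all names are hypothetical.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); a stand-in for G_eps."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def upper_bound(observed_errors, n, alpha=0.05, grid=200):
    """Largest true error rate eps whose sampling distribution still puts
    probability >= alpha at or below the observed LOOCV error count."""
    best = 0.0
    for j in range(1, grid):
        eps = j / grid
        # the CDF at the observed count decreases as eps grows
        if binom_cdf(observed_errors, n, eps) >= alpha:
            best = eps
    return best

# 3 LOOCV errors out of n = 40 samples; a rough two-digit upper bound
ub = upper_bound(3, 40, alpha=0.05)
```

With the binomial stand-in this reduces to the familiar exact (Clopper–Pearson-type) one-sided bound; the paper's contribution is replacing the binomial with a distribution that correctly conditions on the actual error rate.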
The multivariate normal homoscedastic model for the data vectors is

f(x) = p₁ φ(x; μ, Σ) + (1 − p₁) φ(x; −μ, Σ),

where φ(x; μ, Σ) is the multivariate normal density with mean vector μ and covariance matrix Σ and p₁ is the proportion in the population from class C1. To simplify presentation, the population is centered around zero, so that the class means are ±μ. To further simplify presentation, it will be assumed that p₁ = 1/2 and that n/2 samples are sampled at random from each class for the training set.
In Appendix A, it is shown that a linear predictor L with actual error rate ϵ must satisfy the equation

ϵ = Φ( −L·μ / √(L′ΣL) ),

where Φ is the cumulative standard normal distribution function. The subspace of such predictors is denoted S_ϵ. This result is a natural extension of results from Dobbin and Simon (2007). If the covariance matrix Σ is known and nonsingular, then a data transformation can produce an identity covariance matrix I_p, where I_p denotes a p by p identity matrix. It is shown in Appendix B that when Σ = I_p, S_ϵ is the cone of points that form the angle

θ_ϵ = cos⁻¹( Φ⁻¹(1 − ϵ) / |μ| )

with the vector μ. This identifies the solution subspace in the p-dimensional space of L. The solution space is diagrammed in Figure 1.
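The cone characterization can be checked numerically. A minimal sketch, assuming Σ = I, class means ±μ, cut point zero, and |μ| = 2 (all illustrative choices, not values from the paper): a unit-length predictor at the cone angle from μ recovers the target error rate.

```python
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def phi_inv(p, lo=-10.0, hi=10.0):
    """Inverse standard normal CDF by bisection."""
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Sigma = I, mu along the first axis, target error rate eps.
mu_len, eps = 2.0, 0.10
theta = math.acos(phi_inv(1 - eps) / mu_len)    # cone angle for S_eps

# Any L at angle theta from mu has actual error Phi(-|mu| cos theta).
L = [math.cos(theta), math.sin(theta)]           # unit vector in a 2D slice
mu = [mu_len, 0.0]
dot = sum(a * b for a, b in zip(L, mu))
err = phi(-dot)                                   # |L| = 1 here
# err recovers the target eps up to numerical tolerance
```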
Fig. 1. Geometry of the solution space.
For mathematical simplicity, this paper will develop the model using the predictor

L = x̄₁ − x̄₂,

where x̄ᵢ is the vector of means for class Cᵢ. The joint distribution of the sample and L, conditional on μ and Σ, can then be used to calculate the marginal distribution of L given ϵ and the conditional distribution of the sample given L.
Given μ, fixing the actual error rate ϵ is equivalent to restricting the distribution of L to the subspace S_ϵ. However, μ is unknown and must be estimated. Moreover, the larger |μ|, the smaller the expected error rate. The actual error rate will center around the expected error rate with some variation, and if n is reasonably large, this variation will likely be reasonably small. Therefore, a reasonable approximation for |μ| can be obtained from the relationship between |μ| and the expected error rate. For a range of values of |μ|, Monte Carlo is used to estimate the average error rate corresponding to each value of |μ|. This results in a function g(|μ|). Then, g⁻¹ is used to estimate an approximate |μ| for that ϵ. Alternatively, the approach of Moran (1975) can be used to construct g.
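The Monte Carlo tabulation of the relationship between |μ| and the average error rate can be sketched as follows (settings hypothetical: the mean-difference predictor under the identity-covariance model with class means ±μ). The tabulated function is monotone decreasing, so it can be inverted by lookup for a target error rate.

```python
import math, random

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def avg_error_for_mu_len(mu_len, p=20, n_per_class=15, reps=200, seed=2):
    """Monte Carlo estimate of g(|mu|): the expected actual error rate of the
    mean-difference predictor when the class means are +/- mu and Sigma = I."""
    rng = random.Random(seed)
    mu = [mu_len] + [0.0] * (p - 1)
    total = 0.0
    for _ in range(reps):
        x1 = [[m + rng.gauss(0, 1) for m in mu] for _ in range(n_per_class)]
        x2 = [[-m + rng.gauss(0, 1) for m in mu] for _ in range(n_per_class)]
        col_mean = lambda rows: [sum(c) / len(c) for c in zip(*rows)]
        L = [a - b for a, b in zip(col_mean(x1), col_mean(x2))]
        dot = sum(a * b for a, b in zip(L, mu))
        norm = math.sqrt(sum(a * a for a in L))
        total += phi(-dot / norm)   # actual error of this training set's L
    return total / reps

# Tabulate g over a grid of |mu| values; invert by lookup to approximate
# the |mu| corresponding to a target average error rate.
grid = [0.4, 0.8, 1.2, 1.6]
g = {m: avg_error_for_mu_len(m) for m in grid}
```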
Define a scalar μ_ϵ = g⁻¹(ϵ). In Appendix C, it is shown using the delta method and the properties of the noncentral chi-squared distribution that the conditional mean and variance of the length |L| for a given ϵ can be approximated by

E[|L| | ϵ] ≈ 2 √( μ_ϵ² + p/n )  and  Var[|L| | ϵ] ≈ ( 2p/n² + 4μ_ϵ²/n ) / ( μ_ϵ² + p/n ).
These formulas can be used to generate samples of |L| with the appropriate mean and variance. The generation can be simplified by using a normal distribution and exploiting symmetry by restricting simulation to a single ray of the cone S_ϵ. Finally, note that it is important to estimate the distribution over S_ϵ adequately because the relation between the actual error rate of the predictor and the LOOCV error rate will be affected by where on the subspace the predictor L falls. In particular, the LOOCV error rate estimate will not be nearly unbiased for all points on the surface S_ϵ. This issue is discussed further below in Section 4.
Section 3.2 provides a method for generating L given ϵ. In other words, the method generates data from the marginal distribution of L given ϵ, utilizing the joint distribution of L and the sample given μ. The conditional distribution of the sample given L is derived in Appendix D and shown to be as follows:
Here, 1_{m×n} indicates a matrix of 1s with m rows and n columns, and ⊗ indicates the Kronecker matrix product (Hocking, 1992, p. 672). This formula is used to generate samples conditional on L. In simulations, Σ = I_p, as discussed previously. In some high-dimensional cases, generating samples with this covariance structure may be computationally demanding, and it may be adequate to drop the final variance term, which should be small for large n, as has been done in the high-dimensional examples in Section 4.
When estimating coverage probabilities of confidence intervals for the actual error rate, the predictor consists of both L and a cutoff value or cut point to be used for classification. The cut point is usually chosen empirically from the data. Under the compound covariate prediction (CCP) used in this paper, the cut point is k = (L̄₁ + L̄₂)/2, where L̄ᵢ is the mean prediction score for samples from class i. Suppose k is the cut point used for classification. Then, it is easy to show that the actual accuracy for a linear predictor L with this cutoff is

(1/2) Φ( (L·μ − k) / √(L′ΣL) ) + (1/2) Φ( (k + L·μ) / √(L′ΣL) ).
The mathematical development is based on simplifying assumptions. In non-simulation settings, the covariance matrix is not known and needs to be estimated. If the sample size is large and the dimension is small, then a good estimate of the covariance matrix is possible and the new procedure can be used by substituting the estimate S for Σ. If the dimension is large, then estimation of the covariance matrix will generally not be feasible. A common approach in high dimensions is to assume the covariance matrix is diagonal (Dudoit and others, 2002) and proceed under this assumption using a standard predictor such as CCP (Radmacher and others, 2002).
When assessing coverage probabilities for the unknown covariance Gaussian case, the predictor L described above will be replaced with the predictor

L_u(x) = Σᵢ tᵢ xᵢ,

where tᵢ = (x̄_{1,i} − x̄_{2,i}) / √( S²_{1,i}/n₁ + S²_{2,i}/n₂ ) is the two-sample t-statistic for feature i. Here, x̄_{1,i} and S²_{1,i} are the sample mean and variance in class C1 for feature i, respectively. It is not clear a priori what impact estimating the covariance matrix elements will have on coverage probabilities. Monte Carlo will be used to assess the impact.
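A compound covariate predictor with estimated variances can be sketched as follows. The exact weight and variance-pooling conventions of the paper's (3.5) may differ; Welch-type t-statistic weights are used here purely as an illustration, and the data are toy values.

```python
import math

def ccp_weights(class1, class2):
    """Compound covariate weights: one two-sample t-statistic per feature.
    (A sketch; the paper's exact pooling convention may differ.)"""
    n1, n2 = len(class1), len(class2)
    w = []
    for i in range(len(class1[0])):
        a = [row[i] for row in class1]
        b = [row[i] for row in class2]
        m1, m2 = sum(a) / n1, sum(b) / n2
        v1 = sum((x - m1) ** 2 for x in a) / (n1 - 1)
        v2 = sum((x - m2) ** 2 for x in b) / (n2 - 1)
        w.append((m1 - m2) / math.sqrt(v1 / n1 + v2 / n2))
    return w

def ccp_score(w, x):
    """Linear prediction score for one sample."""
    return sum(wi * xi for wi, xi in zip(w, x))

# Classify a new sample against the midpoint of the mean class scores.
class1 = [[1.2, 0.1], [0.8, -0.2], [1.1, 0.3], [0.9, 0.0]]
class2 = [[-1.0, 0.2], [-0.7, -0.1], [-1.3, 0.1], [-0.9, -0.3]]
w = ccp_weights(class1, class2)
mean_score = lambda rows: sum(ccp_score(w, r) for r in rows) / len(rows)
k = 0.5 * (mean_score(class1) + mean_score(class2))   # empirical cut point
label = 1 if ccp_score(w, [1.0, 0.0]) > k else 2
```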
In high dimensions, it is common to select features for inclusion in the predictor. Suppose the predictor development algorithm stipulates that q features will be selected from the full set of p features, with q < p. These “selected” features may be either linear combinations of the existing features resulting from dimension-reduction methods, such as principal components, or they may be a simple subset of the existing features, such as a list of q genes involved in a particular gene ontology category or the q genes with the smallest p-values. Denote the Mahalanobis distance between the class means in the original space by Δ_p and in the reduced q-dimensional space by Δ_q. Then, it is easy to see that Δ_q ≤ Δ_p. This implies that the optimal performance (often called the Bayes error rate) of a predictor is always as good or better in the full dimensional space than in the reduced dimensional space. However, the expected performance associated with a predictor developed on a finite sample size may, on the contrary, be better in the reduced dimensional space.
The geometry in the reduced dimensional space is affected by dimension reduction. For example, μ may be shorter after feature selection, reflecting missed informative features. In LOOCV with feature selection, the geometry varies during each LOOCV step. Let a₀ be the actual accuracy and q₀ the dimension of the predictor developed on the full data set. Similarly, let a₁, …, a_n and q₁, …, q_n be the accuracies and dimensions of the predictors developed during LOOCV. For n reasonably large, aᵢ ≈ a₀ since removing a single sample from the training set will not have a large impact on the form of the predictor. Similarly, qᵢ ≈ q₀, at least to an order of magnitude and depending on how the dimension reduction is done. In both cases, there may be some downward bias for the smaller sample size, but this should be small as long as n is not too small.
The previous method for estimating μ_ϵ needs to be revised in the context of gene selection. Recall that the previous method was based on a Monte Carlo estimation of the functional relation g between |μ| and the expected error rate. Now, let g_s denote the corresponding function for a modified algorithm with a gene selection step. This function can be estimated easily by Monte Carlo in the same way g was estimated, except that the Monte Carlo will include the gene selection step. When this was done, it was found generally that g_s ≥ g when μ is in the reduced dimensional subspace—in other words, the expected error rate associated with the same μ was higher with gene selection than without gene selection. This is intuitively the right direction since there is some cost associated with looking through a large number of genes p to get down to a selected set q which has to be paid in the g_s case but not in the g case; hence, the former requires larger differences between the populations to achieve the same accuracy.
What really matters to the current methodology is how the switch from g to g_s affects corresponding percentiles of the LOOCV accuracy distribution—that is, the percentiles on which confidence bounds are based. Note that the methodology uses Monte Carlo and (3.4) to estimate g_s, and thus μ_ϵ, and then generates many data sets over S_ϵ. The empirical distribution function (EDF) of the LOOCV error rates of these data sets serves as Ĝ_ϵ in Section 3, and quantiles of this EDF correspond to confidence bounds. Figure 2 shows one example of the impact of gene selection. Gene selection shifts LOOCV accuracy percentiles upward toward 1, indicating that failure to perform this adjustment will result in anti-conservative coverage probabilities.
Fig. 2. Comparison of estimated 90th percentiles of the distribution of LOOCV accuracies using the naive binomial method versus the new method without gene selection versus the new method with gene selection. X-axis is the true accuracy of the predictor. Quantiles based on 1000 ...
These considerations lead to the following approach: in the presence of feature selection, apply the method with the modifications that (1) when estimating μ_ϵ, use g_s in (3.2) and (3.3) and (2) when estimating Ĝ_ϵ, the dimension should match the dimension of the selected feature space of the predictor developed on the full data set of n samples.
Table 1 shows the coverage probabilities for the actual accuracy of the compound covariate predictor, L_u, in low-dimensional settings. The naive binomial method computes an exact bound based on the observed LOOCV accuracy using the exact binomial method implemented in “R” version 2.6.2. Comparing the new method to the binomial, note that the new method is overall more conservative and provides coverage probabilities that are nominal or conservative. The binomial does adequately for the most part in these settings but does display some of the expected anti-conservative bias in some of the scenarios shown. McLachlan's method is also shown for comparison and performs similarly to the new method except as the dimension approaches the sample size. There, the new method maintains the correct coverage whereas McLachlan's is extremely anti-conservative. Even at more moderate dimensions, McLachlan's method shows some anti-conservative tendency.
Table 1. Coverage probabilities for the compound covariate predictor L_u of (3.5) in low dimensions. |μ| = 1 for all entries. Coverage probabilities estimated from 1000 Monte Carlo simulations.
Table 2 shows the coverage probabilities in high-dimensional settings with p = 500 dimensions. In this case, the comparison is only to the naive binomial method because McLachlan's method cannot be used when the sample covariance matrix is not invertible. The new method tends to be more conservative than the binomial and better maintains the nominal coverage probabilities throughout. Indeed, the binomial coverage probabilities are mostly below the nominal level.
Table 2. Coverage probabilities for the compound covariate predictor L_u of (3.5) in high dimensions. Dimension p = 500 for all entries. Coverage probabilities estimated from 1000 Monte Carlo repeats for n = 60 and 200 Monte ...
Table 3 shows the results from high dimensions with gene selection. Gene selection appears to negatively impact the coverage probabilities of the naive binomial method, which are here badly anti-conservative. This is probably a result of the fact that the gene selection step of choosing only genes with large differences between the classes will tend to result in “longer” predictors—and as was seen above, the relation between the true error rate and the LOOCV error rate is mediated through the predictor's length. The simple binomial method makes no correction for this fact. However, the new approach does correct for it, and one can see this reflected in the coverage probabilities for the new method, which are all at or above the nominal level. Also presented for comparison is the bootstrap approach of Jiang and others (2008), which performs similarly to the new method. Therefore, the new method provides conservative intervals in the setting of gene selection.
Table 3. Coverage probabilities of L_u in high dimensions with gene selection. In each case, the 12 genes with the most significant p-values from pooled-variance t-tests were selected for use in the predictor. Data are generated with DEG number of informative ...
Table 4 shows the application of the method to several real data sets, along with comparison to the simple binomial method on the same data sets. In each case, there is considerable difference between confidence bounds based on the new method compared to the binomial method. Based on the simulations, this suggests that the binomial intervals for these data sets may be inadequate.
This paper presented a new method for constructing confidence bounds for the actual error rate of a predictor. The new method was compared to the naive binomial approach and to 2 previously published methods. The new method appears to obtain nominal coverage probabilities in low dimensions, high dimensions, and high dimensions with gene selection, whereas alternative methods sometimes fail to achieve nominal coverage. In high dimensions with gene selection, the new method performed similarly to the bootstrap approach of Jiang and others (2008). Interestingly, the naive binomial method performed particularly poorly in the setting of high dimensions with gene selection, indicating the method is ill-suited to this setting which is most common in microarray experiments. In contrast, the new method is able to capture the effect of gene selection by a relatively simple adjustment, resulting in nominal or slightly conservative coverage probabilities. The methods were also applied to several real microarray data sets.
Another interesting result from this work is that it provides insight into why the feature selection step results in particularly anti-conservative naive binomial coverage probabilities. The feature selection step will generally impact the average length of a linear predictor. As a result, the distribution of the linear predictor over the subspace corresponding to a fixed true error rate (the cone in Figure 1) is different if feature selection has occurred. The predictor is longer in the gene selection case. But, most importantly, the Monte Carlo investigations showed that the relation between the actual error rate of a predictor and the LOOCV error rate of a predictor is mediated through the length of the predictor. That is, for a fixed actual error rate (e.g. a point on the cone in Figure 1), say of 5%, longer predictors will on average result in LOOCV error rates below 5% and shorter predictors will on average result in LOOCV error rates above 5%. This inflates the anti-conservative tendency of the naive binomial intervals.
Robustness of methodologies is a common concern, particularly in microarray studies. The multivariate normality assumption may be violated in practice. As a result, it makes sense to apply several methodologies to a data set to assess robustness of results. For the specific context of constructing a confidence interval for prediction error, one could apply both the parametric method presented here and the bootstrap method of Jiang and others (2008) to get some assurance of the robustness of the interval result. Also note that the simulations in this paper focused on n reasonably large in absolute terms. The performance of the method when n is small in absolute terms has not been studied extensively, and application may require further modifications (e.g. second-order Taylor series approximations in (3.2) and (3.3)).
This paper has made some additional simplifying assumptions, such as that the prevalence of each class in the population is the same, that there are only 2 classes, that the covariance matrix is the same within each class, and that the population is centered around zero. If the prevalences are not equal, then (3.4) should be modified so that the fractions reflect the true prevalences p₁ and p₂ = 1 − p₁, and the Monte Carlo procedures should be modified throughout so that the data sets generated reflect the population prevalence. If there are more than 2 classes, then the geometric approach presented here becomes more complex, and it may not be feasible to adapt this approach to that setting. If the diagonal elements of the covariance matrix differed by class in high dimensions, then CCP and the corresponding gene selection would need to be adjusted accordingly. Once a substitute for CCP was established, coverage probabilities would need to be evaluated for the new predictor. Finally, note that the assumption of centering around zero is only made to simplify the mathematical presentation; such centering is not critical to either the basic mathematical results or the CCP method coverage assessments.
This paper has focused on the predictor development method known as CCP, with or without feature selection. The method can be adapted to diagonal linear discriminant analysis (DLDA) by substituting DLDA predictors for CCP predictors in the Monte Carlo generation of the LOOCV error rates used to estimate percentiles of the LOOCV distribution. Coverage probabilities would need to be reassessed. Adaptation to other linear predictor development algorithms may be possible but would require more extensive modifications.
Finally, note that the computational cost required by this approach is higher than that of McLachlan's approach but still reasonable. McLachlan's approach is a function of the sample Mahalanobis distance. The new approach requires some preliminary Monte Carlo work to estimate g or g_s, the functional relation between the Mahalanobis distance and the average error rate. In R, each step of this search took 40.860 s for 50 Monte Carlo simulations in the dimension-reduction scenario of Table 3. Then, a simple search algorithm with another Monte Carlo is used to find the true accuracy which places the corresponding percentile of the LOOCV accuracy just to one side of the observed LOOCV accuracy. The C++ execution time for the Monte Carlo program was 60.497 s to perform 200 simulations in the same scenario. This search can start at the percentile corresponding to the binomial confidence bound and should then converge in a handful of steps assuming high precision is not required.
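The percentile search just described can be sketched as a short bisection. The toy quantile function below is a stand-in for the Monte Carlo estimate of the relevant LOOCV-accuracy percentile at a given true accuracy; all names and values are illustrative.

```python
def confidence_bound(observed_acc, loocv_quantile, lo=0.5, hi=1.0, steps=7):
    """Bisection for the smallest true accuracy whose relevant percentile of
    the LOOCV-accuracy distribution reaches the observed LOOCV accuracy.
    `loocv_quantile(a)` stands in for a Monte Carlo estimate of that
    percentile at true accuracy a and is assumed monotone increasing in a."""
    for _ in range(steps):
        mid = (lo + hi) / 2.0
        if loocv_quantile(mid) >= observed_acc:
            hi = mid        # accuracies this large are consistent with the data
        else:
            lo = mid        # accuracies this small are not
    return hi               # conservative end of the final bracket

# Toy stand-in: pretend the percentile sits 0.05 below the true accuracy.
toy_quantile = lambda a: a - 0.05
bound = confidence_bound(0.85, toy_quantile)
```

With a monotone quantile function, seven bisection steps localize the bound to well under a percentage point, consistent with the remark that only a rough two-digit answer is needed.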
All compound covariate analyses of actual microarray data sets were performed using Biometric Research Branch ArrayTools developed by Dr Richard Simon and Amy Peng Lam. Real data sets were downloaded from the Biometric Research Branch-ArrayTools Data Archive for Human Cancer Gene Expression (http://linus.nci.gov/~brb/DataArchive_New.html). All other calculations were carried out using programming languages R and C++ with Optivec (http://www.optivec.com) shareware version 5.0. Lisa McShane is acknowledged for helpful suggestions regarding the simulations during the manuscript development. Two reviewers are acknowledged for comments that improved the manuscript. Conflict of Interest: None declared.
For the predictor to have accuracy 1 − ϵ on the population, it must be as follows:

P( L·x > 0 | x from C1 ) = P( L·x < 0 | x from C2 ) = 1 − ϵ.
So any classifier L with actual error rate ϵ must satisfy

Φ( −L·μ / √(L′ΣL) ) = ϵ.
Now, the dot product of L and μ is L·μ = |L||μ| cos θ, where θ is the angle between μ and L. Hence, when Σ = I_p,

Φ( −|μ| cos θ ) = ϵ, that is, cos θ = Φ⁻¹(1 − ϵ)/|μ|.
Without loss of generality, suppose μ has the form μ = (μ_ϵ, 0, …, 0)′, where μ_ϵ = |μ|. Then,
Now, L ~ N(2μ, (4/n) I_p). Also, (n/4)|L|² is noncentral chi squared with p degrees of freedom and non-centrality parameter nμ_ϵ². Therefore, E[|L|²] = 4μ_ϵ² + 4p/n and Var(|L|²) = 32p/n² + 64μ_ϵ²/n. Hence,
By the delta method, E[|L|] ≈ √(E[|L|²]) and Var(|L|) ≈ Var(|L|²) / (4 E[|L|²]), yielding (3.2) and (3.3).
We use the Kronecker product notation (Hocking, 1992), defined so that A ⊗ B is the block matrix formed by replacing each element a_{ij} of A by the matrix a_{ij}B. Under the model, the joint distribution of the sample and L can be written as follows:
Conditioning on L then leads to the conditional distribution of the sample given in Section 3.3.