Developments in whole genome biotechnology have stimulated statistical focus on development of methodology for predictive medicine in settings where the number of candidate variables is large relative to the number of cases. For applications in oncology, there is often interest in classifying patients into survival risk groups. Measures of discrimination for survival risk models include separation of survival curves, time-dependent ROC curves and Harrell's concordance index [26]. For high-dimensional data applications, however, computing these measures as re-substitution statistics on the same data used for model development results in highly biased estimates. Most developments in methodology for survival risk modeling with high-dimensional data have utilized separate test data sets for model evaluation. For example, Li and Gui [10] utilized the time-dependent ROC curve for survival modeling in the context of a separate test set. Cross-validation has sometimes been used for optimization of tuning parameters but rarely for the evaluation of survival risk models. An exception is the study by van Houwelingen *et al*. [9] that used cross-validation to evaluate *L*_{2} penalized proportional hazards survival risk models. In many applications, however, the available data are too limited for effective division into training and test sets, and consequently authors have often either reported re-substitution statistics or analyzed their data using binary classification methods in order to utilize familiar cross-validation. In this article we have tried to indicate how to utilize cross-validation for the evaluation of survival risk models; specifically, how to compute cross-validated estimates of survival distributions for predicted risk groups and how to compute cross-validated time-dependent ROC curves. We have also discussed evaluation of the statistical significance of a survival risk model and evaluation of whether high-dimensional genomic data add to the predictiveness of a model based on standard covariates.
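The key requirement of proper cross-validation, repeated below, is that every step of model development, including gene selection, is redone from scratch inside each fold. The following is a minimal sketch in Python; the univariate correlation screen and sum-of-selected-features risk score are toy stand-ins for a real procedure such as a penalized Cox fit, and the function names are hypothetical, not from the article or any package.

```python
import random

def select_features(X, times, events, k=5):
    """Rank features by a crude univariate association with survival:
    absolute correlation between feature values and observed times,
    computed among patients with events. Illustrative only."""
    def score(j):
        vals = [x[j] for x, e in zip(X, events) if e == 1]
        ts = [t for t, e in zip(times, events) if e == 1]
        n = len(ts)
        if n < 2:
            return 0.0
        mv, mt = sum(vals) / n, sum(ts) / n
        cov = sum((v - mv) * (t - mt) for v, t in zip(vals, ts))
        sv = sum((v - mv) ** 2 for v in vals) ** 0.5
        st = sum((t - mt) ** 2 for t in ts) ** 0.5
        return abs(cov / (sv * st)) if sv > 0 and st > 0 else 0.0
    p = len(X[0])
    return sorted(range(p), key=score, reverse=True)[:k]

def cross_validated_risk_scores(X, times, events, n_folds=5, seed=0):
    """Return a cross-validated risk score for every patient.
    Feature selection is repeated from scratch inside each fold,
    so no patient's own data influence the model that scores them."""
    n = len(X)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    scores = [0.0] * n
    for test in folds:
        train = [i for i in idx if i not in test]
        feats = select_features([X[i] for i in train],
                                [times[i] for i in train],
                                [events[i] for i in train])
        for i in test:
            # Toy scoring rule: sum of the selected features.
            scores[i] = sum(X[i][j] for j in feats)
    return scores
```

The cross-validated scores can then be split at, for example, the median to form risk groups whose Kaplan–Meier curves are unbiased by the model-building process.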

In this article we have emphasized proper evaluation of models for classifying patients based on survival risk. Using cross-validated time-dependent ROC curves, these methods can be evaluated without grouping patients into fixed risk groups. Schumacher *et al*. [27] have developed methods for the evaluation of models for prediction of survival functions of individual patients. For each time *t*, the Brier score for patient *i* is [*Y*_{i}(*t*) − *r*(*t*, *x*_{i})]^{2}, where *Y*_{i}(*t*) is an indicator of whether patient *i* survives beyond time *t* and *r*(*t*, *x*_{i}) denotes the predicted probability of surviving beyond time *t* for a patient with covariate vector *x*_{i}. Schumacher *et al*. show how to adapt the Brier score to censored data and utilize a 0.632 bootstrap cross-validation estimate of the Brier score as a function of *t* for a given data set. Binder and Schumacher [24] have used the Brier score to evaluate high-dimensional survival risk models built with mandatory covariates, and the permutation tests described here could be applied to the Brier score either at a fixed landmark time *t* or averaged over times.
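For concreteness, the empirical Brier score at a fixed time *t* for uncensored data is just the mean squared difference between the survival indicator and the predicted survival probability. The sketch below assumes uncensored data; the censored-data version of Schumacher *et al*. additionally reweights terms by inverse probabilities of censoring, which is omitted here. The function name is ours, not from any package.

```python
def brier_score(t, times, surv_prob):
    """Empirical Brier score at time t for uncensored data:
    mean over patients of [Y_i(t) - r(t, x_i)]^2, where
    Y_i(t) = 1 if patient i survives beyond t, else 0, and
    surv_prob[i] is the model's predicted probability that
    patient i survives beyond t."""
    total = 0.0
    for time_i, prob_i in zip(times, surv_prob):
        y = 1.0 if time_i > t else 0.0
        total += (y - prob_i) ** 2
    return total / len(times)
```

For example, with observed times (1.0, 3.0), predictions (0.2, 0.9) and *t* = 2, the indicators are (0, 1) and the score is the average of 0.2² and 0.1², about 0.025; a perfectly calibrated, perfectly sharp model scores 0.
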

There are some advantages to partitioning a data set into separate training and test sets when the numbers of patients and events are large. Such partitioning enables model development to be non-algorithmic, taking into account biological considerations in selecting genes for inclusion. It also enables multiple analysts to develop models on a common training set in a manner completely blinded to the test set. Some commentators recommend the use of a separate test set because cross-validation is so often used improperly, without re-selecting genes to be included in the model within each loop of the procedure [1, 28]. In many cases, however, proper cross-validation provides a more efficient use of the data than does sample splitting. If the training set is too small, then the model developed on the training set may be substantially poorer than the one developed on the full data set, and hence the accuracy measured on the test set will provide a biased estimate of the prediction accuracy for the model based on the full data set [3].

Proper complete cross-validation avoids optimistic bias in estimation of survival risk discrimination for the survival risk model developed on the full data set. Cross-validated estimates of survival risk discrimination can be pessimistically biased if the number of folds *K* is too small for the number of events, and the variance of the cross-validated risk group survival curves or time-dependent ROC curves will be large, particularly when *K* is large and the number of events is small. For example, in the null simulations there are several cases in which the cross-validated Kaplan–Meier curve for the low-risk group is below that for the high-risk group; this is due to the large variance of the estimates. A similar effect can be seen when the separation in the estimated survival curves for the combined model is less than that for the model containing only clinical covariates. This large variance is properly accounted for, however, in the permutation tests for evaluating whether the separation between cross-validated survival curves is statistically significant and whether the separation for the combined model is better than that for the model with clinical covariates only. Molinaro *et al*. [3] have studied the bias–variance tradeoff in estimating classification error for a variety of complete re-sampling methods including leave-one-out cross-validation, *K*-fold cross-validation, replicated *K*-fold cross-validation, sample splitting and the 0.632 bootstrap. The relative merits of the different methods depended on sample size, separation of the classes and the type of classifier used. For small sample sizes of fewer than 50 cases, they recommended leave-one-out cross-validation to minimize the mean squared error of the estimate of prediction error when the classifier developed on the full sample is applied to future observations. Subramanian and Simon [29] have extended this evaluation to survival risk models using the area under the time-dependent ROC curve as the measure of prediction accuracy. With survival modeling, the relative merits of the various re-sampling strategies depended on the number of events in the data set, the prediction strength of the variables in the true model and the modeling strategy. They recommended 5- or 10-fold cross-validation for a wide range of conditions. They indicated that although leave-one-out cross-validation was nearly unbiased, its large variance too often led to misleadingly optimistic estimates of prediction accuracy. Replicated *K*-fold cross-validation was found by Molinaro *et al*. [3] to provide small reductions in the variance of prediction error estimates for binary classification problems. It increases the complexity of identifying risk groups in survival modeling, however. Although it is offered in BRB-ArrayTools for the validation of class prediction, it is not offered for validation of survival risk prediction.

In summary, we believe that cross-validation methodology, if employed correctly, can be useful for the evaluation of survival risk modeling and should be utilized more widely. It can provide a more efficient use of data for model development and validation than does fixed sample splitting. In data sets with few events, however, the survival risk models developed may be much poorer than could be developed with more data, and the cross-validated Kaplan–Meier curves of risk groups and time-dependent ROC curves will be imprecise [2]. Although the cross-validation approaches described here are broadly useful, they are not a good substitute for a substantially larger sample size when that is possible. Often, however, larger studies come later, after initial results are felt to be promising; the 'promise' of initial results should be evaluated without bias, though this is often not the case. It should also be recognized that both cross-validation and sample splitting represent internal validation and do not reflect many of the sources of variability present when a predictive classifier is applied in broad clinical practice, outside of research conditions in which assaying of samples is performed in a single laboratory. Nevertheless, the development of effective diagnostics is a multi-stage process, starting with developmental studies in which efficient methods of internal validation play an important role.

Key Points

- Survival risk groups should generally be developed directly using the survival data, without reducing survival times to binary categories.
- Cross-validation methods are available for computing cross-validated survival curves and cross-validated ROC curves. These methods, if used properly, are more efficient than splitting a small data set into training and testing subsets.
- To use cross-validation properly, complete re-development of the survival risk model from scratch is required for each loop of the cross-validation process. This means that any variable selection or tuning parameter optimization should be repeated within each loop of the cross-validation.
- The cross-validated estimate of survival discrimination is an almost unbiased estimate of the survival risk group discrimination expected from classifying similar future patients using the risk groups obtained from applying the survival risk group development algorithm to the full data set.
- A permutation based significance test of the null hypothesis that survival risk discrimination is null can be computed based on the cross-validated log-rank statistic or the area under the cross-validated ROC curve. Similarly, one may test whether high-dimensional genomic variables add survival discrimination to standard clinical and histopathologic variables.
- Many of the tools for developing survival risk models and for evaluating such models using cross-validation are available in the BRB-ArrayTools software package.