Results 1-3 (3)
Using cross-validation to evaluate predictive accuracy of survival risk classifiers based on high-dimensional data
Briefings in Bioinformatics
Developments in whole genome biotechnology have stimulated statistical focus on prediction methods. We review here methodology for classifying patients into survival risk groups and for using cross-validation to evaluate such classifications. Measures of discrimination for survival risk models include separation of survival curves, time-dependent ROC curves and Harrell’s concordance index. For high-dimensional data applications, however, computing these measures as re-substitution statistics on the same data used for model development results in highly biased estimates. Most developments in methodology for survival risk modeling with high-dimensional data have utilized separate test data sets for model evaluation. Cross-validation has sometimes been used for optimization of tuning parameters. In many applications, however, the data available are too limited for effective division into training and test sets and consequently authors have often either reported re-substitution statistics or analyzed their data using binary classification methods in order to utilize familiar cross-validation. In this article we have tried to indicate how to utilize cross-validation for the evaluation of survival risk models; specifically how to compute cross-validated estimates of survival distributions for predicted risk groups and how to compute cross-validated time-dependent ROC curves. We have also discussed evaluation of the statistical significance of a survival risk model and evaluation of whether high-dimensional genomic data adds predictive accuracy to a model based on standard covariates alone.
Keywords: predictive medicine; survival risk classification; cross-validation; gene expression
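The key point of the abstract above is that risk-group assignment must be made on held-out data: within each cross-validation fold, the risk score and its cutoff are fit on the training folds only, and the left-out patients are classified with that frozen rule before pooled Kaplan-Meier curves are computed. The sketch below illustrates that procedure on synthetic data with NumPy only; the feature-scoring rule, the severity proxy, and the data generator are all illustrative assumptions, not the authors' method, and the Kaplan-Meier helper ignores tied event times for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Synthetic high-dimensional survival data (illustrative assumption) ---
n, p = 120, 500
X = rng.normal(size=(n, p))
lin = X[:, :10].sum(axis=1) * 0.3                 # first 10 features drive hazard
time = rng.exponential(scale=np.exp(-lin))
cens = rng.exponential(scale=np.median(time) * 2, size=n)
obs_time = np.minimum(time, cens)
event = (time <= cens).astype(int)                # 1 = event observed, 0 = censored

def train_risk_score(Xtr, t_tr, e_tr, n_genes=25):
    """Crude training rule (hypothetical): rank features by correlation
    with a simple severity proxy and return chosen indices and signs."""
    proxy = e_tr * -np.log(t_tr + 1e-8)
    corr = (Xtr * (proxy - proxy.mean())[:, None]).mean(axis=0)
    top = np.argsort(-np.abs(corr))[:n_genes]
    return top, np.sign(corr[top])

def kaplan_meier(t, e):
    """Kaplan-Meier survival estimate at the sorted observed times."""
    order = np.argsort(t)
    t, e = t[order], e[order]
    at_risk = np.arange(len(t), 0, -1)
    return t, np.cumprod(1.0 - e / at_risk)

# --- K-fold cross-validated risk-group assignment ---
K = 5
folds = np.array_split(rng.permutation(n), K)
group = np.empty(n, dtype=int)                    # 0 = low risk, 1 = high risk
for fold in folds:
    train = np.setdiff1d(np.arange(n), fold)
    top, sign = train_risk_score(X[train], obs_time[train], event[train])
    cut = np.median(X[np.ix_(train, top)] @ sign)  # cutoff fixed on training folds
    group[fold] = (X[np.ix_(fold, top)] @ sign > cut).astype(int)

# Cross-validated Kaplan-Meier curves for the two predicted groups
for g, label in [(0, "low risk"), (1, "high risk")]:
    m = group == g
    t_g, s_g = kaplan_meier(obs_time[m], event[m])
    print(f"{label}: n={m.sum()}, KM survival at last time = {s_g[-1]:.2f}")
```

Because every patient's group label comes from a model that never saw that patient, the separation between the two pooled curves is an honest estimate, unlike re-substitution curves computed on the full training data.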
Use of Archived Specimens in Evaluation of Prognostic and Predictive Biomarkers
Hayes, Daniel F.
JNCI Journal of the National Cancer Institute
The development of tumor biomarkers ready for clinical use is complex. We propose a refined system for biomarker study design, conduct, analysis, and evaluation that incorporates a hierarchal level of evidence scale for tumor marker studies, including those using archived specimens. Although fully prospective randomized clinical trials to evaluate the medical utility of a prognostic or predictive biomarker are the gold standard, such trials are costly, so we discuss more efficient indirect “prospective–retrospective” designs using archived specimens. In particular, we propose new guidelines that stipulate that 1) adequate amounts of archived tissue must be available from enough patients from a prospective trial (which for predictive factors should generally be a randomized design) for analyses to have adequate statistical power and for the patients included in the evaluation to be clearly representative of the patients in the trial; 2) the test should be analytically and preanalytically validated for use with archived tissue; 3) the plan for biomarker evaluation should be completely specified in writing before the performance of biomarker assays on archived tissue and should be focused on evaluation of a single completely defined classifier; and 4) the results from archived specimens should be validated using specimens from one or more similar, but separate, studies.
Optimally splitting cases for training and testing high dimensional classifiers
Dobbin, Kevin K.
BMC Medical Genomics
We consider the problem of designing a study to develop a predictive classifier from high dimensional data. A common study design is to split the sample into a training set and an independent test set, where the former is used to develop the classifier and the latter to evaluate its performance. In this paper we address the question of what proportion of the samples should be devoted to the training set. How does this proportion impact the mean squared error (MSE) of the prediction accuracy estimate?
We develop a non-parametric algorithm for determining an optimal splitting proportion that can be applied with a specific dataset and classifier algorithm. We also perform a broad simulation study for the purpose of better understanding the factors that determine the best split proportions and to evaluate commonly used splitting strategies (1/2 training or 2/3 training) under a wide variety of conditions. These methods are based on a decomposition of the MSE into three intuitive component parts.
By applying these approaches to a number of synthetic and real microarray datasets we show that for linear classifiers the optimal proportion depends on the overall number of samples available and the degree of differential expression between the classes. The optimal proportion was found to depend on the full dataset size (n) and classification accuracy - with higher accuracy and smaller n resulting in a larger proportion assigned to the training set. The commonly used strategy of allocating 2/3 of cases for training was close to optimal for reasonably sized datasets (n ≥ 100) with strong signals (i.e. 85% or greater full dataset accuracy). In general, we recommend use of our nonparametric resampling approach for determining the optimal split. This approach can be applied to any dataset, using any predictor development method, to determine the best split.
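The quantity being optimized above is the mean squared error of the accuracy estimate as a function of the training proportion. The sketch below illustrates that idea on synthetic data: for each candidate proportion, random splits are drawn repeatedly, a simple nearest-centroid classifier is trained on the training portion, and the squared error of the test-set accuracy against a reference accuracy is averaged. This is not the authors' algorithm (which works without access to the data-generating process); here the reference accuracy is estimated from a large extra sample of the synthetic generator, which is an assumption available only in simulation.

```python
import numpy as np

rng = np.random.default_rng(1)

# --- Synthetic two-class "microarray" generator (illustrative assumption) ---
p, delta = 200, 0.4                     # dimension and class separation
def sample(n):
    y = rng.integers(0, 2, n)
    X = rng.normal(size=(n, p))
    X[:, :20] += delta * y[:, None]     # 20 informative features
    return X, y

def nearest_centroid(Xtr, ytr, Xte):
    """Predict the class whose training centroid is nearest in Euclidean distance."""
    c0 = Xtr[ytr == 0].mean(axis=0)
    c1 = Xtr[ytr == 1].mean(axis=0)
    d0 = ((Xte - c0) ** 2).sum(axis=1)
    d1 = ((Xte - c1) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)

# Reference accuracy of a classifier trained on the full dataset,
# estimated on a large independent sample from the same generator.
n_total = 100
X_full, y_full = sample(n_total)
X_big, y_big = sample(5000)
ref_acc = (nearest_centroid(X_full, y_full, X_big) == y_big).mean()

# --- Resampling estimate of MSE for each candidate split proportion ---
for prop in (0.5, 2 / 3, 0.8):
    n_train = int(round(prop * n_total))
    errs = []
    for _ in range(200):
        idx = rng.permutation(n_total)
        tr, te = idx[:n_train], idx[n_train:]
        pred = nearest_centroid(X_full[tr], y_full[tr], X_full[te])
        acc_hat = (pred == y_full[te]).mean()
        errs.append((acc_hat - ref_acc) ** 2)
    print(f"train prop {prop:.2f}: MSE of accuracy estimate = {np.mean(errs):.4f}")
```

The trade-off the loop exposes is the one the abstract describes: a larger training fraction yields a better classifier (less bias in the accuracy being estimated) but a smaller test set (more variance in the estimate), and the proportion minimizing MSE shifts with sample size and signal strength.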
PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR), working in partnership with the National Research Council's Canada Institute for Scientific and Technical Information, in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine (NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers.