antibody; diagnosis; risk
For comparing the performance of a baseline risk prediction model with one that includes an additional predictor, a risk reclassification analysis strategy has been proposed. The first step is to cross-classify risks calculated according to the 2 models for all study subjects. Summary measures, including the percentage of reclassification and the percentage of correct reclassification, are calculated, along with 2 reclassification calibration statistics. The author shows that interpretations of the proposed summary measures and P values are problematic. The author's recommendation is to display the reclassification table, because it shows interesting information, but to use alternative methods for summarizing and comparing model performance. The Net Reclassification Index has been suggested as one alternative method. The author argues for reporting the components of the Net Reclassification Index because they are more clinically relevant than the single numerical summary measure.
biological markers; diagnosis; epidemiologic methods; prognosis; risk model
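The event and non-event components of the Net Reclassification Index discussed above can be computed directly from a reclassification table. The following sketch is illustrative only: the risk-category cut points (10% and 20%) and the toy data are assumptions, not values from the paper.

```python
# Hypothetical sketch: cross-classify risks from a baseline and an augmented
# model, then report the NRI components separately for events and non-events.
# Cut points and data below are illustrative assumptions.

def risk_category(risk, cuts=(0.1, 0.2)):
    """Map a predicted risk to an ordinal risk-category index."""
    return sum(risk >= c for c in cuts)

def nri_components(base_risks, new_risks, events, cuts=(0.1, 0.2)):
    """Return (event component, non-event component) of the NRI.

    Event component     = P(up | event)      - P(down | event)
    Non-event component = P(down | nonevent) - P(up | nonevent)
    """
    up_e = down_e = up_ne = down_ne = n_e = n_ne = 0
    for b, n, d in zip(base_risks, new_risks, events):
        move = risk_category(n, cuts) - risk_category(b, cuts)
        if d:
            n_e += 1
            up_e += move > 0
            down_e += move < 0
        else:
            n_ne += 1
            up_ne += move > 0
            down_ne += move < 0
    return (up_e - down_e) / n_e, (down_ne - up_ne) / n_ne

# Toy illustration: two of three events move up a category, one of two
# non-events moves down, so the components are 2/3 and 1/2.
base = [0.05, 0.15, 0.25, 0.15, 0.05]
new  = [0.15, 0.25, 0.25, 0.05, 0.05]
d    = [1, 1, 1, 0, 0]
print(nri_components(base, new, d))  # (0.666..., 0.5)
```

Reporting the two components separately, as the abstract recommends, preserves the clinically distinct information about events moved up and non-events moved down that the single-number NRI obscures.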
Markers for treatment selection are being developed in many areas of medicine. Technological advances are rapidly producing an abundance of candidates for study. Clinicians hope to use these markers to identify which individuals will benefit from a given treatment, with the goal of maximizing good outcomes and minimizing side effects, treatment burden, and medical costs.
It is essential that we have appropriate methods for evaluating treatment selection markers, in order to make informed decisions regarding marker advancement and, ultimately, clinical application. However, existing statistical methods for evaluating treatment selection markers are largely inadequate. This paper proposes several novel statistical measures of marker performance aimed at addressing key questions in marker evaluation: 1) Does the marker help patients choose among treatment options? 2) How should treatment decisions be made based on a continuous marker measurement? 3) What is the impact on the population of using the marker to select treatment? 4) What proportion of patients will have different treatment recommendations following marker measurement? The proposed approach is contrasted with existing methods for marker evaluation, including assessing a marker’s prognostic value, evaluating treatment effects in the subset of the population that is marker-positive, and testing for a statistical interaction between marker value and treatment. The approach is illustrated in the context of choosing adjuvant chemotherapy treatment for women with estrogen-receptor-positive, node-positive breast cancer. The results have important implications for the design of marker evaluation studies, and can serve as the basis for further development of standards for assessing treatment selection markers.
The diagnostic likelihood ratio function, DLR, is a statistical measure used to evaluate risk prediction markers. The goal of this paper is to develop new methods to estimate the DLR function. Furthermore, we show how risk prediction markers can be compared using rank-invariant DLR functions. Various estimators are proposed that accommodate cohort or case–control study designs. Performances of the estimators are compared using simulation studies. The methods are illustrated by comparing a lung function measure and a nutritional status measure for predicting subsequent onset of major pulmonary infection in children suffering from cystic fibrosis. For continuous markers, the DLR function is mathematically related to the slope of the receiver operating characteristic (ROC) curve, an entity used to evaluate diagnostic markers. We show that our methodology can be used to estimate the slope of the ROC curve and illustrate use of the estimated ROC derivative in variance and sample size calculations for a diagnostic biomarker study.
Biomarker; density estimation; diagnosis; logistic regression; rank invariant; risk prediction; ROC–GLM
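To make the DLR function concrete: it is the ratio of the case marker density to the control marker density, which for a continuous marker also equals the slope of the ROC curve at the corresponding threshold. The sketch below is not the estimator proposed in the paper; it simply evaluates the DLR in closed form under an assumed binormal model with illustrative parameters.

```python
# Minimal sketch (not the paper's estimator): under a binormal model the
# DLR function is the ratio of case to control densities,
#   DLR(y) = f_case(y) / f_control(y),
# and equals the ROC curve's slope at threshold y. Parameters are assumptions.
import math

def normal_pdf(y, mu, sigma):
    """Density of a Normal(mu, sigma^2) distribution at y."""
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def dlr(y, mu_case=1.0, mu_control=0.0, sigma=1.0):
    """Diagnostic likelihood ratio at marker value y."""
    return normal_pdf(y, mu_case, sigma) / normal_pdf(y, mu_control, sigma)

# Midway between the two means, the likelihood ratio is exactly 1:
print(dlr(0.5))  # 1.0
```

Because the DLR is a density ratio, an estimate of it at a candidate threshold doubles as an estimate of the ROC derivative there, which is the quantity the abstract uses in variance and sample size calculations.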
Statistical evaluation of medical imaging tests used for diagnostic and prognostic purposes often employs receiver operating characteristic (ROC) curves. Two methods for ROC analysis are popular. The ordinal regression method is the standard approach used when evaluating tests with ordinal values. The direct ROC modeling method is a more recently developed approach that has been motivated by applications to tests with continuous values, such as biomarkers.
In this paper, we compare the methods in terms of model formulations, interpretations of estimated parameters, the ranges of scientific questions that can be addressed with them, their computational algorithms and the efficiencies with which they use data.
We show that a strong relationship exists between the methods by demonstrating that they fit the same models when only a single test is evaluated. The ordinal regression models are typically alternative parameterizations of the direct ROC models and vice versa. The direct method has two major advantages over the ordinal regression method: (i) estimated parameters relate directly to ROC curves, which facilitates interpretation of covariate effects on ROC performance; and (ii) comparisons between tests can be made directly in this framework, while accommodating covariate effects, and even between tests that have values on different scales, such as a continuous biomarker test and an ordinal-valued imaging test. The ordinal regression method provided slightly more precise parameter estimates in our simulated data models.
While the ordinal regression method is slightly more efficient, the direct ROC modeling method has important advantages in regards to interpretation and it offers a framework to address a broader range of scientific questions including the facility to compare tests.
comparisons; covariates; diagnostic test; markers; ordinal regression; percentile values
The predictiveness curve is a graphical tool that characterizes the population distribution of Risk(Y) = P(D = 1|Y), where D denotes a binary outcome such as occurrence of an event within a specified time period and Y denotes predictors. A wider distribution of Risk(Y) indicates better performance of a risk model in the sense that making treatment recommendations is easier for more subjects: decisions are more straightforward when a subject's risk is deemed to be high or low. Methods have been developed to estimate predictiveness curves from cohort studies. However, early-phase studies to evaluate novel risk prediction markers typically employ case-control designs. Here we present semiparametric and nonparametric methods for evaluating a continuous risk prediction marker that accommodate case-control data. Small-sample properties are investigated through simulation studies. The semiparametric methods are substantially more efficient than their nonparametric counterparts under a correctly specified model. We generalize them to settings where multiple prediction markers are involved. Applications to prostate cancer risk prediction markers illustrate methods for comparing the risk prediction capacities of markers and for evaluating the increment in performance gained by adding a marker to a baseline risk model. We propose a modified Hosmer-Lemeshow test for case-control study data to assess calibration of the risk model that is a natural complement to this graphical tool.
biomarker; case-control study; classification; Hosmer-Lemeshow test; predictiveness curve; risk; ROC curve
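For cohort data, the predictiveness curve described above has a simple empirical analogue: R(v) is the v-th quantile of the distribution of predicted risks. The sketch below shows only that plain empirical version (not the semiparametric case-control estimators of the paper); the risk values are made up.

```python
# Illustrative sketch: empirical predictiveness curve from cohort data.
# R(v) is the v-th quantile of the predicted-risk distribution.
# The risk values below are invented for illustration.

def predictiveness_curve(risks, v_grid):
    """Return the empirical risk quantile R(v) for each v in v_grid."""
    sorted_risks = sorted(risks)
    n = len(sorted_risks)
    # Smallest risk whose empirical CDF reaches v (clipped at the maximum).
    return [sorted_risks[min(int(v * n), n - 1)] for v in v_grid]

risks = [0.02, 0.05, 0.10, 0.20, 0.40, 0.60, 0.15, 0.08, 0.03, 0.30]
print(predictiveness_curve(risks, [0.1, 0.5, 0.9]))  # [0.03, 0.15, 0.6]
```

A steeply rising curve corresponds to the wide risk distribution the abstract identifies with good performance: many subjects land at clearly low or clearly high risk.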
In many clinical settings, statistical models are being developed for predicting risk of disease or other adverse event. These models are intended to help patients and physicians make informed decisions. A new approach to assessing the value of adding a new marker to a risk prediction model, called the risk stratification approach, was recently proposed by Cook and colleagues (1,2). This involves cross-tabulating risk predictions on the basis of models with and without the new marker, and has been widely adopted in the literature. We argue that important information with regard to three important model validation criteria can be extracted from risk stratification tables: 1) model fit or calibration; 2) capacity for risk stratification; and 3) accuracy of classifications based on risk. However, we describe how the information contained in the tables must be interpreted carefully, and caution against common misuses of the method. The concepts are illustrated using data from a recently published study of a breast cancer risk prediction model by Tice et al. (3).
Consider a continuous marker for predicting a binary outcome. For example, serum concentration of prostate specific antigen (PSA) may be used to calculate the risk of finding prostate cancer in a biopsy. In this paper we argue that the predictive capacity of a marker has to do with the population distribution of risk given the marker and suggest a graphical tool, the predictiveness curve, that displays this distribution. The display provides a common meaningful scale for comparing markers that may not be comparable on their original scales. Some existing measures of predictiveness are shown to be summary indices derived from the predictiveness curve. We develop methods for making inference about the predictiveness curve, for making pointwise comparisons between two curves and for evaluating covariate effects. Applications to risk prediction markers in cancer and cystic fibrosis are discussed.
risk; classification; explained variation; biomarker; ROC curve; prediction
Advances in biotechnology have raised expectations that biomarkers, including genetic profiles, will yield information to accurately predict outcomes for individuals. However, results to date have been disappointing. In addition, statistical methods to quantify the predictive information in markers have not been standardized.
We discuss statistical techniques to summarize predictive information including risk distribution curves and measures derived from them that relate to decision making. Attributes of these measures are contrasted with alternatives such as receiver operating characteristic curves, R-squared, percent reclassification and net reclassification index. Data are generated from simple models of risk conferred by genetic profiles for individuals in a population. Statistical techniques are illustrated and the risk prediction capacities of different risk models are quantified.
Risk distribution curves are most informative and relevant to clinical practice. They show proportions of subjects classified into clinically relevant risk categories. In a population in which 10% have the outcome event and subjects are categorized as high risk if their risk exceeds 20%, we found that identifying as high risk more than half of those destined to have an event required either 150 genes each with an odds ratio of 1.5 or 250 genes each with an odds ratio of 1.25, when the minor allele frequencies are 10%. We show that conclusions based on ROC curves may not be the same as conclusions based on risk distribution curves.
Many highly predictive genes will be required in order to identify substantial numbers of subjects at high risk.
biomarkers; classification; discrimination; prediction; statistical methods
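A simulation of the kind of genetic risk model discussed above can be sketched in a few lines: independent risk alleles combined on the log-odds scale, with subjects flagged as high risk when their risk exceeds a threshold. All parameter values here (gene count, odds ratio, allele frequency, prevalence, threshold) are illustrative assumptions, and the intercept centering is only approximate, so this is a sketch rather than the paper's exact data-generating model.

```python
# Hedged simulation sketch: n_genes independent risk genes, each with the
# same odds ratio and minor allele frequency, combined on the log-odds
# scale. Returns the fraction of events flagged by a risk > threshold rule.
# All parameter values are illustrative assumptions.
import math
import random

def simulate_high_risk_fraction(n_genes=150, odds_ratio=1.5, maf=0.10,
                                prevalence=0.10, threshold=0.20,
                                n_subjects=20000, seed=1):
    rng = random.Random(seed)
    log_or = math.log(odds_ratio)
    # Center the intercept at the expected risk-allele count so the
    # population risk is roughly the prevalence (approximate, not exact).
    mean_count = n_genes * 2 * maf
    intercept = math.log(prevalence / (1 - prevalence)) - log_or * mean_count
    flagged = events = 0
    for _ in range(n_subjects):
        count = sum(rng.random() < maf for _ in range(2 * n_genes))  # 2 alleles/gene
        risk = 1 / (1 + math.exp(-(intercept + log_or * count)))
        if rng.random() < risk:  # outcome occurs with probability `risk`
            events += 1
            flagged += risk > threshold
    return flagged / events

print(simulate_high_risk_fraction())
```

Varying `n_genes` and `odds_ratio` in a sketch like this reproduces the qualitative message of the abstract: very many modestly predictive genes are needed before a substantial share of future events falls in the high-risk category.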
The performance of a well-calibrated risk model for a binary disease outcome can be characterized by the population distribution of risk and displayed with the predictiveness curve. Better performance is characterized by a wider distribution of risk, since this corresponds to better risk stratification in the sense that more subjects are identified at low and high risk for the disease outcome. Although methods have been developed to estimate predictiveness curves from cohort studies, most studies to evaluate novel risk prediction markers employ case-control designs. Here we develop semiparametric methods that accommodate case-control data. The semiparametric methods are flexible, and naturally generalize methods previously developed for cohort data. Applications to prostate cancer risk prediction markers illustrate the methods.
Biased sampling; Biomarker; Case-control; Predictiveness curve; Risk prediction; Semiparametric method
The predictive capacity of a marker in a population can be described using the population distribution of risk (Huang et al. 2007; Pepe et al. 2008a; Stern 2008). Virtually all standard statistical summaries of predictability and discrimination can be derived from it (Gail and Pfeiffer 2005). The goal of this paper is to develop methods for making inference about risk prediction markers using summary measures derived from the risk distribution. We describe some new clinically motivated summary measures and give new interpretations to some existing statistical measures. Methods for estimating these summary measures are described along with distribution theory that facilitates construction of confidence intervals from data. We show how markers and, more generally, how risk prediction models, can be compared using clinically relevant measures of predictability. The methods are illustrated by application to markers of lung function and nutritional status for predicting subsequent onset of major pulmonary infection in children suffering from cystic fibrosis. Simulation studies show that methods for inference are valid for use in practice.
Recent scientific and technological innovations have produced an abundance of potential markers that are being investigated for their use in disease screening and diagnosis. In evaluating these markers, it is often necessary to account for covariates associated with the marker of interest. Covariates may include subject characteristics, expertise of the test operator, test procedures or aspects of specimen handling. In this paper, we propose the covariate-adjusted receiver operating characteristic curve, a measure of covariate-adjusted classification accuracy. Nonparametric and semiparametric estimators are proposed, asymptotic distribution theory is provided and finite sample performance is investigated. For illustration we characterize the age-adjusted discriminatory accuracy of prostate-specific antigen as a biomarker for prostate cancer.
Classification accuracy; Covariate effect; Receiver operating characteristic curve; Sensitivity; Specificity
The classification accuracy of a continuous marker is typically evaluated with the receiver operating characteristic (ROC) curve. In this paper, we study an alternative conceptual framework, the “percentile value.” In this framework, the controls only provide a reference distribution to standardize the marker. The analysis proceeds by analyzing the standardized marker in cases. The approach is shown to be equivalent to ROC analysis. Advantages are that it provides a framework familiar to a broad spectrum of biostatisticians and it opens up avenues for new statistical techniques in biomarker evaluation. We develop several new procedures based on this framework for comparing biomarkers and biomarker performance in different populations. We develop methods that adjust such comparisons for covariates. The methods are illustrated on data from 2 cancer biomarker studies.
Biomarker; Classification; Covariate adjustment; Percentile value; ROC; Standardization
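The percentile-value standardization described above is easy to state computationally: each case marker value is referred to the empirical control distribution F0, and 1 minus the resulting percentile value is the false positive rate at which that case would be detected. The data below are made up for illustration.

```python
# Sketch of percentile-value standardization: cases are standardized to
# the empirical control CDF, F0(y). Illustrative data only.

def percentile_values(cases, controls):
    """Return F0(y) for each case value y, with F0 the empirical control CDF."""
    n = len(controls)
    def F0(y):
        return sum(c <= y for c in controls) / n
    return [F0(y) for y in cases]

controls = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
cases = [9.5, 5.5, 10.5]
print(percentile_values(cases, controls))  # [0.9, 0.5, 1.0]
```

The equivalence with ROC analysis noted in the abstract follows because the empirical ROC curve at false positive rate t is the proportion of case percentile values that exceed 1 − t.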
The receiver operating characteristic (ROC) curve displays the capacity of a marker or diagnostic test to discriminate between two groups of subjects, cases versus controls. We present a comprehensive suite of Stata commands for performing ROC analysis. Nonparametric, semiparametric and parametric estimators are calculated. Comparisons between curves are based on the area or partial area under the ROC curve. Alternatively, pointwise comparisons between ROC curves or inverse ROC curves can be made. Options to adjust these analyses for covariates, and to perform ROC regression, are described in a companion article. We use a unified framework by representing the ROC curve as the distribution of the marker in cases after standardizing it to the control reference distribution.
Biomarkers that can be used in combination with established screening tests to reduce false positive rates are in considerable demand. In this article, we present methods for evaluating the diagnostic performance of combination tests that require positivity on a biomarker test in addition to a standard screening test. These methods rely on relative true and false positive rates to measure the loss in sensitivity and gain in specificity associated with the combination relative to the standard test. Inference about the relative rates follows from noting their interpretation as conditional probabilities. These methods are extended to evaluate combinations with continuous biomarker tests by introducing a new statistical entity, the relative receiver operating characteristic (rROC) curve. The rROC curve plots the relative true positive rate versus the relative false positive rate as the biomarker threshold for positivity varies. Inference can be made by applying existing ROC methodology. We illustrate the methods with two examples: a breast cancer biomarker study proposed by the Early Detection Research Network (EDRN) and a prostate cancer case-control study examining the ability of free prostate-specific antigen (PSA) to improve the specificity of the standard PSA test.
Diagnostic tests; Relative accuracy; ROC curve; Specificity; Study design
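The relative rates at the heart of the rROC idea above can be computed directly: for a combination that requires positivity on both the standard test and the biomarker, rTPR and rFPR are the combination's true and false positive rates divided by the standard test's. The sketch below uses invented data; varying the biomarker threshold traces out the rROC curve.

```python
# Illustrative sketch: relative true and false positive rates for a
# combination requiring positivity on the standard test AND marker > threshold.
# Data are made up for illustration.

def relative_rates(standard_pos, marker, threshold, disease):
    """Return (rTPR, rFPR) of the combination versus the standard test alone."""
    tp = fp = tp_combo = fp_combo = 0
    for s, y, d in zip(standard_pos, marker, disease):
        combo = s and y > threshold
        if d:
            tp += s
            tp_combo += combo
        else:
            fp += s
            fp_combo += combo
    return tp_combo / tp, fp_combo / fp

standard = [1, 1, 1, 1, 1, 1]          # all positive on the standard test
marker   = [0.9, 0.8, 0.2, 0.7, 0.1, 0.3]
disease  = [1, 1, 1, 0, 0, 0]
print(relative_rates(standard, marker, 0.5, disease))  # (0.666..., 0.333...)
```

Here the combination keeps two thirds of the standard test's sensitivity while cutting its false positives to a third, which is exactly the sensitivity-loss versus specificity-gain trade-off the relative rates are designed to quantify.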
Consider a set of baseline predictors X to predict a binary outcome D and let Y be a novel marker or predictor. This paper is concerned with evaluating the performance of the augmented risk model P(D = 1|Y,X) compared with the baseline model P(D = 1|X). The diagnostic likelihood ratio, DLRX(y), quantifies the change in risk obtained with knowledge of Y = y for a subject with baseline risk factors X. The notion is commonly used in clinical medicine to quantify the increment in risk prediction due to Y. It is contrasted here with the notion of the covariate-adjusted effect of Y in the augmented risk model. We also propose methods for making inference about DLRX(y). Case–control study designs are accommodated. The methods provide a mechanism to investigate whether the predictive information in Y varies with baseline covariates. In addition, we show that when combined with a baseline risk model and information about the population distribution of Y given X, covariate-specific predictiveness curves can be estimated. These curves are useful to an individual in deciding whether ascertainment of Y is likely to be informative for him or her. We illustrate with data from 2 studies: one is a study of the performance of hearing screening tests for infants, and the other concerns the value of serum creatinine in diagnosing renal artery stenosis.
Biomarker; Classification; Diagnostic likelihood ratio; Diagnostic test; Logistic regression; Posterior probability
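The clinical use of DLRX(y) described above is a Bayes update on the odds scale: posterior odds = prior odds × DLR. The numbers in this sketch are illustrative, not from either study in the paper.

```python
# Sketch of the odds-scale update that motivates the DLR: a subject's
# baseline risk is converted to odds, multiplied by the diagnostic
# likelihood ratio, and converted back. Numbers are illustrative.
def updated_risk(baseline_risk, dlr):
    """Bayes update of a pretest risk given the diagnostic likelihood ratio."""
    prior_odds = baseline_risk / (1 - baseline_risk)
    post_odds = prior_odds * dlr
    return post_odds / (1 + post_odds)

# A subject with 20% baseline risk and a marker value carrying DLR = 4
# moves to 50% risk: odds 0.25 * 4 = 1.
print(updated_risk(0.20, 4.0))  # 0.5
```

This also makes the abstract's point about informativeness concrete: if the plausible DLR values for a subject's covariate pattern are all near 1, measuring Y would barely move that subject's risk and may not be worth ascertaining.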
Classification accuracy is the ability of a marker or diagnostic test to discriminate between two groups of individuals, cases and controls, and is commonly summarized using the receiver operating characteristic (ROC) curve. In studies of classification accuracy, there are often covariates that should be incorporated into the ROC analysis. We describe three different ways of using covariate information. For factors that affect marker observations among controls, we present a method for covariate adjustment. For factors that affect discrimination (i.e. the ROC curve), we describe methods for modelling the ROC curve as a function of covariates. Finally, for factors that contribute to discrimination, we propose combining the marker and covariate information, and ask how much discriminatory accuracy improves with the addition of the marker to the covariates (incremental value). These methods follow naturally when representing the ROC curve as a summary of the distribution of case marker observations, standardized with respect to the control distribution.
Development of a disease screening biomarker involves several phases. In phase 2, its sensitivity and specificity are compared with established thresholds for minimally acceptable performance. Since we anticipate that most candidate markers will not prove to be useful, and availability of specimens and funding is limited, early termination of a study is appropriate if accumulating data indicate that the marker is inadequate. Yet, for markers that complete phase 2, we seek estimates of sensitivity and specificity to proceed with the design of subsequent phase 3 studies.
We suggest early stopping criteria and estimation procedures that adjust for bias caused by the early termination option. An important aspect of our approach is to focus on properties of estimates conditional on reaching full study enrollment. We propose the conditional-UMVUE and contrast it with other estimates, including naïve estimators, the well-studied unconditional-UMVUE, and the mean and median Whitehead adjusted estimators. The conditional-UMVUE appears to be a very good choice.
In a prospective cohort study, information on clinical parameters, tests and molecular markers is often collected. Such information is useful to predict patient prognosis and to select patients for targeted therapy. We propose a new graphical approach, the positive predictive value (PPV) curve, to quantify the predictive accuracy of prognostic markers measured on a continuous scale with censored failure time outcome. The proposed method highlights the need to consider both predictive values and the marker distribution in the population when evaluating a marker, and it provides a common scale for comparing different markers. We consider both semiparametric and nonparametric based estimating procedures. In addition, we provide asymptotic distribution theory and resampling based procedures for making statistical inference. We illustrate our approach with numerical studies and datasets from the Seattle Heart Failure Study.
Prognostic accuracy; Positive predictive value; Survival analysis
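The PPV curve described above plots, against each quantile v of the marker distribution, the probability of the outcome among subjects whose marker is at or above that quantile. The sketch below is a simplified nonparametric version for a binary outcome (the paper handles censored failure times, which this ignores); all data are invented.

```python
# Simplified sketch of a PPV curve for a binary outcome (censoring ignored):
# PPV(v) = P(D = 1 | marker >= v-th marker quantile). Illustrative data only.

def ppv_curve(marker, outcome, v_grid):
    """PPV among subjects whose marker exceeds each quantile in v_grid."""
    pairs = sorted(zip(marker, outcome))
    n = len(pairs)
    ppvs = []
    for v in v_grid:
        cut = pairs[min(int(v * n), n - 1)][0]          # v-th marker quantile
        selected = [d for y, d in pairs if y >= cut]    # subjects at/above it
        ppvs.append(sum(selected) / len(selected))
    return ppvs

marker  = [1, 2, 3, 4, 5, 6, 7, 8]
outcome = [0, 0, 0, 1, 0, 1, 1, 1]
print(ppv_curve(marker, outcome, [0.0, 0.5, 0.75]))  # [0.5, 0.75, 1.0]
```

Plotting PPV against the marker quantile rather than the raw marker value is what puts different markers on the common scale the abstract emphasizes.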
There are two popular statistical approaches to biomarker evaluation. One models the risk of disease (or disease outcome) with, for example, logistic regression. A marker is considered useful if it has a strong effect on risk. The second evaluates classification performance by use of measures such as sensitivity, specificity, predictive values, and receiver operating characteristic curves. There is controversy about which approach is more appropriate. Moreover, the two approaches can give contradictory results on the same data. The authors present a new graphic, the predictiveness curve, which complements the risk modeling approach. It assesses the usefulness of a risk model when applied to the population. Although the predictiveness curve relates to classification performance measures, it also displays essential information about risk that is not displayed by the receiver operating characteristic curve. The authors propose that the predictiveness and classification performance of a marker, displayed together in an integrated plot, provide a comprehensive and cohesive assessment of a risk marker or model. The methods are demonstrated with data on prostate-specific antigen and risk factors from the Prostate Cancer Prevention Trial, 1993–2003.
biological markers; classification analysis; diagnostic tests, routine; epidemiologic methods; predictive value of tests; prostate-specific antigen; risk assessment; risk model
Research methods for biomarker evaluation lag behind those for evaluating therapeutic treatments. Although a phased approach to development of biomarkers exists and guidelines are available for reporting study results, a coherent and comprehensive set of guidelines for study design has not been delineated. We describe a nested case–control study design that involves prospective collection of specimens before outcome ascertainment from a study cohort that is relevant to the clinical application. The biomarker is assayed in a blinded fashion on specimens from randomly selected case patients and control subjects in the study cohort. We separately describe aspects of the design that relate to the clinical context, biomarker performance criteria, the biomarker test, and study size. The design can be applied to studies of biomarkers intended for use in disease diagnosis, screening, or prognosis. Common biases that pervade the biomarker research literature would be eliminated if these rigorous standards were followed.
Consider a gene expression array study comparing two groups of subjects where the goal is to explore a large number of genes in order to select for further investigation a subset that appear to be differentially expressed. There has been much statistical research into the development of formal methods for designating genes as differentially expressed. These procedures control error rates such as the false discovery rate or familywise error rate. We contend, however, that other statistical considerations are also relevant to the task of gene selection. These include the extent of differential expression and the strength of evidence for differential expression at a gene. Using real and simulated data, we first demonstrate that a proper exploratory analysis should evaluate these aspects as well as decision rules that control error rates. We propose a new measure called the mp-value that quantifies strength of evidence for differential expression. The mp-values are calculated with a resampling-based algorithm taking into account the multiplicity and dependence encountered in microarray data. In contrast to traditional p-values, our mp-values do not depend on specification of a decision rule for their definition. They are simply descriptive in nature. We contrast the mp-values with multiple-testing p-values in the context of data from a breast cancer prognosis study and from a simulation model.