Diagnostic tests are evaluated on the basis of measures defined conditionally on the true disease status (sensitivity, specificity and receiver operating characteristic, ROC, curve) or conditionally on the test outcome (positive predictive value, PPV, and negative predictive value, NPV). Measures of test performance, such as sensitivity, specificity or ROC curves, provide the type of information that is typically needed for technology assessment and health policy purposes. Measures of predictive value provide the type of information that is typically needed for clinical decision-making, where clinicians and patients decide whether to use a test or how to assess the implications of a test result. The clinical relevance of predictive value information notwithstanding, a large majority of diagnostic test evaluations continue to be designed with a primary focus on measures defined conditionally on the disease status. One of the reasons for this is the theoretical invariance of sensitivity and specificity to disease prevalence. Predictive values, in contrast, vary across populations with different disease prevalence, making comparisons of diagnostic tests difficult.
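The prevalence dependence of predictive values follows directly from Bayes' theorem: even with sensitivity and specificity held fixed, the PPV and NPV shift as prevalence changes. The following minimal sketch (the function names are ours and purely illustrative) makes this concrete:

```python
def ppv(sens, spec, prev):
    """PPV = P(diseased | test positive), by Bayes' theorem."""
    return sens * prev / (sens * prev + (1 - spec) * (1 - prev))

def npv(sens, spec, prev):
    """NPV = P(non-diseased | test negative), by Bayes' theorem."""
    return spec * (1 - prev) / ((1 - sens) * prev + spec * (1 - prev))

# The same test (sens = spec = 0.90) applied in populations with
# different disease prevalence yields very different predictive values:
for prev in (0.01, 0.10, 0.50):
    print(f"prev={prev:.2f}  PPV={ppv(0.9, 0.9, prev):.3f}  "
          f"NPV={npv(0.9, 0.9, prev):.3f}")
```

At 1% prevalence the PPV of this test is below 0.1 despite 90% sensitivity and specificity, while at 50% prevalence it reaches 0.9, which is the comparison problem the text describes.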
The invariance to disease prevalence of quantities such as sensitivity and specificity is predicated on the assumption that the disease status is the only variable affecting test outcome. Although this simplification is useful as a building block of the theory of diagnostic test evaluation, in reality the test outcome may depend on a number of other characteristics beyond disease status (Moons & Harrell 2003). For example, in a clinical study where an electrocardiographic stress test was used to diagnose coronary artery disease, the sensitivity and specificity of the stress test result varied substantially according to gender, relative workload and the number of diseased vessels (Moons et al. 1997). Moreover, variations in sensitivity, specificity and ROC curves are routinely observed in meta-analyses of diagnostic tests (Irwig et al. 1994; Rutter & Gatsonis 2001) and in studies of the performance of test interpreters (Ishwaran & Gatsonis 2000; Beam et al. 2003).
The available statistical methodology for the study of the predictive value of tests is less extensive than the corresponding methodology for measures defined conditionally on disease status (Bennett 1985; Pepe 2003). Copas (1999) proposed to use the logit rank plot as a summary of the effectiveness of risk scores. Summary measures of the resulting predictiveness curve were subsequently considered (Bura & Gastwirth 2001) and a thorough study of inference based on the full predictiveness curve was presented (Huang et al. 2007). Leisenring et al. (2000) discussed a model-based approach to the comparison of the predictive values of binary tests for paired designs. In this approach, a marginal regression modelling framework was used with disease status as the response variable and test indicator as an explanatory variable.
The effect of the threshold value used to declare a positive test result is a fundamental aspect of our understanding of the performance of diagnostic tests. This threshold is the conceptual basis for the well-known trade-off between sensitivity and specificity of a test that gives rise to the ROC curve. A significant body of statistical literature has discussed models with implicit or explicit thresholds for test positivity (Hanley & McNeil 1982; Hanley 1989; Hanley 1998; Pepe 2003). Although it is clear that the PPV and NPV of a test are also functions of the threshold for test positivity, the effect of this dependence has not been studied extensively. This dependence induces a close relation between the two quantities and implies that, if the threshold is moved, both will be affected. It follows that a complete characterization of the predictive power of a test requires the study of both quantities as a pair. Moskowitz & Pepe (2004) proposed a graphical method and a regression framework to estimate and compare predictive values of continuous prognostic factors as a function of the positivity threshold. In that work, the PPV and NPV are quantified and assessed separately. To the best of our knowledge, the joint evaluation of the two quantities has not been discussed in the literature. Although both quantities may not be of equal interest in a given practical setting, it is rarely the case that interest lies exclusively in one of them. However, the joint behaviour of the two quantities cannot necessarily be inferred from the marginal behaviour assessed by a separate analysis of each.
In this paper, we undertake a systematic study of the effect of the positivity threshold on the pair of PPV and NPV of tests. Our emphasis is on the study of the interplay between the two types of predictive values of a test as the positivity criterion varies and on the development of summaries of a test's possible pairs of predictive values. In particular, we define the predictive receiver operating characteristic (PROC) curve that shows all possible pairs of the PPV and NPV of a test as the threshold varies. We study the geometric properties of the PROC curve, discuss methods for estimating the curve for continuous and ordinal valued tests and propose summary measures for the predictive performance of tests. We also formulate and discuss regression models for the estimation of the effects of covariates.
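As a numerical illustration of this definition, the pairs of predictive values traced by a PROC curve can be computed by sweeping the positivity threshold. The sketch below assumes a binormal test (diseased results distributed N(1,1), non-diseased N(0,1)) and a prevalence of 0.2; these parameter choices are ours and purely illustrative, not values from this paper:

```python
import math

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def ppv_npv(c, mu_d=1.0, mu_nd=0.0, prev=0.2):
    """(PPV, NPV) pair at threshold c for an illustrative binormal test.

    A result is called positive when it exceeds c; diseased results are
    N(mu_d, 1), non-diseased results are N(mu_nd, 1).
    """
    sens = 1.0 - Phi(c - mu_d)   # P(T > c | diseased)
    spec = Phi(c - mu_nd)        # P(T <= c | non-diseased)
    ppv = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
    npv = spec * (1 - prev) / ((1 - sens) * prev + spec * (1 - prev))
    return ppv, npv

# Sweeping the threshold traces the PROC curve: PPV and NPV move
# together, in opposite directions, as the positivity criterion varies.
for c in (-1.0, 0.0, 0.5, 1.0, 2.0):
    p, n = ppv_npv(c)
    print(f"c={c:+.1f}  PPV={p:.3f}  NPV={n:.3f}")
```

The output shows the trade-off that motivates studying the two quantities as a pair: raising the threshold increases the PPV while decreasing the NPV, so neither marginal summary captures the joint behaviour.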
This paper is organized as follows. In §2 we define the PROC curve that can apply to tests with continuous or ordinal categorical outcomes. The geometric patterns and other properties of the PROC curve are discussed in §3. Details of mathematical derivations of PROC curve properties are presented in appendix B of the electronic supplementary material. Section 4 presents the estimation of the PROC curve. We discuss an indirect approach through ROC curve estimation for tests in general. We also propose a direct approach that jointly estimates the PPV and NPV for tests with ordinal outcomes. In §5 we describe methods for evaluating a test's predictive performance and comparing tests using the PROC curve. Illustrative examples with both continuous and ordinal test data are presented in §6, including assessing the ability of standardized uptake value based on lean body mass (SUV-lean; continuous test data) to predict axillary node involvement in women diagnosed with breast cancer, and comparing predictive accuracy between digital and screen-film mammography (ordinal test data) for breast-cancer screening. Section 7 summarizes our conclusions.