|Home | About | Journals | Submit | Contact Us | Français|
Classification accuracy is the ability of a marker or diagnostic test to discriminate between two groups of individuals, cases and controls, and is commonly summarized using the receiver operating characteristic (ROC) curve. In studies of classification accuracy, there are often covariates that should be incorporated into the ROC analysis. We describe three different ways of using covariate information. For factors that affect marker observations among controls, we present a method for covariate adjustment. For factors that affect discrimination (i.e. the ROC curve), we describe methods for modelling the ROC curve as a function of covariates. Finally, for factors that contribute to discrimination, we propose combining the marker and covariate information, and ask how much discriminatory accuracy improves with the addition of the marker to the covariates (incremental value). These methods follow naturally when representing the ROC curve as a summary of the distribution of case marker observations, standardized with respect to the control distribution.
The classification accuracy of a marker (Y) is most commonly described by the receiver operating characteristic (ROC) curve, a plot of the true positive rate (TPR) versus the false positive rate (FPR) for the set of rules which classify an individual as “test-positive” if Y ≥ c, where the threshold c is varied over all possible values (Pepe et al., 2001; Baker, 2003). Equivalently, the ROC curve can be represented as the cumulative distribution function (CDF) of the case marker observations, standardized with respect to the control distribution (Pepe and Cai, 2004; Pepe and Longton, 2005). The standardized marker observations, or percentile values, are written as pvD = F(YD), where F is the right-continuous CDF of Y among controls and YD denotes a case marker observation. The ROC curve at a FPR of f is
In many settings, covariates should be incorporated into the ROC analysis. First, there are covariates which impact the marker distribution among controls. For example, “center effects” in multi-center studies may affect marker observations. In Section 2, we describe methods for adjusting the ROC curve for such covariates. The associated Stata programs are called roccurve and comproc. Other covariates may affect the inherent discriminatory accuracy of the marker (i.e. the ROC curve). For example, disease severity often impacts marker accuracy, with less severe cases being more difficult to distinguish from controls. In Section 3 we describe an ROC regression method which allows the ROC curve to depend on covariates. The associated Stata program is called rocreg. Finally, there are covariates which contribute to discrimination. For example, baseline risk factors for disease provide some ability to discriminate between cases and controls. A common question is how much discriminatory accuracy the marker adds to the known classifiers (i.e. incremental value). In Section 4 we describe methods for evaluating incremental value.
This paper is a companion to another article in this journal (Pepe et al., 2007) which describes the use of the programs roccurve and comproc for estimating and comparing ROC curves without incorporating covariate information.
Consider a covariate, Z, which affects the distribution of the marker among controls. Figure 1 shows hypothetical data for a continuous marker Y, binary outcome D, and binary covariate Z. The data can be downloaded from the Diagnostic and Biomarker Statistical Center (DABS) website (http://www.fhcrc.org/labs/pepe/dabs). Suppose for concreteness that Z is an indicator of study center. Observe that marker observations among controls (D = 0) tend to be higher in center 1 as compared with center 0, but that the inherent accuracy of the marker (the ROC curve) is the same in the two centers. Consider the pooled ROC curve for Y which combines all case observations together and all control observations together, regardless of study center. Observe in Figure 1 that when the proportion of cases varies across centers (scenario 1), the pooled ROC curve for Y is overly optimistic relative to the ROC curve for Y in each center. Even when Z is independent of the outcome (i.e. the proportion of cases is held constant across centers; scenario 2), the pooled ROC curve is biased; this time it is attenuated with respect to the center-specific ROC curve. This suggests that covariates which impact marker observations among controls should be statistically adjusted in the ROC analysis.
We propose a covariate-adjusted measure of classification accuracy called the covariate-adjusted ROC curve, or ROC (Janes and Pepe, 2006; Janes and Pepe, 2007). Conceptually, this is a stratified measure of marker performance. It is defined as
where pvDZ = FZ(YDZ) represents the case observation with covariate value Z(YDZ) standardized with respect to the control population with the same value of Z. When the performance of the marker is the same across populations with different values of Z, as in Figure 1, the ROC is the common covariate-specific ROC curve. More generally, it is a weighted average of covariate-specific ROC curves (Janes and Pepe, 2006). Equivalently, the ROC is the ROC curve for Y when Z-specific thresholds are used for classification. The thresholds, cZ, are chosen to ensure that the covariate-specific FPR, FPRZ(cZ), is common across levels of Z.
Estimation of the ROC proceeds in two steps: 1) estimate FZ, the distribution of the marker in controls as a function of Z. For each subject i, calculate the case percentile value: pvDZi = FZi(YDZi). 2) Estimate the CDF of the case percentile values.
Estimating FZ begins with specifying how Z acts on the distribution of Y among controls. For example, a linear model could be specified,
The random error, ε, could be assumed to be normally distributed, ε ~ N(0, σ2), which would lead to case percentile values
where Φ is the standard normal CDF and 0, 1, and are estimates from the linear model. Alternatively, the error distribution could be estimated empirically using the residuals from the linear model as in Heagerty and Pepe (1999). This would lead to percentile values
In addition to allowing Z to act linearly on marker observations among controls, the roccurve command allows for stratifying on Z. Here again the distribution of Y among controls conditional on Z can be estimated empirically or by assuming a normal distribution.
Once the percentile values have been calculated, their CDF must be estimated. This estimation step is described in more detail in the companion paper (Pepe et al., 2007). Briefly, the CDF can be estimated empirically, or a parametric distribution can be assumed. The roccurve program allows parametric forms
where g = Φ is the standard normal CDF or g(·) = exp(·)/(1+exp(·)) is the logistic function. These forms give rise to binormal (Dorfman and Alf, 1969) and bilogistic (Ogilvie and Creelman, 1968) ROC curves.
In order to fit the ROC model, a discrete set of FPR points, f1, …, fnp is chosen. These points may span the interval (0, 1) or a subinterval of interest, (a, b). For each case observation, a set of np records is created. The kth record for the ith subject includes the binary outcome and covariate g−1(fk). A binary regression model with link g, outcome U, and covariate g−1(f) provides estimates of (α0, α1) (Alonzo and Pepe, 2002).
We bootstrap the data to obtain standard errors for the estimated ROC. The data should be resampled according to the design of the study; for a case-control study this means resampling separately within case and control strata. If the data are clustered, the clusters should be the resampling units.
Consider as an example, data from a neonatal audiology study designed to evaluate the accuracy with which three audiology tests identify hearing impairment in newborns (Norton et al., 2000). The data can be downloaded from the DABS website or loaded directly into Stata using
Note that test results for hearing-unimpaired ears may depend on the age and gender of the child. Figure 2 shows the estimated age- and gender-adjusted ROC curves for the marker DPOAE. Several estimation options are shown. The first estimator assumes a linear model for marker measurements among controls,
where the error distribution is estimated empirically. The CDF of the estimated placement values,
is estimated empirically. The second estimator adds the assumption that ε is normally distributed, and the third estimator additionally assumes that the ROC curve is binormal. Clustered bootstrapping is used for inference to account for correlation amongst observations (ears) for the same individual. Observe that the ROC fit is somewhat sensitive to the normality assumption at the high end of the marker distribution. We next describe how to estimate these curves using the roccurve command.
The syntax for the roccurve command is
roccurve disease_var test_varlist [if] [in] [, options]
where disease_var is the name of the binary outcome, D = 1 for a case and D = 0 for a control, and test_varlist is the list of markers.
The general roccurve options are described in detail in the companion paper (Pepe et al., 2007). Here we focus on options that relate to covariate adjustment.
The covariates to be used for adjustment are specified using the option adjcov(varlist). The option adjmodel(model) specifies how the covariates are to be used for adjustment; the default is stratified, where the control marker distribution is stratified on the covariates. The other option is linear; here the covariates are assumed to act linearly on the control marker distribution.
Standardized marker values are calculated according to specification in the option pvcmethod(method). Options include empirical (the default), where the control marker distribution is estimated empirically conditional on the covariates, and normal, where the control marker is assumed to have a normal distribution conditional on the covariates.
rocmethod(method) specifies whether the empirical or parametric model for the ROC curve is used. The link option is required for a parametric ROC model; a binormal model is fit with link( probit) and a bilogistic model with link( logistic). In the case of a parametric ROC model, the option interval(a b np) can be used to specify that the model is fit at np points over the restricted FPR interval (a, b).
Bootstrapping is used for inference. By default the data are resampled conditional on the binary outcome. The option noccsamp specifies that data be resampled without regard to the outcome. The option nostsamp specifies that sampling be done without regard to covariate strata; by default, when covariates are used for stratification, bootstrap samples are drawn from within covariate strata. The cluster(varlist) option can be used to bootstrap clustered data.
Other options include: tiecorr, which corrects for ties between case and control observations; various plot options; and options for saving the estimated TPRs, FPRs, and percentile values as new variables. These are all discussed in more detail in the companion article (Pepe et al., 2007).
The following code produces the estimators shown in Figure 2:
Summary measures of the ROC curve serve as metrics for comparing markers. The area under the covariate-adjusted ROC curve, , can be interpreted as the probability that, for a random case and control marker observation with the same covariate value, the case observation is higher than the control. This is not a clinially relevant summary of marker performance, as the task is not to determine which of a pair of subjects is the case. Moreover, the AUC summarizes the entire ROC curve, when frequently only a portion (e.g. low FPRs) is of interest.
A more clinically meaningful summary measure of the covariate-adjusted ROC curve is the ROC curve (TPR) at a fixed FPR = f of interest. This can be interpreted as the percent of cases detected when the covariate-specific FPRs are held at f. Alternatively, the FPR corresponding to a specific TPR = ROC−1(t) could be reported. This is the common covariate-specific FPR associated with a proportion t of cases detected.
The partial area under the ROC, , can be viewed as a compromise between the AUC and the ROC at a specified point. It has the advantage of focusing on a portion of the ROC, but it lacks a clinically relevant interpretation.
The ROC summary measures are estimated in the same way as their counterparts for the traditional ROC curve. The AUC estimate is the sample average of the case standardized marker values,
where the sum is over the nD case observations. When the case percentile values are estimated non-parametrically (i.e. with stratification on Z), this is a weighted average of empirical AUCs in each covariate stratum. The estimated pAUC is also an average of standardized marker values (Dodd and Pepe, 2003),
When the control marker distribution is estimated empirically, corrections are made for ties between case and control marker observations, as discussed in the companion paper (Pepe et al., 2007).
Estimates of AUC and pAUC values for parametric ROC models generally require numerical integration and are not produced by our programs. Instead the parameters are estimated using empirical averages of percentile values, as in equations (1) and (2). Similarly, we estimate ROC curves at fixed FPR = f by calculating the proportion of percentile values that are greater than 1 − f, rather than the value estimated by a parametric ROC model.
Comparisons between ROC curves can be made using any of the summary indices discussed above. A confidence interval for the difference in summary measures is calculated using the bootstrap. A Wald statistic, which divides the observed difference by its standard error, is compared to the standard normal distribution to obtain a p-value. Standard errors are obtained by bootstrapping. The comproc command is used to compare ROC curves.
The syntax of the comproc command is
comproc disease_var test_var1 [test_var2] [if] [in] [, options]
where disease_var is the binary outcome and test var1 and test var2 are the two markers to be compared. If only one marker is specified, the requested summary statistics are returned but no comparisons are made.
Marker standardization and bootstrap options are the same as with the roccurve command. The choices of summary measures are: auc, the area under the ROC; pauc(f), the partial area under the ROC; roc(f), the TPR corresponding to a FPR of f; and rocinv(t), the FPR corresponding to a TPR of t. The tiecorr option can be used to correct for ties between case and control marker observations. It is used by default if pauc(f) is among the summary measures specified.
Consider again the audiology data. Figure 3 shows the ROC curves for the markers DPOAE and TEOAE, both adjusted for age and gender. The covariates are assumed to act linearly on control marker observations, and the marker distributions and ROC curves are estimated empirically. The comproc command yields estimates of the associated ROC curves at a FPR of f = 0.20, as well as pAUC(0.20) and AUC, as shown below. We conclude that there is no evidence of a difference in the percent cases detected when the FPR is 20%. Comparisons based on the pAUC(0.20) and AUC similarly suggest that there is no difference in performance between the two markers.
The comproc command applied to the audiology data yields the following results:
Covariates such as disease severity and specimen storage time can do more than impact marker observations among controls. They often also impact the inherent discriminatory accuracy of the marker (i.e. the ROC curve). That is, they affect the separation between the case and control marker distributions. A hypothetical example is shown in Figure 4. The data can be downloaded from the DABS website. Observe that the separation between the case and control marker distributions is much greater when Z = 0 than when Z = 1. The covariate also affects the distribution of the marker among controls, necessitating covariate adjustment.
ROC regression is a methodology that models the marker’s ROC curve as a function of covariates (Pepe, 2000; Alonzo and Pepe, 2002). Implementation proceeds in two steps: 1) model the distribution of the marker among controls as a function of covariates. Calculate the case percentile values, and; 2) model their CDF (i.e. the ROC curve) as a function of covariates. The result is an estimate of the ROC curve for the marker as a function of covariates, in other words a covariate-specific ROC curve. We emphasize that the covariates used in step (1) for adjustment are those that affect the marker distribution in the control population; these may or may not differ from the covariates that impact the separation between cases and controls, used in step (2).
Estimation of the control marker distribution as a function of covariates and calculation of the case percentile values proceeds in exactly the same manner as with the covariate adjustment method. The standardization options allowed by rocreg are the same as with roccurve and comproc. The covariates may be assumed to act linearly on marker observations, or stratification can be employed if they are discrete. The percentile values can be calculated by estimating the control marker distribution conditional on covariates empirically or by assuming a normal model.
Next, a parametric model is specified for the ROC curve (i.e. the CDF of the case percentile values) as a function of covariates. The forms allowed by the rocreg program are
where g(·) is the standard normal CDF or the logistic function. The parameter α2 allows the covariates to impact the “intercept” of the ROC curve, while α3 allows Z to impact the “slope” of the ROC curve. If α3 ≠ = 0, the covariates have a different impact on the ROC curve at different FPRs. Observe that this ROC model gives rise to binormal (Dorfman and Alf, 1969) or bilogistic (Ogilvie and Creelman, 1968) ROC curves at each fixed value of Z.
In order to fit the ROC regression model, a discrete set of FPR points, f1, …, fnp is chosen. These points may span (0, 1) or a subinterval of interest, (a, b). For each case observation, a set of np records is created. The kth record for the ith subject includes the binary outcome and covariates: g−1(fk), Zi, and Zi × g−1(fk). A binary regression model with link g, outcome U, and covariates: g−1(f), Z, and Z × g−1(f) provides estimates of (α0, α1, α2, α3) (Alonzo and Pepe, 2002). Bootstrapping is used for inference, where the data are resampled according to the design.
For illustration, an ROC regression model was fit for DPOAE using the audiology data. DPOAE observations among controls are assumed to depend linearly on age and gender, and their distribution is estimated empirically. Age-specific ROC curves are modelled parametrically using
Estimates of the age-specific ROC curves are calculated using the parameter estimates (0, 1, 2). Figure 5 shows estimated binormal ROC curves for children at 30, 40, and 50 months of age. This figure suggests that the marker is more accurate among older children, but the effect is not statistically significant (see below).
The syntax of the rocreg command is
rocreg disease_var test_varlist [if] [in] [, options]
where disease_var is the binary outcome and test_varlist is the list of markers.
The options for marker standardization are the same as with roccurve and comproc. Covariates may or may not be used for adjustment.
The option regcov(varlist) specifies the list of covariates that have the same impact on the ROC curve at all FPRs. sregcov(varlist) specifies covariates that impact the ROC curve differently at different FPRs.
The option link governs whether the model assumes a binormal or bilogistic ROC curve at each value of Z. The interval(a b np) option can be used to specify that the model is fit at np points over the restricted FPR interval (a, b).
Bootstrapping is used for inference. The default is that data are resampled conditional on the binary outcome. Bootstrap sampling options are as with roccurve.
Parameter estimates from the ROC regression fit and the corresponding bootstrap covariance matrix are available as bootstrap postestimation results.
The rocreg command applied to the audiology data produced the following results:
Another way of incorporating covariate information is by evaluating incremental value. When Z is a set of known risk factors or other baseline predictors, an obvious question concerns the improvement in classification accuracy associated with adding Y to Z. Note that within this framework, Z is allowed to help in discriminating between cases and controls. This is in contrast to covariate adjustment methods which characterize the classification accuracy of Y conditional on Z.
Incremental value is quantified by comparing the ROC curve for (Y, Z) to the ROC curve for Z alone. The optimal combination of Y and Z for classification is the risk score, P(D = 1|Y, Z) (McIntosh and Pepe, 2002). The risk score can be estimated using a wide variety of binary regression techniques, including logistic regression, logic regression, classification trees, neural networks, and support vector machines.
We propose the following approach to estimating incremental value. First, we fit logistic regression models with and without the marker, Y:
where g(·) = exp(·)/(1 + exp(·)) is the logistic function. Forms other than linear can be employed for the predictors (e.g. splines), and interactions may or may not be included. The primary advantage of using logistic regression is that the linear predictors, g−1(P(D = 1|Y, Z)) and g−1(P(D = 1|Z)), which have the same ROC curves as the risk scores, are consistently estimated (up to constants) with case-control data (Prentice and Pyke, 1979).
Having fit the two regression models, we next calculate the associated predicted values for all subjects in the dataset. We analyze the predicted values on the linear predictor scale where distributional assumptions are more easily verified, and again note that the ROC curves for g−1(P(D = 1)) and P(D = 1) are the same.
The final step is to plot and compare the ROC curves for the linear predictions from the two models. This can be accomplished using the programs roccurve and comproc.
This procedure is simplistic in at least two respects. First, fitting and evaluating models on the same data is known to produce overly optimistic estimates of model performance. Cross-validation could be used to correct for this overoptimism. Second, the bootstrapping implemented in roccurve and comproc conditions on the fitted regression models. This accounts for uncertainty in the ROC estimates, but not in the predicted values. Bootstrapping the entire process, from sampling to risk score estimation to ROC estimation, would be desirable. For simplicity, we ignore these issues here, but plan to implement a Stata program in the near future that incorporates cross-validation and bootstrapping of the model fitting process.
We again use the audiology data to illustrate estimation of incremental value. We evaluate the incremental value of the marker DPOAE over and above age and gender. Figure 6 shows ROC curves for two fitted logistic regression models, one including age and gender, and the other including age, gender, and DPOAE. All covariates are modelled linearly. The ROC curves are estimated empirically (without adjustment for any covariates). We see that DPOAE substantially improves the ability of age and gender to discriminate between hearing impaired and unimpaired ears. The commands used to generate the estimators shown are:
The methods and Stata programs presented here facilitate incorporating covariates into ROC analysis in three distinct ways: by characterizing the performance of the marker conditional on covariates (i.e. covariate adjustment), by allowing the accuracy of the marker to depend on the covariates (using ROC regression), and by examining the improvement in classification accuracy associated with adding the marker to the covariates (incremental value). The representation of the ROC curve as the CDF of standardized case marker observations provides a natural means of incorporating covariate information, and gives rise to parametric, semi-parametric, and non-parametric estimates of the quantities of interest.
We have focused on continuous markers but these methods can also be applied to ordinal markers.
The Stata do-files that produce the output and figures included here can be downloaded from the Stata Journal website using: