Biostatistics. 2016 July; 17(3): 499–522.
Published online 2016 February 16. doi:  10.1093/biostatistics/kxw003
PMCID: PMC4915610

Evaluating and comparing biomarkers with respect to the area under the receiver operating characteristics curve in two-phase case–control studies


Two-phase sampling design, where biomarkers are subsampled from a phase-one cohort sample representative of the target population, has become the gold standard in biomarker evaluation. Many two-phase case–control studies involve biased sampling of cases and/or controls in the second phase. For example, controls are often frequency-matched to cases with respect to other covariates. Ignoring biased sampling of cases and/or controls can lead to biased inference regarding biomarkers' classification accuracy. Considering the problems of estimating and comparing the area under the receiver operating characteristics curve (AUC) for a binary disease outcome, the impact of biased sampling of cases and/or controls on inference and the strategy to efficiently account for the sampling scheme have not been well studied. In this project, we investigate the inverse-probability-weighted method to adjust for biased sampling in estimating and comparing AUC. Asymptotic properties of the estimator and its inference procedure are developed for both Bernoulli sampling and finite-population stratified sampling. In simulation studies, the weighted estimators provide valid inference for estimation and hypothesis testing, while the standard empirical estimators can generate invalid inference. We demonstrate the use of the analytical variance formula for optimizing sampling schemes in biomarker study design and the application of the proposed AUC estimators to examples in HIV vaccine research and prostate cancer research.

Keywords: AUC, Biased sampling, Frequency match, Inverse probability weighting, ROC curve, Two-phase studies

1. Introduction

Recent advances in lab techniques have provided researchers with a rich resource of biomarkers potentially useful for disease diagnosis and risk prediction. It is essential to use proper statistical methods to rigorously evaluate these biomarkers. The receiver operating characteristics (ROC) curve is a standard graphical tool to characterize a biomarker's classification accuracy. The area under the ROC curve (AUC) has been commonly used to gauge and compare biomarkers' performance. In this paper, we consider the evaluation and comparison of biomarkers with respect to AUC for a binary disease outcome using data from two-phase sampling designs. In the first phase, a cohort sample representative of the target population relevant to clinical application is drawn, from which participants' disease status and easy-to-measure covariates are obtained; in the second phase, a subsample is drawn randomly, without replacement, from the phase-one cohort sample for biomarker measurement, where the sampling probability of each individual can depend on other available covariates. In particular, we consider studies where cases and controls in the second phase are separately sampled from the phase-one cohort. This is different from, for example, a case–cohort design for failure time, where cases and a random cohort are separately sampled. These types of designs prospectively collect bio-specimens before outcome ascertainment to minimize systematic differences in specimen collection, and retrospectively sample cases and controls for measuring biomarkers from the stored specimens to save costs. They have been proposed as gold standards for biomarker evaluation (Pepe and others, 2008).

In the second phase of a two-phase study, cases and/or controls are often not simple random samples from their respective distributions. For example, controls are often frequency-matched to cases with respect to other covariates, such as an individual's demographic characteristics (e.g., gender and age group); or cases and controls can be randomly sampled within some covariate strata. The effect of biased sampling on biomarker evaluation varies with the parameter of interest. Janes and Pepe (2009) showed that frequency matching does not invalidate inference on a biomarker's classification accuracy within a matching covariate stratum; however, when the parameter of interest is the biomarker's classification accuracy in the general population, ignoring biased sampling can lead to invalid inference (Pepe and others, 2012). Inverse probability weighting provides a natural solution to account for biased sampling of cases and/or controls when evaluating marker classification performance in the population, e.g., in Pepe and others (2012) for a binary disease outcome, and in Cai and Zheng (2011) for a failure time outcome. Similar weighting strategies to estimate classification accuracy under biased sampling have been adopted by other authors in a different problem setting: correcting verification bias when true disease status (instead of the biomarker) is ascertained only for a subset of subjects (He and others, 2009).

However, asymptotic theory for estimators of AUC, or of the difference in AUC, for a binary disease outcome in two-phase studies where expensive biomarkers are measured only in a subset of the study cohort is lacking, despite the common use of AUC in characterizing and comparing diagnostic tests in biomarker research. This paper aims to fill this gap. In particular, we develop an inverse-probability-weighted (IPW) AUC estimator for two-phase study designs and develop inference procedures for comparing two diagnostic biomarkers. The closed-form expressions for the asymptotic variances of the proposed estimators will be valuable for understanding the implications of frequency matching for the efficiency of biomarker evaluation.

This paper considers two types of two-phase sampling designs: finite-population stratified sampling (Neyman, 1938) and Bernoulli sampling (Manski and Lerman, 1977). The former design is commonly used in biomarker research and is the major focus of this paper. The latter design has the advantage of simplicity and is introduced as a pathway for studying the results in the finite-population stratified sampling design. The two designs differ in how individuals are sampled for biomarker measurement in phase two. Finite-population stratified sampling requires pre-specification of a finite number of covariate strata: a fixed number of cases and/or controls is then sampled from each stratum. It has the advantage that the number of cases and controls sampled from each stratum in phase two can be fixed at the outset. In Bernoulli sampling, each individual is selected with a known sampling probability that can depend on one's disease status and covariate value, independently of other individuals. It works in more general settings without the need to pre-specify a finite number of strata, e.g., when the outcome and/or auxiliary covariates observed in phase one are continuous. The numbers of cases and/or controls sampled in phase two are random in the Bernoulli sampling design. Despite the differences between the two designs, the theoretical results for their estimators are closely connected.

In Section 2, we will start with the problem setting in Bernoulli sampling design and propose IPW estimators of AUC and difference in AUC using estimated sampling weights. We will then investigate estimation under finite-population stratified sampling design and show the connection in theoretical results of AUC estimators between the two designs. In Section 3, we conduct simulation studies to demonstrate performance of our estimators and subsequent inference procedures, compared with the empirical AUC estimator and hypothesis test ignoring biased sampling. In Section 4, we demonstrate, using a numerical example, the use of the analytical variance formula to guide the optimization of biomarker sampling scheme. The application of our proposed AUC estimators will be demonstrated by data examples in HIV vaccine research and prostate cancer research in Section 5. Concluding remarks are made in Section 6.

2. Methods

Let equation M1 be a binary outcome of interest to differentiate. In this paper, we consider equation M2 to be a binary disease outcome, with value 0 and 1 indicating non-diseased and diseased, respectively. We use subscript equation M3 and equation M4 to indicate case and control, respectively. Let equation M5 be a continuous biomarker that is expensive to measure such as a lab assay, assuming that increase in equation M6 is associated with increased likelihood of disease. Suppose we have data collected from a two-phase study to evaluate the marker's classification accuracy. In the first phase, subjects' disease status and covariates that are easy to measure such as demographics are collected from a random sample of size equation M7 from the target population, with equation M8 and equation M9 the number of cases and controls, respectively. In the second phase, the phase-one cohort is further subsampled to measure the biomarker value. Our first objective is to estimate AUC for marker equation M10: equation M11. In addition, suppose there is another continuous marker equation M12 measured together with marker equation M13. Let equation M14. Our second objective is to make inference about the difference in AUC between markers equation M15 and equation M16, i.e., equation M17.

2.1. Bernoulli sampling

2.1.1. Evaluation of a single marker.

First, consider the standard Bernoulli sampling where in phase two individuals are selected independently of others with pre-specified probabilities. For case equation M18 in the phase-one cohort, let equation M19 be the indicator that one's biomarker value is collected in the second phase, with equation M20 the corresponding sampling probability. Similarly, for control equation M21 in the phase-one cohort, let equation M22 and equation M23 indicate whether he/she is sampled in the second phase and the corresponding sampling probability. Note that equation M24 and equation M25 are individual-specific, whose values can depend on covariate value for case equation M26 and control equation M27. For example, suppose phase-two sampling probabilities of cases and controls depend on discrete covariate strata: among phase-one samples, cases are allocated into equation M28 strata with equation M29 cases in stratum equation M30, and controls are allocated into equation M31 strata with equation M32 controls in stratum equation M33; in the second phase, cases in stratum equation M34 are independently sampled with probability equation M35 and controls in stratum equation M36 are independently sampled with probability equation M37. Let equation M38 and equation M39 denote the number of phase-two cases and controls sampled from strata equation M40 and equation M41. They are random numbers with expected values equation M42 and equation M43, respectively. We have equation M44 for case equation M45 belonging to stratum equation M46 and equation M47 for control equation M48 belonging to stratum equation M49.
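The stratum-dependent Bernoulli scheme described above can be illustrated with a minimal sketch. The function names, stratum labels, counts, and probabilities below are hypothetical and chosen only for illustration; they are not values from the paper. The sketch shows that each subject is selected independently, so the phase-two count per stratum is random, while its expectation is the phase-one count times the sampling probability.

```python
import random

def bernoulli_phase_two(strata, prob_by_stratum, rng):
    """Return a 0/1 phase-two selection indicator for each phase-one subject.

    Each subject is selected independently with the probability assigned
    to his/her stratum (hypothetical input: one stratum label per subject).
    """
    return [1 if rng.random() < prob_by_stratum[s] else 0 for s in strata]

def expected_counts(strata, prob_by_stratum):
    """Expected phase-two count per stratum: phase-one count times probability."""
    counts = {}
    for s in strata:
        counts[s] = counts.get(s, 0) + 1
    return {s: counts[s] * prob_by_stratum[s] for s in counts}

# Illustrative values only (assumed, not taken from the paper).
rng = random.Random(2016)
case_strata = ["s1"] * 200 + ["s2"] * 100
case_probs = {"s1": 0.25, "s2": 0.5}
selected = bernoulli_phase_two(case_strata, case_probs, rng)

# The realized number sampled varies from run to run; only its expectation is fixed.
print(sum(selected), expected_counts(case_strata, case_probs))
```

Changing the seed changes the realized counts but not the expected counts, which is the feature that distinguishes this design from sampling fixed numbers per stratum.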

For estimation of population AUC, when phase-two case and control samples are representative of their respective populations, standard empirical AUC estimator (Bamber, 1975) equation M50 using phase-two biomarker data provides a valid estimate. However, when phase-two case and control samples are not representative of their respective populations, equation M51 can be severely biased. For example, it is common in biomarker study designs that simple random samples of cases are drawn in the second phase, while controls are matched to cases by covariate strata such that control biomarkers are not representative of their population. To take care of the biased sampling, we propose the use of a weighted estimator of AUC based on the idea of inverse-probability weighting (Horvitz and Thompson, 1952). In particular, we construct IPW versions for the numerator and denominator of the empirical AUC estimator, where the contribution of each participant to a case–control pair is weighted by inverse of the estimated sampling probability of the participant in phase two:

equation M52

where equation M53 and equation M54 denote the estimated phase-two sampling probabilities for case equation M55 and control equation M56. For a continuous covariate, equation M57 and equation M58 can be estimated by parametric models such as the logistic regression model. For discrete covariate strata, empirical estimates can be used. That is, for a case equation M59 belonging to stratum equation M60, his/her sampling probability equation M61 is estimated with equation M62, the proportion of phase-one cases in stratum equation M63 that are sampled in phase two; similarly, one can estimate equation M64 for control equation M65 in stratum equation M66 with equation M67.
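As a rough illustration of the weighted estimator described above, the following sketch weights every sampled case–control pair by the product of the two subjects' inverse estimated sampling probabilities, in both the numerator and the denominator. The function names and toy data are hypothetical, and counting tied marker values as one half is an assumption borrowed from the usual empirical AUC convention (ties have probability zero for a continuous marker).

```python
from collections import Counter

def stratum_weights(phase_one_strata, phase_two_strata):
    """Inverse of the empirical sampling fraction (phase-two / phase-one count) per stratum."""
    n_phase_one = Counter(phase_one_strata)
    n_phase_two = Counter(phase_two_strata)
    return {s: n_phase_one[s] / n_phase_two[s] for s in n_phase_two}

def ipw_auc(cases, controls):
    """Weighted AUC from phase-two data.

    cases, controls: lists of (marker_value, weight) pairs, where weight is
    the inverse of the subject's estimated phase-two sampling probability.
    """
    numerator = 0.0
    denominator = 0.0
    for x_case, w_case in cases:
        for x_control, w_control in controls:
            pair_weight = w_case * w_control
            if x_case > x_control:
                numerator += pair_weight
            elif x_case == x_control:
                numerator += 0.5 * pair_weight  # tie convention (assumption)
            denominator += pair_weight
    return numerator / denominator

# Toy data (assumed): all cases sampled; 1 of 4 controls sampled in stratum "a",
# 1 of 1 in stratum "b".
control_weights = stratum_weights(["a"] * 4 + ["b"], ["a", "b"])
cases = [(2.0, 1.0), (1.0, 1.0)]
controls = [(0.5, control_weights["a"]), (1.5, control_weights["b"])]

unweighted = ipw_auc(cases, [(x, 1.0) for x, _ in controls])
weighted = ipw_auc(cases, controls)
print(unweighted, weighted)  # 0.75 0.9
```

With all weights equal, the function reduces to the standard empirical AUC; here the under-sampled stratum "a" control is up-weighted, which moves the estimate from 0.75 to 0.9.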

In Bernoulli sampling, the true sampling probability for each individual is known, but using “estimated” weights can improve efficiency (Web Appendices A and B; see supplementary material available at Biostatistics online). This has also been recommended in other problem settings such as weighted likelihood estimators (Robins and others, 1994; Breslow and Wellner, 2007). Intuitively, this holds because using known sampling weights involves only data for subjects sampled at phase two, whereas estimating the weights allows incorporation of phase-one data available for all subjects, e.g., the number of phase-one cases/controls in each stratum in scenarios where the sampling probability in phase two varies across discrete strata.

Suppose we model sampling probabilities of the biomarker among cases and controls separately with finite-dimensional parameters equation M68 and equation M69. Let equation M70 and equation M71 be maximum likelihood estimators. Asymptotic distribution of the IPW AUC estimator based on corresponding estimates of equation M72 and equation M73 is stated below in Theorem 1 (proof in Web Appendix B, see supplementary material available at Biostatistics online).

Theorem 1.

Suppose equation M74 as equation M75 and equation M76 for each case and control; then equation M77 converges asymptotically to a normal random variable with mean 0 and variance

equation M78


equation M79

are information matrices for estimating equation M80 and equation M81, equation M82, equation M83, equation M84, and equation M85.

2.1.2. Comparison of two markers.

For marker equation M86 measured together with equation M87, we can similarly estimate its AUC as equation M88, and estimate the difference in AUC between the two markers with equation M89. Asymptotic distribution of equation M90 is shown in Theorem 2 (proof in Web Appendix C, see supplementary material available at Biostatistics online).

Theorem 2.

Suppose equation M91 as equation M92 and equation M93 for every case and control; then equation M94 converges asymptotically to a normal random variable with mean 0 and variance

equation M95

Previously, many authors have studied inference for comparing AUC between paired markers (Hanley and McNeil, 1983; DeLong and others, 1988; Wieand and others, 1989; Obuchowski and McClish, 1997). These tests were developed, however, for scenarios where cases and controls are randomly sampled from their respective distributions, and thus are not applicable for settings when there is biased sampling associated with cases and/or controls. In contrast, IPW estimator of equation M96 and its analytical variance as presented in Theorem 2 can be used to construct Wald tests for equal AUC between markers, as will be shown later in simulation studies.
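A Wald test of equal AUC can be sketched as follows, assuming that a point estimate of the AUC difference and its standard error are already available. The variance formula of Theorem 2 is not reproduced here, so the standard error is treated as an input, and the numbers in the usage lines are hypothetical rather than results from the paper.

```python
import math

def wald_test(estimate, std_error, null_value=0.0, z_crit=1.959964):
    """Two-sided Wald test and 95% CI under a normal approximation.

    estimate:  point estimate (e.g. a difference in AUC)
    std_error: its estimated standard error (assumed supplied by the analyst)
    """
    z = (estimate - null_value) / std_error
    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    ci = (estimate - z_crit * std_error, estimate + z_crit * std_error)
    return z, p_value, ci

# Hypothetical inputs, for illustration only.
z, p_value, ci = wald_test(0.04, 0.025)
print(round(z, 3), round(p_value, 4), ci)
```

The validity of the test rests entirely on the estimate being approximately unbiased and the standard error being consistent, which is what the weighted estimator and its analytical variance are intended to provide under biased sampling.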

2.2. Finite-population stratified sampling

Now we consider the finite-population stratified sampling design, the design commonly used in biomarker studies. Again suppose cases and controls among phase-one samples are allocated into equation M97 and equation M98 strata, respectively, with number equation M99 and equation M100 in each stratum. Fixed numbers of cases equation M101 and controls equation M102 are then sampled in phase two from these covariate strata to measure the biomarker equation M103. Sampling fractions equation M104 and equation M105 can be random.

Let equation M106 be the stratum indicator among cases taking unique values equation M107 and equation M108 be the stratum indicator among controls taking unique values equation M109. Compute equation M110 and equation M111. In finite-population stratified sampling, sampling probability of a case or control is constant within their corresponding stratum, i.e., equation M112 for case equation M113 in stratum equation M114 and equation M115 for control equation M116 in stratum equation M117. The IPW estimator of AUC for marker equation M118 (2.1) can be equivalently represented as

equation M119
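For contrast with Bernoulli sampling, a minimal sketch of finite-population stratified sampling is given below: a fixed number of subjects is drawn without replacement from each stratum, so the phase-two counts are fixed by design and the sampling probability is constant within a stratum. Function names, stratum labels, and sizes are hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_phase_two(strata, n_by_stratum, rng):
    """Draw exactly n_by_stratum[s] subjects without replacement from each stratum.

    Returns a 0/1 selection indicator for each phase-one subject.
    """
    members = defaultdict(list)
    for idx, s in enumerate(strata):
        members[s].append(idx)
    chosen = set()
    for s, n in n_by_stratum.items():
        chosen.update(rng.sample(members[s], n))
    return [1 if i in chosen else 0 for i in range(len(strata))]

def sampling_fractions(strata, selected):
    """Phase-two count divided by phase-one count, per stratum."""
    n_phase_one = Counter(strata)
    n_phase_two = Counter(s for s, d in zip(strata, selected) if d == 1)
    return {s: n_phase_two[s] / n_phase_one[s] for s in n_phase_one}

# Illustrative values only (assumed, not taken from the paper).
rng = random.Random(7)
control_strata = ["s1"] * 400 + ["s2"] * 100
selected = stratified_phase_two(control_strata, {"s1": 20, "s2": 20}, rng)
fractions = sampling_fractions(control_strata, selected)
print(fractions)  # {'s1': 0.05, 's2': 0.2}
```

The inverse of each fraction (20 and 5 in this toy case) is the weight a sampled control from that stratum would carry in a weighted AUC estimate.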

Suppose, as equation M120, sampling fractions for cases among stratum equation M121 converge with equation M122, and sampling fractions for controls among stratum equation M123 converge with equation M124. Then the asymptotic variance of the IPW AUC estimator (2.2) in finite-population stratified sampling is identical to the asymptotic variance of equation M125 in Bernoulli sampling, if, in phase two of the latter design, cases in stratum equation M126 are sampled independently with probability equation M127 and controls in stratum equation M128 are sampled independently with probability equation M129, with sampling probabilities estimated empirically. A proof is given in Web Appendix C (see supplementary material available at Biostatistics online). The equality in asymptotic variance of equation M130 between the two designs can be similarly derived. The same argument on efficiency of weighted likelihood estimators in Cox regression comparing the two designs was made earlier (Breslow and Wellner, 2007).

3. Simulation studies

In this section, we conduct simulation studies to investigate performance of the proposed IPW estimators of AUC and equation M131. We consider a binary disease outcome equation M132 with prevalence equation M133 in the population. Let equation M134 be a continuous covariate that follows the standard normal distribution among controls equation M135 and equation M136 among cases equation M137. Let equation M138 be a discrete covariate stratum derived from equation M139 with three levels: equation M140 if equation M141, equation M142 if equation M143, and equation M144 if equation M145, where equation M146 is the CDF of the standard normal distribution. We consider two biomarkers equation M147 and equation M148, where equation M149 are jointly normally distributed conditional on equation M150, with equation M151, equation M152, and equation M153 the correlations between equation M154 and equation M155, between equation M156 and equation M157, and between equation M158 and equation M159, respectively, conditional on equation M160. Among controls, equation M161 and equation M162 each follows the standard normal distribution. Among cases, equation M163 follows equation M164 and equation M165 follows equation M166. The ROC curve based on equation M167 and equation M168 individually is thus equation M169 with equation M170 and equation M171 with equation M172.
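For a bi-normal marker model of this kind, the population AUC has a well-known closed form: if the marker is standard normal among controls and normal with mean μ and standard deviation σ among cases, then AUC = Φ(μ / √(1 + σ²)). The sketch below evaluates this formula; the parameter values are hypothetical because the paper's specific settings are not reproduced in the text above.

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def binormal_auc(mean_case, sd_case):
    """AUC when controls are N(0, 1) and cases are N(mean_case, sd_case**2)."""
    return norm_cdf(mean_case / math.sqrt(1.0 + sd_case ** 2))

# Hypothetical parameter values, for illustration only.
auc_a = binormal_auc(1.0, 1.0)
# A marker with larger case variance can share the same AUC if its mean is
# scaled so that mean / sqrt(1 + sd^2) is unchanged.
auc_b = binormal_auc(math.sqrt(2.5), 2.0)
print(round(auc_a, 4), round(auc_b, 4))
```

The second call illustrates that two markers can have the same AUC while having different ROC curves, since the ROC curve depends on the mean and standard deviation separately.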

We generate data from two-phase studies. In the first phase, equation M173 subjects are randomly sampled from the population, and their equation M174 and equation M175 values are measured. In the second phase, we consider both Bernoulli sampling and finite-population stratified sampling of cases and controls for measuring markers equation M176 and equation M177, assuming that they are measured on the same set of subjects. In Bernoulli sampling, cases are sampled independently with a constant probability equation M178, and controls are sampled independently with a probability that depends on the stratum equation M179. In particular, the sampling probability for a control in stratum equation M180 equals equation M181. This ensures that, on average, biomarkers are measured on equal numbers of cases and controls within each stratum. In finite-population stratified sampling, equation M182 cases are sampled without replacement, and then within each equation M183 stratum, the same number of controls as cases in that stratum is drawn without replacement. This type of sampling design, where simple random samples of cases are drawn in the second phase while controls are matched to cases by covariate strata, is common in biomarker research. We also investigate other scenarios where both case and control sampling probabilities in phase two vary across strata. The comparative performance of the various estimators is similar to the results presented below (results omitted).
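The frequency-matching idea in the Bernoulli design can be sketched as follows. The paper's exact expression for the control sampling probability is not reproduced in the text above, so the formula used here (case sampling probability times the stratum's case count, divided by its control count, capped at 1) is an assumption: it is one choice that yields equal expected numbers of sampled cases and controls per stratum. All names and counts are hypothetical.

```python
def matched_control_probs(case_prob, n_cases, n_controls):
    """Control sampling probability per stratum that balances expected counts.

    case_prob:  constant phase-two sampling probability for cases
    n_cases:    dict of phase-one case counts per stratum
    n_controls: dict of phase-one control counts per stratum
    Assumed formula: case_prob * n_cases[s] / n_controls[s], capped at 1.
    """
    probs = {}
    for s in n_cases:
        probs[s] = min(1.0, case_prob * n_cases[s] / n_controls[s])
    return probs

# Illustrative counts only (assumed, not taken from the paper).
n_cases = {"s1": 30, "s2": 50}
n_controls = {"s1": 600, "s2": 400}
control_probs = matched_control_probs(0.5, n_cases, n_controls)

for s in n_cases:
    expected_cases = 0.5 * n_cases[s]
    expected_controls = control_probs[s] * n_controls[s]
    print(s, control_probs[s], expected_cases, expected_controls)
```

Because the control probabilities differ across strata, the sampled controls are not representative of all controls, which is the situation the weighted AUC estimator is meant to correct.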

Based on 5000 Monte Carlo simulations in each setting, we evaluate the performance of AUC estimators for individual markers. We compute equation M184 and equation M185 with sampling probabilities estimated empirically among cases and among controls conditional on sampling strata. These estimators are compared with respect to bias, efficiency, coverage of 95% Wald confidence intervals (CIs) based on analytical variance estimates, and the power to test equation M186. We also evaluate the performance of the corresponding estimators for equation M187. We compare Wald tests based on equation M188 and the common DeLong–DeLong test (DeLong and others, 1988) with respect to type-I error rate and power for testing equation M189. The latter is implemented in the R package pROC (Robin and others, 2011).

3.1. Evaluate performance of a single marker

Table 1 gives the performance of equation M190 estimators for equation M191, with equation M192, equation M193, and the constant case sampling rate equation M194, for both Bernoulli sampling and finite-population stratified sampling. For both designs, the empirical AUC estimator is biased (with 5–15% relative bias), while equation M195 has minimal bias. While negative biases for the empirical AUC estimator were observed for the particular simulation settings presented here, in general this estimator can have either positive or negative bias depending on the setting. Coverage of 95% Wald CIs for equation M196 is close to the nominal level, while the CIs based on the empirical AUC have an undercoverage problem. The Wald test for equation M197 based on equation M198 has type-I error close to the nominal level, while the test based on the empirical AUC has inflated type-I error, with the inflation getting worse as the sample size increases. The equation M199 estimators in the two different sampling designs have similar variances.

Table 1.
Performance of different equation M200 estimators for the bi-normal marker model described in Section 3. Disease prevalence is equation M201. Biomarker equation M202 is standard normal among controls. By equation M203 and equation M204 we indicate the expected number of cases and controls sampled in phase two for Bernoulli ...

3.2. Compare performance between markers

Table 2 shows the performance of equation M348 estimators for both types of sampling designs for settings where equation M349 and equation M350 have the same variance and the same correlation with equation M351 conditional on equation M352: equation M353 and equation M354, where we have equation M355, equation M356, and equation M357. The IPW equation M358 estimator performs well: it has minimal bias, and the coverage of its 95% CI and the type-I error for testing the equivalence in AUC are close to the nominal levels. When markers equation M359 and equation M360 have exactly the same distribution (and consequently the same ROC curve and AUC), the empirical estimator of equation M361 is also unbiased: the biases in equation M362 and equation M363 are equal and thus cancel out, due to the equality in the distribution of the two markers and in their correlation with the matching stratum. The coverage of its 95% CI is close to the nominal level. Type-I error for testing the equivalence in AUC using the DeLong–DeLong test is also close to the nominal level. When equation M364 and equation M365, the empirical equation M366 estimator has small bias, with a magnitude much smaller than that of the equation M367 estimator; its 95% CI has good coverage when the sample size is small but slight undercoverage when the sample size gets large equation M368. Despite the bias in the empirical equation M369, the DeLong–DeLong test for equivalence in AUC can have an advantage in power compared with the IPW estimator in this particular setting, due to the positive bias in the AUC difference.

Table 2.
Performance of various estimators of equation M370 for scenarios where the two markers have same variability and same correlation with covariate equation M371 conditional on equation M372 for the bi-normal model described in Section 3. Disease prevalence is equation M373. Marker equation M374 and equation M375 is each standard ...

Table 3 shows results for equation M405 estimators for settings again with equal variance between equation M406 and equation M407 conditional on equation M408, i.e., equation M409. However, unlike Table 2, where both markers have the same correlation with the covariate stratum, here we fix equation M410 to be 0.5 but vary equation M411 from 0.4 to 0.1. When markers equation M412 and equation M413 have the same distribution and AUC, the empirical estimator of equation M414 is biased because the magnitude of bias differs between equation M415 and equation M416 due to the difference in correlation between each marker and the covariate stratum. Bias is also observed when equation M417. The corresponding 95% CI based on equation M418 has an undercoverage problem. The DeLong–DeLong test also has an inflated type-I error rate. The inflation gets more severe as the difference between equation M419 and equation M420 increases: type-I error can become similar to power for some settings in Table 3 or become even larger than power in some other constructed settings (details omitted). This is due to the bias in the empirical equation M421 estimator, such that the observed difference between two markers equivalent in AUC can appear similar to, or even larger than, the observed difference between two markers that differ in AUC. In practice, when a difference in correlations between marker and stratification variable exists, its magnitude is likely to be small to medium, and thus we expect some, but not extreme, inflation of type-I error when applying the DeLong–DeLong test. In contrast, the IPW estimator of equation M422 is approximately unbiased with coverage of 95% CIs close to the nominal level; the corresponding Wald tests for equivalence in AUC between markers have well-controlled type-I error rates.

Table 3.
Performance of different estimators of equation M423 for scenarios where the two markers have same variability but different correlation with covariate equation M424 conditional on equation M425 for the bi-normal marker model described in Section 3. Disease prevalence is equation M426. Marker equation M427 and equation M428 is ...

Table 4 presents results for settings with equation M505. That is, the variability of marker equation M506 is larger than that of equation M507 among cases. Note that when equation M508, we have equation M509, although the two markers can have different ROC curves when equation M510. When this happens, the DeLong–DeLong test has an inflated type-I error, even when the correlation with the stratification variable is the same for both markers, whereas the Wald test based on the IPW estimator of equation M511 has a well-controlled type-I error rate. For both the scenarios with equation M512 and equation M513, the empirical equation M514 estimator is biased; the corresponding 95% CI undercovers the true parameter value. In contrast, the IPW estimators have minimal bias and good coverage of 95% CIs.

Table 4.
Performance of different estimators of equation M515 for scenarios where the two markers have different variability among cases, for the bi-normal model described in Section 3. Disease prevalence is equation M516. Marker equation M517 and equation M518 are each standard normal among controls. Here we have ...

In Web Appendix H (see supplementary material available at Biostatistics online), we also present additional simulation results when biomarkers follow gamma distributions conditional on disease status. The conclusion regarding the performance of the IPW and the empirical AUC estimators is similar to the bi-normal marker model.

4. Implication on efficiency of sampling schemes

The analytical variance formula we developed in Section 2 will be valuable to biomarker researchers for studying the impact of the sampling scheme on the efficiency of biomarker performance estimators. We demonstrate this using an example comparing two designs with the same number of participants on whom disease outcome, covariate, and biomarker are measured. The setting is similar to that in Section 3, with disease prevalence equation M612. Covariate equation M613 and biomarker equation M614 are bivariate normal with correlation equation M615 conditional on equation M616. Among controls equation M617 and equation M618 are each standard normal; among cases equation M619 and equation M620. Let equation M621 be a binary covariate stratum derived from equation M622, with equation M623 if equation M624 and equation M625 otherwise. We compare two sampling designs. Both are two-phase studies with a random cohort sample of size equation M626 drawn in the first phase. In the second phase, both designs include all cases from the phase-one sample, i.e., equation M627. In Design 1, a simple random sample of controls of size equation M628 is drawn without replacement; in Design 2, simple random samples of controls, equal in number to the cases, are drawn without replacement from each equation M629 stratum. In Design 1, phase-two case and control samples are representative of their respective distributions; thus the empirical estimator of AUC is valid and is considered for this design. In contrast, phase-two case and control samples in Design 2 are not representative of their respective distributions; thus we use the proposed equation M630 with empirically estimated sampling weights conditional on sampling strata for Design 2.

Following Theorem 1 and Section 2.2, the asymptotic variance of equation M631 in Design 2 equals

equation M632

Since equation M633 in Design 1 can be thought of as an equation M634 estimator where there is only one sampling stratum for cases and for controls, its asymptotic variance can be similarly derived as

equation M635

for equation M636. The result also follows from DeLong and others (1988).

Figure 1 shows the relative asymptotic efficiency of equation M637 in Design 2 versus equation M638 in Design 1 for two different equation M639 values, as equation M640 changes from −0.9 to 0.9. A common U-shape is observed, where Design 2 is more efficient as the magnitude of the correlation increases, whereas Design 1 can be more efficient when the correlation is small. Comparing the variance formulae of the two estimators, their difference arises from two components: (i) the difference between equation M642 and equation M643, and (ii) the difference between equation M644 and equation M645. Note that when equation M646 and equation M647 are not correlated among controls, the difference in variance between the two estimators is solely due to component (i): simple random sampling without replacement from controls is more efficient since the weighted average of equation M648 for equation M649 is larger than equation M650 in our example. As equation M651 and equation M652 become more correlated, the variance of equation M653 conditional on the covariate stratum becomes smaller compared with its variance among all controls, and thus stratified sampling is more efficient. This is consistent with the use of stratified sampling in survey sampling as a possible way to increase efficiency in estimating parameters such as the population total when a heterogeneous population can be divided into strata with homogeneous units (Cochran, 2007). The same pattern can be observed when biomarkers conditional on disease status follow gamma distributions (Web Supplementary Figure S3, available at Biostatistics online). In practice, researchers can evaluate the efficiency of different sampling schemes based on prior knowledge of the relationship between marker, covariate, and disease.

Fig. 1.
Efficiency of equation M654 in finite-population stratified sampling (FPS) of controls relative to the empirical AUC estimator in simple random sampling (SRS) without replacement of controls. Efficiency = asymptotic variance of empirical AUC estimator in SRS (equation ...

5. Example

5.1. RV144 example

We illustrate the proposed methodology for estimating and comparing AUC with a real biomarker study from the RV144 Thailand HIV vaccine trial. The trial included 16 402 participants aged 18–30 who were 1:1 randomized into a vaccine and a placebo arm. Among vaccine recipients in the RV144 trial who were not yet infected at week 26, an immune response study was conducted to assess vaccine-induced immune response based on peak immunogenicity at week 26 following a finite-population stratified sampling design (Haynes and others, 2012). Around 1.8% of vaccinees were censored before the end of the study and were treated as non-infected for subsequent sampling. The study includes all 41 vaccinees infected after the week 26 visit. The control vaccinees were selected from a stratified random sample of vaccinees free of HIV-1 infection at 42 months, within strata constructed by gender, number of vaccinations received, and per-protocol status, with five times the number of cases within each stratum.

Two of the primary assays studied, the binding of IgG antibodies to variable regions 1 and 2 (V1V2) of the gp120 Env, and the binding of plasma IgA antibodies to Env, were found to correlate significantly with infection risk (Haynes and others, 2012). Here we evaluate and compare AUC of the two markers. In this application, equation M656 will be HIV infection at 42 months, and equation M657 and equation M658 are V1V2 and IgA measures at week 26, respectively.

First, the naive empirical AUC estimate is 0.573 (95% equation M659) for V1V2, and 0.596 (95% equation M660) for IgA. There is around a 4% relative increase in AUC for IgA compared with V1V2, although the difference is not statistically significant (equation M661-value equation M662 0.774 based on the DeLong–DeLong test). With empirically estimated sampling weights conditional on the matching stratum, the equation M663 equals 0.588 (95% equation M664) for V1V2 and 0.589 (95% equation M665) for IgA. The difference in AUC between the two markers becomes even smaller after accounting for the sampling scheme, with a equation M666-value of 0.997 based on the Wald test. In this example, adjusting for the sampling scheme makes only a small difference in point estimates because the variability in sampling weights of controls across strata is relatively small. The observation that the IPW estimate of the AUC difference between IgA and V1V2 is smaller than the empirical estimate is consistent with observations made in simulation studies, where bias in equation M667 can make two markers look more different when they have equal AUC.
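
The contrast between the naive and the weighted estimate can be illustrated with a minimal Python sketch. It assumes the IPW estimator takes the usual weighted Mann–Whitney form, with each case–control pair weighted by the product of the two inverse sampling fractions, and that the estimated weight in a stratum is the cohort count divided by the number sampled. The function names, the tie convention (ties count one half), and the toy numbers are hypothetical; they are not RV144 data.

```python
import numpy as np

def weighted_auc(case_vals, ctrl_vals, case_w=None, ctrl_w=None):
    """Weighted Mann-Whitney estimate of P(case marker > control marker).

    With no weights supplied this reduces to the empirical AUC.
    Assumption: ties contribute 1/2.
    """
    x = np.asarray(case_vals, dtype=float)
    y = np.asarray(ctrl_vals, dtype=float)
    wx = np.ones_like(x) if case_w is None else np.asarray(case_w, dtype=float)
    wy = np.ones_like(y) if ctrl_w is None else np.asarray(ctrl_w, dtype=float)
    # Pairwise kernel: 1 if case > control, 0.5 if tied, 0 otherwise.
    kernel = (x[:, None] > y[None, :]) + 0.5 * (x[:, None] == y[None, :])
    pair_w = wx[:, None] * wy[None, :]
    return float((pair_w * kernel).sum() / pair_w.sum())

def estimated_weights(strata, cohort_counts):
    """Per-subject weight N_s / n_s: cohort count over sampled count in stratum s."""
    strata = np.asarray(strata)
    labels, sampled = np.unique(strata, return_counts=True)
    weight_by_stratum = {
        s: cohort_counts[s] / n for s, n in zip(labels.tolist(), sampled.tolist())
    }
    return np.array([weight_by_stratum[s] for s in strata.tolist()])

# Toy data: all cases sampled (weight 1); controls sampled unevenly across strata.
cases = [2.0, 3.0]
controls = [1.0, 2.5, 4.0]
ctrl_w = estimated_weights(["a", "a", "b"], {"a": 4, "b": 10})
naive = weighted_auc(cases, controls)
ipw = weighted_auc(cases, controls, ctrl_w=ctrl_w)
```

In this toy example the under-sampled stratum "b" holds the control with the highest marker value, so up-weighting it lowers the AUC estimate relative to the naive one.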

5.2. Prostate cancer study example

In the second example, we consider a prospective study conducted by the Early Detection Research Network that aimed to assess a urine biomarker for prostate cancer, the Prostate Cancer Antigen 3 (PCA3) (Deras and others, 2008). This study involved 570 men enrolled at four North American sites scheduled for prostate biopsy, with a prostate cancer equation M668 prevalence of 36.6%. Urinary PCA3 equation M669 and serum PSA (prostate-specific antigen, equation M670) were obtained from every participant using specimens collected before biopsy. Each biomarker is log-transformed and standardized to have mean zero and variance 1 among subjects without prostate cancer. Among those with prostate cancer, PCA3 and PSA have mean 0.64 and 0.42, and variance 1.0 and 0.83, respectively. Pearson correlations with age are 0.40 for PCA3 and 0.15 for PSA among those without prostate cancer, and 0.24 for PCA3 and 0.28 for PSA among those with prostate cancer. Increasing age also appears to be associated with increased risk of prostate cancer.

To illustrate application of our methodology, we performed finite-population stratified sampling based on age strata generated by the first and second tertiles of the age distribution among controls, i.e., age equation M671 60, 60–67, equation M672 67 years. We randomly sampled 120 cases and then sampled the same number of controls as cases within each age stratum. Based on the case–control sample, the empirical AUC estimate is 0.553 (95% equation M673) for PSA and 0.645 (95% equation M674) for PCA3, with a difference that is not statistically significant (equation M675 with 95% CI (−0.004, 0.188), and equation M677-value equation M678 0.061 based on the DeLong–DeLong test). In contrast, equation M679 with empirically estimated sampling weights conditional on the stratum equals 0.553 (95% equation M680) for PSA and 0.671 (95% equation M681) for PCA3, with a significant difference detected (equation M682 with 95% CI (0.017, 0.206), equation M683-value equation M684 0.021 based on the Wald test). Note that there is a significant difference between the two markers if we estimate the empirical AUC based on the full cohort sample (equation M685, 95% equation M686, equation M687-value equation M688 0.0003 based on the DeLong–DeLong test). Thus, with the case–control sample only, we would have identified the significant difference with the weighted estimator, but would have missed it using the empirical estimator.
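
The sampling step described above can be sketched in Python as follows. This is a hedged illustration rather than the authors' code: the helper name is hypothetical, the cut points 60 and 67 follow the text but the handling of boundary ages is an assumption, and the simulated ages are not the study data.

```python
import numpy as np

def frequency_matched_controls(case_strata, ctrl_strata, rng):
    """Within each stratum, draw as many controls as there are cases,
    without replacement, and return their indices and estimated weights.

    The weight of a sampled control is (controls available in the stratum)
    divided by (controls sampled from the stratum).
    """
    case_strata = np.asarray(case_strata)
    ctrl_strata = np.asarray(ctrl_strata)
    chosen, weights = [], []
    for s in np.unique(case_strata):
        pool = np.flatnonzero(ctrl_strata == s)
        n_take = min(int((case_strata == s).sum()), pool.size)
        if n_take == 0:
            continue  # no controls available in this stratum
        chosen.append(rng.choice(pool, size=n_take, replace=False))
        weights.append(np.full(n_take, pool.size / n_take))
    if not chosen:
        return np.array([], dtype=int), np.array([], dtype=float)
    return np.concatenate(chosen), np.concatenate(weights)

# Simulated ages (hypothetical); strata 0, 1, 2 split at 60 and 67 years.
rng = np.random.default_rng(0)
ctrl_age = rng.uniform(45, 80, size=300)
case_age = rng.uniform(45, 80, size=60)
ctrl_strata = np.digitize(ctrl_age, [60, 67])
case_strata = np.digitize(case_age, [60, 67])
idx, w = frequency_matched_controls(case_strata, ctrl_strata, rng)
```

Within each stratum the weights of the sampled controls sum to the number of controls available there, so the weighted phase-two sample reproduces the phase-one stratum sizes.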

6. Concluding remarks

In this paper, we developed methods for assessing a biomarker's classification accuracy for a binary disease, as characterized by the area under the ROC curve, in two-phase sampling designs. Finite-population stratified sampling of biomarkers from a phase-one cohort representative of the target population has become increasingly common in recent years with the availability of large clinical trials and cohort studies. However, statistical methods to properly handle the design components when assessing biomarkers are not yet well developed. The empirical area under the ROC curve estimated from the phase-two biomarker samples is often reported in the applied literature, even in the presence of biased sampling of cases and/or controls. We showed in this paper that empirical estimators of AUC that ignore the biased sampling scheme can lead to severely biased estimates of classification accuracy and invalid inference for comparing biomarkers. We investigated an inverse-sampling-probability-weighted estimator that achieves unbiased estimation of AUC and developed asymptotic variance formulae applicable to inference in finite-population stratified sampling, through its connection with the IPW estimator in the standard Bernoulli sampling design. The analytical variance formulae we developed will provide valuable guidance to biomarker researchers on optimizing the sampling scheme in designing future biomarker studies, in order to achieve better efficiency in evaluating biomarkers for classification accuracy. For Bernoulli sampling, we observed analytically and numerically that using estimated sampling weights is more efficient even when the true sampling weights are known by design (Web Supplementary Appendix A–C, F–H, see supplementary material available at Biostatistics online). In particular, using estimated weights can lead to much improved efficiency for estimating the performance of a single marker. In contrast, the improvement due to weight estimation is relatively minor for comparing the performance of paired markers; when the sample size is small, the CI of the equation M689 estimator based on known sampling weights can have slightly better coverage than that based on estimated weights.

In this paper, we considered simple design-based weights, the estimation of which does not require information beyond the sampling strata. In etiology studies, it has been shown that when auxiliary variables are available from a phase-one cohort, they can be used to further adjust the weights for potential efficiency gains in estimating the disease odds ratio or hazard ratio (Breslow and others, 2009). It would be interesting in future work to investigate the impact of weight adjustments using auxiliary variables on biomarker performance measures such as the AUC. Finally, the inverse-probability-weighting methods can be naturally applied to other performance measures such as points on the ROC curve and the partial area under the ROC curve.
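
As a rough illustration of how the same weighting idea might carry over to a point on the ROC curve, the sketch below computes inverse-probability-weighted true and false positive rates at a fixed threshold. The paper does not give this estimator explicitly; the function name, the "marker above threshold is positive" convention, and the toy numbers are assumptions made here.

```python
import numpy as np

def weighted_roc_point(case_vals, ctrl_vals, case_w, ctrl_w, threshold):
    """Return (FPR, TPR) at `threshold`, each a weighted proportion.

    Assumption: a subject tests positive when the marker exceeds the threshold,
    and the weights are inverse sampling probabilities.
    """
    x = np.asarray(case_vals, dtype=float)
    y = np.asarray(ctrl_vals, dtype=float)
    wx = np.asarray(case_w, dtype=float)
    wy = np.asarray(ctrl_w, dtype=float)
    tpr = float((wx * (x > threshold)).sum() / wx.sum())
    fpr = float((wy * (y > threshold)).sum() / wy.sum())
    return fpr, tpr

# Toy example: two fully sampled cases, three controls with unequal weights.
fpr, tpr = weighted_roc_point(
    case_vals=[2.0, 3.0],
    ctrl_vals=[1.0, 2.5, 4.0],
    case_w=[1.0, 1.0],
    ctrl_w=[2.0, 2.0, 10.0],
    threshold=2.2,
)
```

Sweeping the threshold over the observed marker values would trace out a weighted ROC curve, and integrating it over a restricted range of false positive rates would give a weighted partial AUC.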


Funding

This work was supported by the U.S. National Institutes of Health grants R01 GM106177-01 and R01 GM54438. Funding to pay the Open Access publication charges for this article was provided by NIH grant R01 106177.


Conflict of Interest: None declared.


  • Bamber D. (1975). The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. Journal of Mathematical Psychology 12(4), 387–415.
  • Breslow N. E., Lumley T., Ballantyne C. M., Chambless L. E., Kulich M. (2009). Improved Horvitz–Thompson estimation of model parameters from two-phase stratified samples: applications in epidemiology. Statistics in Biosciences 1(1), 32–49.
  • Breslow N. E., Wellner J. A. (2007). Weighted likelihood for semiparametric models and two-phase stratified samples, with application to Cox regression. Scandinavian Journal of Statistics 34(1), 86–102.
  • Cai T., Zheng Y. (2011). Evaluating prognostic accuracy of biomarkers in nested case–control studies. Biostatistics 13(1), 89–100.
  • Cochran W. G. (2007). Sampling Techniques. John Wiley & Sons, New York.
  • DeLong E. R., DeLong D. M., Clarke-Pearson D. L. (1988). Comparing the areas under two or more correlated ROC curves: a nonparametric approach. Biometrics 44, 837–845.
  • Deras I. L., Aubin S. M. J., Blase A., Day J. R., Koo S., Partin A. W., Ellis W. J., Marks L. S., Fradet Y., Rittenhouse H. and others (2008). PCA3: a molecular urine assay for predicting prostate biopsy outcome. The Journal of Urology 179(4), 1587–1592.
  • Hanley J. A., McNeil B. J. (1983). A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology 148(3), 839–843.
  • Haynes B. F., Gilbert P. B., McElrath M. J., Zolla-Pazner S., Tomaras G. D., Alam S. M., Evans D. T., Montefiori D. C., Karnasuta C., Sutthent R. and others (2012). Immune-correlates analysis of an HIV-1 vaccine efficacy trial. New England Journal of Medicine 366(14), 1275–1286.
  • He H., Lyness J. M., McDermott M. P. (2009). Direct estimation of the area under the ROC curve in the presence of verification bias. Statistics in Medicine 28(3), 361–376.
  • Horvitz D. G., Thompson D. J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association 47(260), 663–685.
  • Janes H., Pepe M. S. (2009). Adjusting for covariate effects on classification accuracy using the covariate-adjusted ROC curve. Biometrika 96(2), 371–382.
  • Manski C. F., Lerman S. R. (1977). The estimation of choice probabilities from choice based samples. Econometrica 45(8), 1977–1988.
  • Neyman J. (1938). Contribution to the theory of sampling human populations. Journal of the American Statistical Association 33(201), 101–116.
  • Obuchowski N. A., McClish D. K. (1997). Sample size determination for diagnostic accuracy studies involving binormal ROC curve indices. Statistics in Medicine 16(13), 1529–1542.
  • Pepe M. S., Fan J., Seymour C. W., Li C., Huang Y., Feng Z. (2012). Biases introduced by choosing controls to match risk factors of cases in biomarker research. Clinical Chemistry 58(8), 1242–1251.
  • Pepe M. S., Feng Z., Janes H., Bossuyt P. M., Potter J. D. (2008). Pivotal evaluation of the accuracy of a biomarker used for classification or prediction: standards for study design. Journal of the National Cancer Institute 100(20), 1432–1438.
  • Robin X., Turck N., Hainard A., Tiberti N., Lisacek F., Sanchez J., Müller M. (2011). pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics 12(1), 77.
  • Robins J. M., Rotnitzky A., Zhao L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association 89(427), 846–866.
  • Wieand S., Gail M. H., James B. R., James K. L. (1989). A family of nonparametric statistics for comparing diagnostic markers with paired or unpaired data. Biometrika 76(3), 585–592.

Articles from Biostatistics (Oxford, England) are provided here courtesy of Oxford University Press