
Biostatistics. 2016 July; 17(3): 499–522.

Published online 2016 February 16. doi: 10.1093/biostatistics/kxw003

PMCID: PMC4915610

Ying Huang^{*}

Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA and Department of Biostatistics, University of Washington, Seattle, WA 98109, USA

*To whom correspondence should be addressed. Email: yhuang@fhcrc.org

Received 2015 March 24; Revised 2016 January 3; Accepted 2016 January 4.

Copyright © The Author 2016. Published by Oxford University Press.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Two-phase sampling design, where biomarkers are subsampled from a phase-one cohort sample representative of the target population, has become the gold standard in biomarker evaluation. Many two-phase case–control studies involve biased sampling of cases and/or controls in the second phase. For example, controls are often frequency-matched to cases with respect to other covariates. Ignoring biased sampling of cases and/or controls can lead to biased inference regarding biomarkers' classification accuracy. Considering the problems of estimating and comparing the area under the receiver operating characteristics curve (AUC) for a binary disease outcome, the impact of biased sampling of cases and/or controls on inference and the strategy to efficiently account for the sampling scheme have not been well studied. In this project, we investigate the inverse-probability-weighted method to adjust for biased sampling in estimating and comparing AUC. Asymptotic properties of the estimator and its inference procedure are developed for both Bernoulli sampling and finite-population stratified sampling. In simulation studies, the weighted estimators provide valid inference for estimation and hypothesis testing, while the standard empirical estimators can generate invalid inference. We demonstrate the use of the analytical variance formula for optimizing sampling schemes in biomarker study design and the application of the proposed AUC estimators to examples in HIV vaccine research and prostate cancer research.

Recent advances in lab techniques have provided researchers with a rich resource of biomarkers potentially useful for disease diagnosis and risk prediction. It is essential to use proper statistical methods to rigorously evaluate these biomarkers. The receiver operating characteristics (ROC) curve is a standard graphical tool for characterizing a biomarker's classification accuracy. The area under the ROC curve (AUC) is commonly used to gauge and compare biomarkers' performance. In this paper, we consider the evaluation and comparison of biomarkers with respect to AUC for a binary disease outcome using data from two-phase sampling designs. In the first phase, a cohort sample representative of the target population relevant to clinical application is drawn, from which participants' disease status and easy-to-measure covariates are obtained; in the second phase, a subsample is drawn randomly, without replacement, from the phase-one cohort sample for biomarker measurement, where the sampling probability of each individual can depend on other available covariates. In particular, we consider studies where cases and controls in the second phase are separately sampled from the phase-one cohort. This is different from, for example, a case–cohort design for failure time, where cases and a random cohort are separately sampled. These types of designs prospectively collect bio-specimens before outcome ascertainment to minimize systematic differences in specimen collection, and retrospectively sample cases and controls for measuring biomarkers from the stored specimens to save costs. They have been proposed as gold standards for biomarker evaluation (Pepe *and others*, 2008).

In the second phase of a two-phase study, cases and/or controls are often not simple random samples from their respective distributions. For example, controls are often frequency-matched to cases with respect to other covariates, such as an individual's demographic characteristics (e.g., gender or age group); or cases and controls can be randomly sampled within some covariate strata. The effect of biased sampling on biomarker evaluation varies with the parameter of interest. Janes and Pepe (2009) showed that frequency matching does not invalidate inference on a biomarker's classification accuracy within a matching covariate stratum; however, when the parameter of interest is the biomarker's classification accuracy in the general population, ignoring biased sampling can lead to invalid inference (Pepe *and others*, 2012). Inverse probability weighting provides a natural solution to account for biased sampling of cases and/or controls when evaluating marker classification performance in the population, e.g., in Pepe *and others* (2012) for a binary disease outcome, and in Cai and Zheng (2011) for a failure time outcome. Similar weighting strategies for estimating classification accuracy under biased sampling have been adopted by other authors in a different problem setting: correcting verification bias when true disease status (instead of the biomarker) is ascertained only from a subset of subjects (He *and others*, 2009).

However, asymptotic theory for estimators of AUC, or of the difference in AUC, for a binary disease outcome in two-phase studies where expensive biomarkers are measured only in a subset of the study cohort is lacking, despite the widespread use of AUC in characterizing and comparing diagnostic tests in biomarker research. Our research in this paper aims to fill this gap. In particular, we develop an inverse-probability-weighted (IPW) AUC estimator for two-phase study designs and develop inference procedures for comparing two diagnostic biomarkers. The closed-form expressions of asymptotic variances we develop for the proposed estimators will be valuable for understanding the implications of frequency matching for the efficiency of biomarker evaluation.

This paper considers two types of two-phase sampling designs: finite-population stratified sampling (Neyman, 1938) and Bernoulli sampling (Manski and Lerman, 1977). The former design is commonly used in biomarker research and is the major focus of this paper. The latter design has the advantage of simplicity and is introduced as a pathway for studying the results in the finite-population stratified sampling design. The two designs differ in how individuals are sampled for biomarker measurement in phase two. Finite-population stratified sampling requires pre-specification of a finite number of covariate strata: a fixed number of cases and/or controls is then sampled from each stratum. It has the advantage that the number of cases and controls sampled from each stratum in phase two can be fixed at the outset. In Bernoulli sampling, each individual is selected with a known sampling probability that can depend on one's disease status and covariate value, independently of other individuals. It works in more general settings without the need to pre-specify a finite number of strata, e.g., when outcome and/or auxiliary covariates observed in phase one are continuous. The number of cases and/or controls sampled in phase two is random in the Bernoulli sampling design. Despite the differences between the two designs, theoretical results for their estimators are closely connected.

In Section 2, we will start with the problem setting in Bernoulli sampling design and propose IPW estimators of AUC and difference in AUC using *estimated* sampling weights. We will then investigate estimation under finite-population stratified sampling design and show the connection in theoretical results of AUC estimators between the two designs. In Section 3, we conduct simulation studies to demonstrate performance of our estimators and subsequent inference procedures, compared with the empirical AUC estimator and hypothesis test ignoring biased sampling. In Section 4, we demonstrate, using a numerical example, the use of the analytical variance formula to guide the optimization of biomarker sampling scheme. The application of our proposed AUC estimators will be demonstrated by data examples in HIV vaccine research and prostate cancer research in Section 5. Concluding remarks are made in Section 6.

Let be a binary outcome of interest to differentiate. In this paper, we consider to be a binary disease outcome, with value 0 and 1 indicating non-diseased and diseased, respectively. We use subscript and to indicate case and control, respectively. Let be a continuous biomarker that is expensive to measure such as a lab assay, assuming that increase in is associated with increased likelihood of disease. Suppose we have data collected from a two-phase study to evaluate the marker's classification accuracy. In the first phase, subjects' disease status and covariates that are easy to measure such as demographics are collected from a random sample of size from the target population, with and the number of cases and controls, respectively. In the second phase, the phase-one cohort is further subsampled to measure the biomarker value. Our first objective is to estimate AUC for marker : . In addition, suppose there is another continuous marker measured together with marker . Let . Our second objective is to make inference about the difference in AUC between markers and , i.e., .

First, consider the standard Bernoulli sampling where in phase two individuals are selected independently of others with pre-specified probabilities. For case in the phase-one cohort, let be the indicator that one's biomarker value is collected in the second phase, with the corresponding sampling probability. Similarly, for control in the phase-one cohort, let and indicate whether he/she is sampled in the second phase and the corresponding sampling probability. Note that and are individual-specific, whose values can depend on covariate value for case and control . For example, suppose phase-two sampling probabilities of cases and controls depend on discrete covariate strata: among phase-one samples, cases are allocated into strata with cases in stratum , and controls are allocated into strata with controls in stratum ; in the second phase, cases in stratum are independently sampled with probability and controls in stratum are independently sampled with probability . Let and denote the number of phase-two cases and controls sampled from strata and . They are random numbers with expected values and , respectively. We have for case belonging to stratum and for control belonging to stratum .
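The stratum-based Bernoulli scheme just described can be sketched in Python. This is a minimal illustration rather than the paper's code: the function name `bernoulli_phase_two`, the integer coding of strata, and the example probabilities are all assumptions made for the sketch. It draws one independent trial per phase-one subject and returns both the design probability and the empirical sampling fraction of each subject's stratum.

```python
import numpy as np

def bernoulli_phase_two(strata, probs, rng):
    """Independent Bernoulli selection of phase-one subjects into phase two.

    strata : integer stratum label (0..K-1) for each phase-one subject
    probs  : length-K array of pre-specified sampling probabilities
    Returns (selected, p_true, p_hat): the selection indicators, each
    subject's design probability, and the empirical sampling fraction
    (number sampled / number available) of that subject's stratum.
    """
    strata = np.asarray(strata)
    probs = np.asarray(probs, dtype=float)
    p_true = probs[strata]
    # One independent Bernoulli trial per subject.
    selected = rng.random(strata.size) < p_true
    n_total = np.bincount(strata, minlength=probs.size)
    n_sampled = np.bincount(strata[selected], minlength=probs.size)
    # Empty strata get NaN instead of raising a division warning.
    with np.errstate(invalid="ignore", divide="ignore"):
        frac = np.where(n_total > 0, n_sampled / n_total, np.nan)
    return selected, p_true, frac[strata]

# Hypothetical example: three strata with 200, 300 and 500 subjects.
rng = np.random.default_rng(0)
strata = np.repeat([0, 1, 2], [200, 300, 500])
selected, p_true, p_hat = bernoulli_phase_two(strata, [0.5, 0.3, 0.1], rng)
```

Because each subject is drawn independently, the number sampled per stratum is random; only its expectation is fixed by the design.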

For estimation of the population AUC, when phase-two case and control samples are representative of their respective populations, the standard empirical AUC estimator (Bamber, 1975) using phase-two biomarker data provides a valid estimate. However, when phase-two case and control samples are not representative of their respective populations, this estimator can be severely biased. For example, it is common in biomarker study designs that simple random samples of cases are drawn in the second phase, while controls are matched to cases by covariate strata such that control biomarkers are not representative of their population. To account for the biased sampling, we propose the use of a weighted estimator of AUC based on the idea of inverse-probability weighting (Horvitz and Thompson, 1952). In particular, we construct IPW versions of the numerator and denominator of the empirical AUC estimator, where the contribution of each participant to a case–control pair is weighted by the inverse of the *estimated* sampling probability of the participant in phase two:

(2.1)
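The display for (2.1) is not reproduced here. As a sketch only, the estimator described in the preceding paragraph can be written in assumed notation, which may differ from the paper's own: $X$ is the marker, subscripts $D$ and $\bar D$ denote cases and controls, $\delta$ is the phase-two selection indicator, $\hat p$ is the estimated sampling probability, and $i$ and $j$ index the $N_D$ phase-one cases and $N_{\bar D}$ phase-one controls.

```latex
\widehat{\mathrm{AUC}}_{\mathrm{IPW}}
  = \frac{\displaystyle\sum_{i=1}^{N_D}\sum_{j=1}^{N_{\bar D}}
          \frac{\delta_{Di}\,\delta_{\bar Dj}}{\hat p_{Di}\,\hat p_{\bar Dj}}\,
          I\!\left(X_{Di} > X_{\bar Dj}\right)}
         {\displaystyle\sum_{i=1}^{N_D}\sum_{j=1}^{N_{\bar D}}
          \frac{\delta_{Di}\,\delta_{\bar Dj}}{\hat p_{Di}\,\hat p_{\bar Dj}}}
```

With all sampling probabilities equal, the weights cancel and the expression reduces to the empirical AUC, the proportion of sampled case–control pairs in which the case has the larger marker value.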

where and indicate estimated phase-two sampling probabilities for case and control . For continuous covariate, and can be estimated by parametric models such as the logistic regression model. For discrete covariate strata, their empirical estimates can be derived. That is, for a case belonging to stratum , his/her sampling probability is estimated with , the proportion of phase-one cases sampled in phase two from stratum ; similarly one can estimate for control in stratum with .
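A weighted AUC of this kind takes only a few lines of NumPy. The sketch below is an illustration under stated assumptions, not the paper's implementation: the function name `ipw_auc` is hypothetical, the inputs are the marker values and estimated sampling probabilities of the phase-two subjects only, and ties are counted as one half (a convention added here; the text assumes a continuous marker).

```python
import numpy as np

def ipw_auc(x_case, x_ctrl, p_case=None, p_ctrl=None):
    """Inverse-probability-weighted AUC over sampled case-control pairs.

    x_case, x_ctrl : marker values of phase-two cases and controls
    p_case, p_ctrl : their (estimated) phase-two sampling probabilities;
                     None means equal weights, i.e. the empirical AUC.
    """
    x_case = np.asarray(x_case, dtype=float)
    x_ctrl = np.asarray(x_ctrl, dtype=float)
    w_case = np.ones_like(x_case) if p_case is None else 1.0 / np.asarray(p_case, dtype=float)
    w_ctrl = np.ones_like(x_ctrl) if p_ctrl is None else 1.0 / np.asarray(p_ctrl, dtype=float)
    # Pairwise comparison: 1 if the case value exceeds the control value,
    # 0.5 for ties (assumed convention), 0 otherwise.
    diff = x_case[:, None] - x_ctrl[None, :]
    concordant = (diff > 0) + 0.5 * (diff == 0)
    # Each pair is weighted by the product of the two inverse probabilities.
    pair_w = w_case[:, None] * w_ctrl[None, :]
    return float((pair_w * concordant).sum() / pair_w.sum())

# Toy data: the first control was sampled twice as often as the other two.
cases = [2.0, 3.0]
controls = [1.0, 2.5, 4.0]
unweighted = ipw_auc(cases, controls)                          # 3 of 6 pairs concordant
weighted = ipw_auc(cases, controls, p_ctrl=[0.5, 0.25, 0.25])  # weighted share is 8 / 20
```

In the toy data the oversampled control has the lowest marker value, so down-weighting it lowers the estimate from 0.5 to 0.4.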

In Bernoulli sampling, the true sampling probability for each individual is known, but using “estimated” weights can improve efficiency (Web Appendices A and B, see supplementary material available at *Biostatistics* online). This has also been recommended in other problem settings such as weighted likelihood estimators (Robins *and others*, 1994; Breslow and Wellner, 2007). Intuitively, this holds because using known sampling weights involves data only for subjects sampled at phase two, whereas estimating the weights allows incorporation of phase-one data available for all subjects, e.g., the number of phase-one cases/controls in each stratum in scenarios where the sampling probability in phase two varies across discrete strata.

Suppose we model sampling probabilities of the biomarker among cases and controls separately with finite-dimensional parameters and . Let and be maximum likelihood estimators. Asymptotic distribution of the IPW AUC estimator based on corresponding estimates of and is stated below in Theorem 1 (proof in Web Appendix B, see supplementary material available at *Biostatistics* online).

Theorem 1. Suppose as and for each case and control; then converges asymptotically to a normal random variable with mean 0 and variance

where

are information matrices for estimating and , , , , and .

For marker measured together with , we can similarly estimate its AUC as , and estimate the difference in AUC between the two markers with . Asymptotic distribution of is shown in Theorem 2 (proof in Web Appendix C, see supplementary material available at *Biostatistics* online).

Theorem 2. Suppose as and for every case and control; then converges asymptotically to a normal random variable with mean 0 and variance

Many authors have previously studied inference for comparing AUC between paired markers (Hanley and McNeil, 1983; DeLong *and others*, 1988; Wieand *and others*, 1989; Obuchowski and McClish, 1997). These tests were developed, however, for scenarios where cases and controls are randomly sampled from their respective distributions, and thus are not applicable to settings where there is biased sampling of cases and/or controls. In contrast, the IPW estimator of the AUC difference and its analytical variance, as presented in Theorem 2, can be used to construct Wald tests for equal AUC between markers, as will be shown later in simulation studies.
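Given a point estimate of the AUC difference and an estimate of its variance, a Wald test and confidence interval follow from the normal approximation. The helper below is a generic sketch using only the Python standard library; the name `wald_test` and the example numbers are hypothetical, and `variance` is assumed to be the variance of the estimate itself (already scaled by sample size), not the asymptotic variance appearing in the theorem.

```python
import math
from statistics import NormalDist

def wald_test(estimate, variance, null_value=0.0, level=0.95):
    """Two-sided Wald test and confidence interval under a normal approximation."""
    se = math.sqrt(variance)
    z = (estimate - null_value) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    # Normal quantile for the requested confidence level.
    q = NormalDist().inv_cdf(0.5 + level / 2.0)
    return {"z": z, "p_value": p_value, "ci": (estimate - q * se, estimate + q * se)}

# Hypothetical numbers: estimated difference 0.10 with standard error 0.05.
result = wald_test(0.10, 0.05 ** 2)
```

Here the z-statistic is 2, so the two-sided p-value is about 0.046 and the 95% interval just excludes zero.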

Now we consider the finite-population stratified sampling design, the design commonly used in biomarker studies. Again suppose cases and controls among phase-one samples are allocated into and strata, respectively, with number and in each stratum. Fixed numbers of cases and controls are then sampled in phase two from these covariate strata to measure the biomarker . Sampling fractions and can be random.

Let be the stratum indicator among cases taking unique values and be the stratum indicator among controls taking unique values . Compute and . In finite-population stratified sampling, sampling probability of a case or control is constant within their corresponding stratum, i.e., for case in stratum and for control in stratum . The IPW estimator of AUC for marker (2.1) can be equivalently represented as

(2.2)

Suppose, as , sampling fractions for cases among stratum converge with , and sampling fractions for controls among stratum converge with . Then the asymptotic variance of the IPW AUC estimator (2.2) in finite-population stratified sampling is identical to the asymptotic variance of in Bernoulli sampling, if, in phase two of the latter design, cases in stratum are sampled independently with probability and controls in stratum are sampled independently with probability , with sampling probabilities estimated empirically. A proof is given in Web Appendix C (see supplementary material available at *Biostatistics* online). The equality in asymptotic variance of between the two designs can be similarly derived. The same argument on efficiency of weighted likelihood estimators in Cox regression comparing the two designs was made earlier (Breslow and Wellner, 2007).

In this section, we conduct simulation studies to investigate performance of the proposed IPW estimators of AUC and . We consider a binary disease outcome with prevalence in the population. Let be a continuous covariate that follows the standard normal distribution among controls and among cases . Let be a discrete covariate stratum derived from with three levels: if , if , and if , where is the CDF of the standard normal distribution. We consider two biomarkers and , where are jointly normally distributed conditional on , with , , and the correlations between and , between and , and between and , respectively, conditional on . Among controls, and each follows the standard normal distribution. Among cases, follows and follows . The ROC curve based on and individually is thus with and with .
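For normally distributed markers, the ROC curve and AUC have closed forms, which is what makes the true values available in simulations of this kind. The sketch below uses the standard binormal formulas; the function names and the parameter values in the example are assumptions for illustration, not the settings used in the paper.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def binormal_auc(mu_case, sd_case, mu_ctrl=0.0, sd_ctrl=1.0):
    """AUC = P(X_case > X_ctrl) for independent normal case and control markers."""
    return _STD_NORMAL.cdf((mu_case - mu_ctrl) / math.sqrt(sd_case ** 2 + sd_ctrl ** 2))

def binormal_roc(t, mu_case, sd_case, mu_ctrl=0.0, sd_ctrl=1.0):
    """True-positive rate at false-positive rate t (0 < t < 1): Phi(a + b * Phi^{-1}(t))."""
    a = (mu_case - mu_ctrl) / sd_case
    b = sd_ctrl / sd_case
    return _STD_NORMAL.cdf(a + b * _STD_NORMAL.inv_cdf(t))

# Hypothetical marker: N(1, 1) among cases, N(0, 1) among controls.
auc = binormal_auc(1.0, 1.0)        # Phi(1 / sqrt(2)), about 0.760
tpr = binormal_roc(0.5, 1.0, 1.0)   # Phi(1), about 0.841
```

Two markers with different case variances can share the same AUC while having different ROC curves, because the AUC depends on the parameters only through the ratio inside `binormal_auc`.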

We generate data from two-phase studies. In the first phase, subjects are randomly sampled from the population, whose and values are measured. In the second phase, we considered both Bernoulli sampling and finite-population stratified sampling of cases and controls for measuring markers and , assuming that they are measured on the same set of subjects. In Bernoulli sampling, cases are sampled independently with a constant probability , and controls are sampled independently with a probability that depends on the stratum . In particular, the sampling probability for a control in stratum equals . This ensures that, on average, biomarkers are measured on equal numbers of cases and controls within each stratum. In finite-population stratified sampling, cases are sampled without replacement, and then within each stratum, the same number of controls as cases in that stratum are drawn without replacement. This type of sampling design where simple random samples of cases are drawn in the second phase while controls are matched to cases by covariate strata is common in biomarker research. We also investigate other scenarios where both case and control sampling probabilities in phase two vary across strata. The comparative performance of various estimators is similar to the results we will present below (results omitted).
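The finite-population stratified scheme with frequency-matched controls can be sketched as follows. This is an illustration under assumptions, not the simulation code used for the paper: the function names, the integer stratum coding (with every label 0..K-1 present among controls), and the cohort sizes in the example are all hypothetical.

```python
import numpy as np

def matched_stratified_sample(case_strata, ctrl_strata, n_cases, rng):
    """Sample cases by simple random sampling without replacement, then draw,
    within each stratum, as many controls as there are sampled cases."""
    case_strata = np.asarray(case_strata)
    ctrl_strata = np.asarray(ctrl_strata)
    case_idx = rng.choice(case_strata.size, size=n_cases, replace=False)
    sampled_case_strata = case_strata[case_idx]
    ctrl_parts = []
    for k in np.unique(sampled_case_strata):
        pool = np.flatnonzero(ctrl_strata == k)
        need = int(np.sum(sampled_case_strata == k))
        if need > pool.size:
            raise ValueError(f"stratum {k}: need {need} controls, only {pool.size} available")
        ctrl_parts.append(rng.choice(pool, size=need, replace=False))
    return case_idx, np.concatenate(ctrl_parts)

def stratum_fractions(strata, sampled_idx):
    """Empirical sampling fraction (sampled / available) in each stratum."""
    strata = np.asarray(strata)
    n_total = np.bincount(strata)
    n_sampled = np.bincount(strata[sampled_idx], minlength=n_total.size)
    return n_sampled / n_total

# Hypothetical phase-one cohort: 100 cases and 2000 controls in three strata.
rng = np.random.default_rng(1)
case_strata = rng.integers(0, 3, size=100)
ctrl_strata = rng.integers(0, 3, size=2000)
case_idx, ctrl_idx = matched_stratified_sample(case_strata, ctrl_strata, 50, rng)
ctrl_fractions = stratum_fractions(ctrl_strata, ctrl_idx)
```

The per-stratum control fractions returned by `stratum_fractions` are the empirical sampling probabilities whose inverses serve as weights; they differ across strata whenever the case and control stratum distributions differ.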

Based on 5000 Monte Carlo simulations in each setting, we evaluate performance of AUC estimators for individual markers. We compute , and with sampling probabilities estimated empirically among cases and among controls conditional on sampling strata. These estimators are compared with respect to bias, efficiency, coverage of 95% Wald confidence intervals (CIs) based on analytical variance estimates, and the power to test . We also evaluate performance of corresponding estimators for . We compare Wald tests based on and the common Delong–Delong test (DeLong *and others*, 1988) with respect to type-I error rate and power for testing . The latter is implemented in the R package pROC (Robin *and others*, 2011).

Table 1 gives performance of estimators for , with , , and the constant case sampling rate , for both Bernoulli sampling and finite-population stratified sampling. For both designs, the empirical AUC estimator is biased (with 5–15% relative bias), while the IPW estimator has minimal bias. Although negative biases for the empirical AUC estimator were observed in the particular simulation settings presented here, in general this estimator can be biased in either direction depending on the setting. Coverage of 95% Wald CIs based on the IPW estimator is close to the nominal level, while the CIs based on the empirical AUC undercover. The Wald test for based on the IPW estimator has type-I error close to the nominal level, while the test based on the empirical AUC has inflated type-I error, with the inflation worsening as sample size increases. The estimators in the two different sampling designs have similar variances.

Table 2 shows performance of estimators for both types of sampling designs for settings where and have the same variance and same correlation with conditional on : and , where we have , , and . The IPW estimator performs well: minimal bias, and coverage of 95% CI and type-I error for testing the equivalence in AUC close to the nominal level. When markers and have exactly the same distribution (and consequently the same ROC curve and AUC), the empirical estimator of is also unbiased: the biases in and are equal and thus cancel out, due to the equality in the distribution of the two markers and in their correlation with the matching stratum. The coverage of its 95% CI is close to the nominal level. Type-I error for testing the equivalence in AUC using the Delong–Delong test is also close to the nominal level. When and , the empirical estimator has small bias, with a magnitude much smaller compared with that of the estimator; its 95% CI has good coverage when sample size is small but slight undercoverage when sample size gets large . Despite the bias in empirical , the Delong–Delong test for equivalence in AUC can have an advantage in power compared with the IPW estimator in this particular setting, due to the positive bias in the AUC difference.

Table 3 shows results of estimators for settings again with equal variance between and conditional on , i.e., . However, unlike Table 2 where both markers have the same correlation with the covariate stratum, here we fix to be 0.5 but vary from 0.4 to 0.1. When markers and have the same distribution and AUC, the empirical estimator of is biased because the magnitude of bias is different between and due to the difference in correlation between each marker and the covariate stratum. Bias is also observed when . The corresponding 95% CI based on undercovers. The Delong–Delong test also has an inflated type-I error rate. The inflation gets more severe as the difference between and increases: type-I error can become similar to power for some settings in Table 3 or become even larger than power in some other constructed settings (details omitted). This is due to the bias in the empirical estimator, such that the observed difference between two markers equivalent in AUC can appear similar to, or even larger than, the observed difference between two markers that differ in AUC. In practice, when a difference in correlations between marker and stratification variable exists, its magnitude is likely to be small to medium, and thus we expect some, but not extreme, inflation of type-I error when applying the Delong–Delong test. In contrast, the IPW estimator of is approximately unbiased with coverage of 95% CIs close to the nominal level; corresponding Wald tests for equivalence in AUC between markers have well-controlled type-I error rates.

Table 4 presents results for settings with . That is, the variability of marker is larger than that of among cases. Note that when , we have , although the two markers can have different ROC curves when . When this happens, the Delong–Delong test has an inflated type-I error, even when correlation with the stratification variable is the same for both markers; whereas the Wald test based on the IPW estimator of has a well-controlled type-I error rate. For both the scenarios with and , the empirical estimator is biased; the corresponding 95% CI undercovers the true parameter value. In contrast, the IPW estimators have minimal bias and good coverage of 95% CIs.

In Web Appendix H (see supplementary material available at *Biostatistics* online), we also present additional simulation results when biomarkers follow gamma distributions conditional on disease status. The conclusion regarding the performance of the IPW and the empirical AUC estimators is similar to the bi-normal marker model.

The analytical variance formula we developed in Section 2 will be valuable to biomarker researchers for studying the impact of the sampling scheme on the efficiency of biomarker performance estimators. We demonstrate this using an example comparing two designs with the same number of participants measuring disease outcome, covariate, and biomarker. The setting is similar to that in Section 3, with disease prevalence . Covariate and biomarker are bivariate normal with correlation conditional on . Among controls and are each standard normal; among cases and . Let be a binary covariate stratum derived from , with if and otherwise. We compare two sampling designs. Both are two-phase studies with a random cohort sample of size drawn in the first phase. In the second phase, both designs include all cases from the phase-one sample, i.e., ; in Design 1, a simple random sample of controls of size is drawn without replacement, whereas in Design 2, simple random samples of controls equal in number to the cases are drawn without replacement from each stratum. In Design 1, phase-two case and control samples are representative of their respective distributions; thus the empirical estimator of AUC is valid and is considered for this design. In contrast, phase-two case and control samples in Design 2 are not representative of their respective distributions; thus we use the proposed IPW estimator with empirically estimated sampling weights conditional on sampling strata for Design 2.

Following Theorem 1 and Section 2.4, asymptotic variance of in Design 2 equals

(4.1)

Since in Design 1 can be thought of as an estimator where there is only one sampling stratum for cases and for controls, its asymptotic variance can be similarly derived as

(4.2)

for . The result also follows DeLong *and others* (1988).

Figure 1 shows the relative asymptotic efficiency of in Design 2 versus in Design 1 for two different values, as the correlation changes from −0.9 to 0.9. A common U-shape is observed, where Design 2 is more efficient as the magnitude of the correlation increases, whereas Design 1 can be more efficient when the correlation is small. Comparing variance formulae of the two estimators, their difference arises from two components: (i) the difference between and , and (ii) the difference between and . Note that when and are not correlated among controls, the difference in variance between the two estimators is solely due to component (i): simple random sampling without replacement from controls is more efficient since the weighted average of for is larger than in our example. As and become more correlated, variance of conditional on the covariate stratum becomes smaller compared with its variance among all controls, and thus stratified sampling is more efficient. This is consistent with the use of stratified sampling in survey sampling as a possible way to increase efficiency in estimating parameters such as the population total when a heterogeneous population can be divided into strata with homogeneous units (Cochran, 2007). The same pattern can be observed when biomarkers conditional on disease status follow gamma distributions (Web Supplementary Figure S3, available at *Biostatistics* online). In practice, researchers can evaluate the efficiency of different sampling schemes based on prior knowledge of the relationship between marker, covariate, and disease.
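The claim that the within-stratum variance shrinks as the correlation grows can be checked with a short calculation. The sketch uses assumed notation and an assumed cut-point, neither taken from the paper: among controls, covariate $W$ and marker $X$ are standard bivariate normal with correlation $\rho$, and the binary stratum splits at $W > 0$.

```latex
\begin{align*}
E(W \mid W > 0) &= \sqrt{2/\pi}, \qquad
\mathrm{Var}(W \mid W > 0) = 1 - \frac{2}{\pi},\\
X &= \rho W + \sqrt{1-\rho^{2}}\,\varepsilon, \qquad
\varepsilon \sim N(0,1) \text{ independent of } W,\\
\mathrm{Var}(X \mid W > 0) &= \rho^{2}\Bigl(1 - \frac{2}{\pi}\Bigr) + (1-\rho^{2})
  = 1 - \frac{2\rho^{2}}{\pi}.
\end{align*}
```

By symmetry the same holds for $W \le 0$. Under these assumptions the within-stratum variance of the marker falls from 1 at $\rho = 0$ to about 0.36 as $|\rho|$ approaches 1, while the overall variance among controls stays at 1.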

We illustrate the proposed methodology for estimating and comparing AUC with a real example of a biomarker study from the RV144 Thailand HIV vaccine trial. The trial included 16402 participants aged 18–30 who were randomized 1:1 to a vaccine arm and a placebo arm. Among vaccine recipients in the RV144 trial who were not yet infected at week 26, an immune response study was conducted to assess vaccine-induced immune response based on peak immunogenicity at week 26, following a finite-population stratified sampling design (Haynes *and others*, 2012). Around 1.8% of vaccinees were censored before the end of the study and were treated as non-infected for subsequent sampling. The study includes all 41 vaccinees infected after the week 26 visit. The control vaccinees were selected as a stratified random sample of vaccinees free of HIV-1 infection at 42 months, within strata constructed by gender, number of vaccinations received, and per-protocol status, with five times the number of cases within each stratum.

Two of the primary assays studied, the binding of IgG antibodies to variable regions 1 and 2 (V1V2) of the gp120 Env, and the binding of plasma IgA antibodies to Env, were found to correlate significantly with infection risk (Haynes *and others*, 2012). Here we evaluate and compare AUC of the two markers. In this application, will be HIV infection at 42 months, and and are V1V2 and IgA measures at week 26, respectively.

First, the naive empirical AUC estimate is 0.573 (95% ) for V1V2, and 0.596 (95% ) for IgA. There is around a 4% relative increase in AUC for IgA compared with V1V2, although the difference is not statistically significant (p-value = 0.774 based on the Delong–Delong test). With empirically estimated sampling weights conditional on the matching stratum, the IPW AUC estimate equals 0.588 (95% ) for V1V2 and 0.589 (95% ) for IgA. The difference in AUC between the two markers becomes even smaller after accounting for the sampling scheme, with a p-value of 0.997 based on the Wald test. In this example, adjusting for the sampling scheme makes a small difference in point estimates due to the relatively small variability in sampling weights of controls across strata; the observation that the IPW estimate of the AUC difference between IgA and V1V2 is smaller than the empirical estimate is consistent with observations made in simulation studies, where bias in the empirical estimator can make two markers look more different when they have equal AUC.

In the second example, we consider a prospective study conducted by the Early Detection Research Network to assess a urine biomarker for prostate cancer, the Prostate Cancer Antigen 3 (PCA3) (Deras *and others*, 2008). This study involved 570 men enrolled at four North American sites scheduled for prostate biopsy, with a prostate cancer prevalence of 36.6%. Urinary PCA3 and serum PSA (prostate-specific antigen) are obtained from every participant using specimens collected before biopsy. Each biomarker is log-transformed and standardized to have mean zero and variance 1 among subjects without prostate cancer. Among those with prostate cancer, PCA3 and PSA have mean 0.64 and 0.42, and variance 1.0 and 0.83, respectively. Pearson correlations with age are 0.40 for PCA3 and 0.15 for PSA among those without prostate cancer, and 0.24 for PCA3 and 0.28 for PSA among those with prostate cancer. Increase in age also appears to be associated with increased risk of prostate cancer.

To illustrate application of our methodology, we perform a finite-population stratified sampling based on age strata generated by the first and second tertiles of the age distribution among controls, i.e., age <60, 60–67, >67 years. We randomly sampled 120 cases and then sampled the same number of controls as cases within each age stratum. Based on the case–control sample, the empirical AUC estimate is 0.553 (95% ) for PSA and 0.645 (95% ) for PCA3, with a difference not statistically significant ( with 95% CI (−0.004, 0.188), and p-value = 0.061 based on the Delong–Delong test). In contrast, the IPW estimate with empirically estimated sampling weights conditional on the stratum equals 0.553 (95% ) for PSA and 0.671 (95% ) for PCA3, with a significant difference detected ( with 95% CI (0.017, 0.206), p-value = 0.021 based on the Wald test). Note that there is a significant difference between the two markers if we estimate the empirical AUC based on the full cohort sample (, 95% , p-value = 0.0003 based on the Delong–Delong test). Thus, with the case–control sample only, we would have identified the significant difference with the weighted estimator, but would have missed it using the empirical estimator.
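Constructing strata from tertiles of a reference group is a small step that is easy to get wrong at the boundaries. The sketch below shows one way to do it with NumPy; the function name `tertile_strata`, the toy ages, and the boundary convention (values equal to a cut-point go to the upper stratum) are assumptions for illustration, not the rule used in the study.

```python
import numpy as np

def tertile_strata(values, reference):
    """Assign each value to stratum 0, 1 or 2 using the first and second
    tertiles of `reference` (e.g. ages among controls) as cut-points."""
    cuts = np.quantile(reference, [1 / 3, 2 / 3])
    # np.digitize returns 0 below the first cut, 1 between the cuts,
    # and 2 at or above the second cut.
    return np.digitize(values, cuts), cuts

# Hypothetical ages; cut-points come from the control group only.
control_ages = np.arange(1, 10)   # 1, 2, ..., 9
strata, cuts = tertile_strata([2, 5, 8], control_ages)
```

Taking the cut-points from controls only keeps the strata roughly balanced among controls, while cases may fall unevenly across them when age is associated with disease risk.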

In this paper, we developed methods for assessing a biomarker's classification accuracy for a binary disease, as characterized by the area under the ROC curve, in two-phase sampling designs. Finite-population stratified sampling of biomarkers from a phase-one cohort representative of the target population has become increasingly common in recent years with the availability of large clinical trials and cohort studies, but statistical methods that properly handle the design components when assessing biomarkers are not yet well developed. The empirical area under the ROC curve estimated from the phase-two biomarker samples has often been reported in the applied literature even in the presence of biased sampling of cases and/or controls. We showed in this paper that empirical estimators of AUC that ignore the biased sampling scheme can lead to severely biased estimates of classification accuracy and invalid inference for comparing biomarkers. We investigated an inverse-sampling-probability-weighted estimator that achieves unbiased estimation of AUC and developed asymptotic variance formulae applicable to inference in finite-population stratified sampling, through its connection with the IPW estimator in the standard Bernoulli sampling design. The analytical variance formulae we developed will provide valuable guidance to biomarker researchers on optimizing the sampling scheme when designing future biomarker studies, in order to achieve better efficiency in evaluating biomarkers for classification accuracy. For Bernoulli sampling, we observed analytically and numerically that using estimated sampling weights is more efficient even when the true sampling weights are known by design (Web Supplementary Appendix A–C, F–H, see supplementary material available at *Biostatistics* online). In particular, using estimated weights can lead to much improved efficiency for estimating the performance of a single marker. In contrast, the improvement due to weight estimation is relatively minor for comparing the performance of paired markers; when the sample size is small, the CI of the estimator based on known sampling weights can have slightly better coverage than that based on estimated weights.

In this paper, we considered simple design-based weights, the estimation of which requires no information beyond the sampling strata. In etiology studies, it has been shown that when auxiliary variables are available from a phase-one cohort, they can be used to further adjust the weights for potential efficiency gains in estimating the disease odds ratio or hazard ratio (Breslow *and others*, 2009). It would be interesting in future work to investigate the impact of weight adjustments using auxiliary variables on biomarker performance measures such as the AUC. Finally, the inverse-probability-weighting methods can be naturally applied to other performance measures such as points on the ROC curve and the partial area under the ROC curve.
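To make the last remark concrete, the sketch below shows one way inverse probability weights might carry over to a single point on the ROC curve: the true and false positive rates at a threshold are computed as weighted proportions among cases and controls. The function name and the "marker above threshold is positive" rule are assumptions for illustration; the paper does not give this estimator explicitly, and no inference procedure is implied.

```python
def weighted_roc_point(marker, disease, weight, threshold):
    """Return (FPR, TPR) at a threshold, weighting each sampled subject
    by its inverse sampling probability.

    Hypothetical helper: a subject is called positive when its marker
    value is strictly above the threshold.
    """
    case_total = case_pos = ctrl_total = ctrl_pos = 0.0
    for y, d, w in zip(marker, disease, weight):
        if d == 1:
            case_total += w
            if y > threshold:
                case_pos += w
        else:
            ctrl_total += w
            if y > threshold:
                ctrl_pos += w
    return ctrl_pos / ctrl_total, case_pos / case_total
```

Sweeping the threshold over the observed marker values would trace out a weighted ROC curve, from which a partial area could be obtained by numerical integration over the false positive range of interest.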

Supplementary Material is available at http://biostatistics.oxfordjournals.org.

This work was supported by the U.S. National Institutes of Health grants R01 GM106177-01 and R01 GM54438. Funding to pay the Open Access publication charges for this article was provided by NIH grant R01 106177.

*Conflict of Interest:* None declared.

- Bamber D. (1975). The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. Journal of Mathematical Psychology 12(4), 387–415.
- Breslow N. E., Lumley T., Ballantyne C. M., Chambless L. E., Kulich M. (2009). Improved Horvitz–Thompson estimation of model parameters from two-phase stratified samples: applications in epidemiology. Statistics in Biosciences 1(1), 32–49.
- Breslow N. E., Wellner J. A. (2007). Weighted likelihood for semiparametric models and two-phase stratified samples, with application to Cox regression. Scandinavian Journal of Statistics 34(1), 86–102.
- Cai T., Zheng Y. (2011). Evaluating prognostic accuracy of biomarkers in nested case–control studies. Biostatistics 13(1), 89–100.
- Cochran W. G. (2007). Sampling Techniques. John Wiley & Sons, New York.
- DeLong E. R., DeLong D. M., Clarke-Pearson D. L. (1988). Comparing the areas under two or more correlated ROC curves: a nonparametric approach. Biometrics 44, 837–845.
- Deras I. L., Aubin S. M. J., Blase A., Day J. R., Koo S., Partin A. W., Ellis W. J., Marks L. S., Fradet Y., Rittenhouse H. *and others* (2008). PCA3: a molecular urine assay for predicting prostate biopsy outcome. The Journal of Urology 179(4), 1587–1592.
- Hanley J. A., McNeil B. J. (1983). A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology 148(3), 839–843.
- Haynes B. F., Gilbert P. B., McElrath M. J., Zolla-Pazner S., Tomaras G. D., Alam S. M., Evans D. T., Montefiori D. C., Karnasuta C., Sutthent R. *and others* (2012). Immune-correlates analysis of an HIV-1 vaccine efficacy trial. New England Journal of Medicine 366(14), 1275–1286.
- He H., Lyness J. M., McDermott M. P. (2009). Direct estimation of the area under the ROC curve in the presence of verification bias. Statistics in Medicine 28(3), 361–376.
- Horvitz D. G., Thompson D. J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association 47(260), 663–685.
- Janes H., Pepe M. S. (2009). Adjusting for covariate effects on classification accuracy using the covariate-adjusted ROC curve. Biometrika 96(2), 371–382.
- Manski C. F., Lerman S. R. (1977). The estimation of choice probabilities from choice based samples. Econometrica 45(8), 1977–1988.
- Neyman J. (1938). Contribution to the theory of sampling human populations. Journal of the American Statistical Association 33(201), 101–116.
- Obuchowski N. A., McClish D. K. (1997). Sample size determination for diagnostic accuracy studies involving binormal ROC curve indices. Statistics in Medicine 16(13), 1529–1542.
- Pepe M. S., Fan J., Seymour C. W., Li C., Huang Y., Feng Z. (2012). Biases introduced by choosing controls to match risk factors of cases in biomarker research. Clinical Chemistry 58(8), 1242–1251.
- Pepe M. S., Feng Z., Janes H., Bossuyt P. M., Potter J. D. (2008). Pivotal evaluation of the accuracy of a biomarker used for classification or prediction: standards for study design. Journal of the National Cancer Institute 100(20), 1432–1438.
- Robin X., Turck N., Hainard A., Tiberti N., Lisacek F., Sanchez J., Müller M. (2011). pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics 12(1), 77.
- Robins J. M., Rotnitzky A., Zhao L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association 89(427), 846–866.
- Wieand S., Gail M. H., James B. R., James K. L. (1989). A family of nonparametric statistics for comparing diagnostic markers with paired or unpaired data. Biometrika 76(3), 585–592.

Articles from Biostatistics (Oxford, England) are provided here courtesy of **Oxford University Press**
