Suppose that on each subject *i* of a random sample of *n* patients, we measure a diagnostic test result *Y*_{i}, which can be continuous or ordinal, and a covariate vector *V*_{i} of health and demographic variables. Let *X*_{i} be the indicator of the true disease status of patient *i*: *X*_{i} = 1 if patient *i* has the disease of concern, and *X*_{i} = 0 otherwise. Suppose that *X*_{i} is missing in a subset of the study participants because not all subjects undergo the definitive assessment of disease. Let *R*_{i} = 1 if patient *i* is sent to verification (i.e. if *X*_{i} is observed) and *R*_{i} = 0 otherwise. The vectors (*X*_{i}, *Y*_{i}, *V*_{i}, *R*_{i}), *i* = 1, ..., *n*, are independent identically distributed (i.i.d.) copies of a random vector (*X, Y, V, R*). Under our model the observed data consist of *n* i.i.d. copies *O*_{i} of the random vector *O* = (*R, c* (*R, X, Y, V*)), where *c* (1, *X, Y, V*) = (*X, Y, V*) and *c* (0, *X, Y, V*) = (*Y, V*).

Suppose that a subject *i* were to be classified as diseased if *T*_{i} = *I* (*Y*_{i} > *c*) is equal to 1 and as non-diseased if *T*_{i} = 0, for a given constant *c*. The sensitivity, *Se*(*c*) = Pr(*Y > c*|*X* = 1), and specificity, *Sp*(*c*) = Pr(*Y < c*|*X* = 0), would then be the probabilities of correctly diagnosing a diseased and a healthy person, respectively. For notational convenience, in what follows the argument *c* will on occasion be dropped. The theoretical ROC curve is the plot of sensitivity against 1 − specificity for all possible values of *c*. Consequently, estimation of the ROC curve follows immediately from estimation of the *θ*(*c*) = (*Se*(*c*), *Sp*(*c*)) pairs. Supposing that all patients are verified (i.e. *R*_{i} = 1 for all *i*), the empirical estimator

*Ŝe* (*c*) = Σ_{i} *T*_{i} *X*_{i} / Σ_{i} *X*_{i},  *Ŝp* (*c*) = Σ_{i} (1 − *T*_{i}) (1 − *X*_{i}) / Σ_{i} (1 − *X*_{i})   (2.1)

is consistent for *θ*(*c*). If not all subjects undergo verification, i.e. if *R*_{i} is not 1 for all *i*, but the process of selection to verification is completely random, i.e. if *R*_{i} and *X*_{i} are independent, then the estimator (2.1) calculated from the verified subjects only remains consistent for *θ*(*c*). However, if *R*_{i} and *X*_{i} are correlated, then this naive estimator will yield inconsistent estimates of *θ*.
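With complete verification, the estimator (2.1) is just the empirical sensitivity and specificity at the cutoff *c*. A minimal sketch (our own illustration, with made-up toy data):

```python
# Hypothetical illustration of the empirical estimator of (Se(c), Sp(c))
# when every subject is verified (all R_i = 1). Data values are made up.

def empirical_theta(y, x, c):
    """Return (Se_hat(c), Sp_hat(c)) from marker values y and true status x."""
    t = [1 if yi > c else 0 for yi in y]           # T_i = I(Y_i > c)
    n_dis = sum(x)                                  # number of diseased
    n_hea = len(x) - n_dis                          # number of non-diseased
    se = sum(ti * xi for ti, xi in zip(t, x)) / n_dis
    sp = sum((1 - ti) * (1 - xi) for ti, xi in zip(t, x)) / n_hea
    return se, sp

# toy data: diseased subjects tend to have larger marker values
y = [0.1, 0.4, 0.9, 1.2, 0.3, 1.5]
x = [0,   0,   1,   1,   0,   1]
print(empirical_theta(y, x, 0.5))   # -> (1.0, 1.0): c = 0.5 separates perfectly
```

Sweeping *c* over the observed marker values traces out the empirical ROC curve.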

The key issue that complicates the analysis is that both *Se*(*c*) and *Sp*(*c*) are not identified from the observed data *O*_{i}, *i* = 1, ..., *n*. That is, there exist many curves *Se*(*c*) and *Sp*(*c*), *c* ∈ **R**, compatible with the distribution of *O*_{i}, so that even in the ideal situation in which the distribution of the observed data were perfectly known, there would remain uncertainty about the true ROC curve. To identify the ROC curve and associated summary measures, one needs to make untestable assumptions, i.e. assumptions that cannot be rejected by any statistical test no matter how large *n* is.

Rotnitzky *et al.* (2006) argued that the area under the ROC curve is identified under the following untestable assumption:

Pr (*R* = 1 | *X, Y, V*) = (1 + exp {*h* (*Y, V*) + *q* (*Y, V*) *X*})^{−1},   (2.2)

where *q* (*Y, V*) is an arbitrary specified (i.e. known) function and *h* (*Y, V*) is an arbitrary but unknown function. Indeed, this assumption also suffices to identify *θ*(*c*) for any fixed *c*.

Application of Bayes rule shows that (2.2) holds for some *h* (*Y, V*) if and only if

Pr (*X* = 1 | *R* = 0, *Y, V*) = exp {*q* (*Y, V*)} Pr (*X* = 1 | *R* = 1, *Y, V*) / *c* (*Y, V*),   (2.3)

where *c* (*Y, V*) = *E* {exp (*q* (*Y, V*) *X*) | *R* = 1, *Y, V*} (Rotnitzky *et al.*, 2006). The function *q* (*Y, V*) measures the degree of residual association between *R* and *X* within levels of *Y* and *V*. For example, the choice *q* (*Y, V*) = −1 indicates that the residual association is constant within levels of (*Y, V*) and such that, among subjects with identical values of (*Y, V*), the odds of being sent to verification are exp (1) ≈ 2.72 times larger for diseased subjects than for non-diseased subjects. The choice *q* (*Y, V*) = 0 for all *V* and *Y* corresponds to the assumption that the missing disease status is missing at random. In general, when *q* (*Y, V*) ≠ 0, the selection process is said to be non-ignorable.
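The odds interpretation of *q* can be made concrete: within a level of (*Y, V*), a specified *q* converts the disease probability among verified subjects, *p*_{1} = Pr(*X* = 1|*R* = 1, *Y, V*), into the implied disease probability among unverified subjects by multiplying the disease odds by exp (*q*). A small sketch (our own illustration; the numbers are hypothetical):

```python
import math

# Our own sketch of how a specified selection-bias value q shifts the disease
# probability among unverified subjects within a level of (Y, V).
# p1 plays the role of Pr(X = 1 | R = 1, Y, V); all numbers are hypothetical.

def p_disease_unverified(p1, q):
    """Disease odds among unverified subjects equal exp(q) times the
    verified-subject odds, so the probability is e^q p1 / (1 - p1 + e^q p1)."""
    return math.exp(q) * p1 / (1 - p1 + math.exp(q) * p1)

p1 = 0.30
print(p_disease_unverified(p1, 0.0))    # q = 0 (MAR): unchanged, 0.30
print(p_disease_unverified(p1, -1.0))   # diseased more likely verified: lower
print(p_disease_unverified(p1, 1.0))    # diseased less likely verified: higher
```

Note that the denominator 1 − *p*_{1} + e^{q} *p*_{1} is exactly *c* (*Y, V*) = *E* {exp (*qX*) | *R* = 1, *Y, V*} for a binary *X*.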

We refer to the model that assumes just (2.2) with specified *q* (*Y, V*) and *h* (*Y, V*) unknown (or, equivalently, equation (2.3) with specified *q* (*Y, V*)) as model *A* (*q*). Arguing as in Scharfstein *et al.* (1999), it can be shown that imposing a given choice of *q* does not restrict the observed data distribution. That is, model *A* (*q*) is a non-parametric model for the distribution of *O*. Since all values of the selection bias parameter determine the same model for the observed data distribution, the selection bias parameter is untestable: any *q* will fit the observed data perfectly. Following Robins *et al.* (2000), we therefore recommend conducting inference about *θ*(*c*) by repeating the estimation under different plausible choices of the selection bias function *q*, as a form of sensitivity analysis to different degrees of selection bias. This raises the question of how to choose the selection bias functions in practice. We suggest choosing a collection of simple selection bias functions indexed by one or two parameters *β* that are varied in the sensitivity analysis, with parameter values of zero corresponding to the function *q* = 0. For example, the choice *q* (*Y, V*) = *β* is tantamount to assuming that the odds of verification of diseased vs non-diseased subjects are constant across all values of the marker and the auxiliary *V* and equal to exp (−*β*). This is the choice used in our simulation study in Section 4. One could consider the choice *q* (*Y, V*) = *β*_{0} + *β*_{1}*Y* if the investigator believed that the odds of verification of diseased vs non-diseased subjects are modified by the value of *Y*, this modification being monotone in *Y*.

The identity (2.3) implies that

*E* (*X* | *Y, V, R*) = {exp (*q* (*Y, V*)) / *c* (*Y, V*)}^{1−*R*} *E* (*X* | *Y, V, R* = 1).   (2.4)

Consequently, writing *Se*(*c*) as *E* {*E* (*X|Y, V, R*) *T*} / *E* {*E* (*X|Y, V, R*)} and *Sp*(*c*) as *E* {[1 − *E* (*X|Y, V, R*)] (1 − *T*)} / *E* {[1 − *E* (*X|Y, V, R*)]}, we obtain that under model *A* (*q*)

*Se*(*c*) = *E* [{exp (*q*) / *c*}^{1−*R*} *E* (*X* | *Y, V, R* = 1) *T*] / *E* [{exp (*q*) / *c*}^{1−*R*} *E* (*X* | *Y, V, R* = 1)]

and

*Sp*(*c*) = *E* [{1 − {exp (*q*) / *c*}^{1−*R*} *E* (*X* | *Y, V, R* = 1)} (1 − *T*)] / *E* [1 − {exp (*q*) / *c*}^{1−*R*} *E* (*X* | *Y, V, R* = 1)].   (2.5)

Likewise, letting *π* (*Y, V*) denote the quantity (1 + exp {*h* (*Y, V*) + *q* (*Y, V*) *X*})^{−1}, the identity (2.2) implies that *P* (*R* = 1|*X, Y, V*) = *π* (*Y, V*). Consequently, under model *A* (*q*), *Se*(*c*) and *Sp*(*c*) can also be expressed as

*Se*(*c*) = *E* {*R X T* / *π* (*Y, V*)} / *E* {*R X* / *π* (*Y, V*)},  *Sp*(*c*) = *E* {*R* (1 − *X*) (1 − *T*) / *π* (*Y, V*)} / *E* {*R* (1 − *X*) / *π* (*Y, V*)}.   (2.6)

Model *A* (*q*) implicitly assumes that all subjects have a positive probability of being selected for verification, regardless of the values of their test results and covariates. See Rotnitzky *et al.* (2006) for a discussion of the possibility of relaxing this condition. For technical reasons related to the finiteness of the variance of the estimators that we propose later, we further need to assume condition (2.7).

Unfortunately, though sufficient for identification, model *A* (*q*) is insufficient for estimation of *θ*(*c*) due to the curse of dimensionality. Specifically, expressions (2.4), (2.5) and (2.6) imply that if one were to estimate *θ*(*c*) under model *A* (*q*), one would need to estimate non-parametrically either the function *h* (*Y, V*) or the conditional expectation *E* (*X|Y, V, R* = 1) using smoothing techniques, a practically unfeasible task when the dimension of *V* is large. Nevertheless, the expressions (2.4), (2.5) and (2.6) suggest two alternative dimension-reducing strategies for estimating *θ*(*c*).

The first option is to assume that the unknown function *h* (*Y, V*) follows a parametric model

*h* (*Y, V*) = *h* (*Y, V*; *γ*^{0}),   (2.8)

where *h* (*Y, V*; *γ*) is smooth in *γ* and *γ*^{0} is an unknown *p*_{h} × 1 finite dimensional parameter. Assumption (2.8) and assumption (2.2) with specified *q* (*Y, V*) define a semiparametric model for the observed data which, for ease of reference, we call model *B* (*q*). Note that this model imposes just a, possibly non-linear, logistic regression model on the selection probabilities with offset *q* (*Y, V*) *X*. Expressions (2.6) imply that one can obtain a consistent and asymptotically normal estimator of *θ*(*c*), throughout referred to as the inverse probability weighted (IPW) estimator, under model *B* (*q*) by replacing *X*_{i} in (2.1) with *R*_{i} *X*_{i} / *π* (*Y*_{i}, *V*_{i}; γ̂), where *π* (*Y, V*; *γ*) = (1 + exp {*h* (*Y, V*; *γ*) + *q* (*Y, V*) *X*})^{−1} and γ̂ is a consistent estimator of *γ*^{0}. Unfortunately, even though under model *B* (*q*), *R* follows a logistic regression on *Y*, *X* and *V*, we cannot obtain γ̂ from a standard logistic regression fit with offset, because the offset *q* (*Y, V*) *X* is not observed when *R* = 0. Instead, following Rotnitzky *et al.* (2006), we can compute γ̂ as the solution to

Σ_{i=1}^{n} *B* (*γ*; *u*) (*O*_{i}) = 0,   (2.9)

where

*B* (*γ*; *u*) (*O*) = *u* (*Y, V*) [*R* {1 + exp (*h* (*Y, V*; *γ*) + *q* (*Y, V*) *X*)} − 1]

and *u* (*Y, V*) is an arbitrary, user-specified, column vector function of the same dimension as *γ*. The choice of *u* affects the efficiency with which we estimate *γ*. Using the theoretical results in Rotnitzky and Robins (1997), it can be shown that if *u* (*Y, V*) is equal to the optimal choice *u*_{opt} (*Y, V*) defined in (2.10), then the solution γ̂ of (2.9) has asymptotic variance equal to the semiparametric variance bound for estimators of *γ* under model *B* (*q*). Note, in particular, that for the choice *q* (*Y, V*) = 0 and *u* = *u*_{opt}, *B* (*γ*; *u*) at *γ*^{0} is equal to the score, i.e. the derivative of the log-likelihood for *γ*, in the logistic regression model logit *P* (*R* = 1|*Y, V*) = −*h* (*Y, V*; *γ*).
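To see why an equation of the form (2.9) is computable despite the unobserved offset, note that under the selection model Pr (*R* = 1|*X, Y, V*) = (1 + exp {*h* + *qX*})^{−1}, the variable *R* {1 + exp (*h* + *qX*)} − 1 has conditional mean zero and involves *X* only through terms multiplied by *R*, so it can be evaluated from the observed data. The following toy sketch (our own construction, not the paper's software) uses an intercept-only *h* (*Y, V*; *γ*) = *γ*, scalar *u* = 1 and *q* = *β* = −1, solves the equation by bisection, and then forms IPW estimates:

```python
import math, random

# Toy sketch (our own construction) of estimating the selection-model
# parameter when the offset q(Y,V)X is unobserved for R = 0. Under
# Pr(R=1|X,Y,V) = 1/(1+exp{h+qX}), the variable R*{1+exp(h+qX)} - 1 has
# conditional mean zero, and exp(qX) is needed only when R = 1 (X observed).
# Intercept-only h(Y,V;gamma) = gamma, u = 1, q(Y,V) = beta = -1.

random.seed(1)
n, beta, gamma_true = 5000, -1.0, -1.0

X = [1 if random.random() < 0.3 else 0 for _ in range(n)]
Y = [random.gauss(1.0 if xi else 0.0, 1.0) for xi in X]             # marker
R = [1 if random.random() < 1 / (1 + math.exp(gamma_true + beta * xi)) else 0
     for xi in X]                                                   # verification

def estfun(g):
    # sum over subjects of u * [R{1 + exp(h + qX)} - 1]; increasing in g
    return sum(r * (1 + math.exp(g + beta * x)) - 1 for r, x in zip(R, X))

lo, hi = -10.0, 10.0                      # solve estfun(g) = 0 by bisection
for _ in range(100):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if estfun(mid) < 0 else (lo, mid)
gamma_hat = (lo + hi) / 2

# IPW: replace X_i in the empirical estimator by R_i X_i / pi_hat_i, with
# pi_hat_i evaluated at the observed X_i (the term is zero when R_i = 0).
w = [r * x * (1 + math.exp(gamma_hat + beta * x)) for r, x in zip(R, X)]
prev_ipw = sum(w) / n                     # IPW estimate of Pr(X = 1)
c = 0.5
se_ipw = sum(wi for wi, yi in zip(w, Y) if yi > c) / sum(w)
print(round(gamma_hat, 2), round(prev_ipw, 3), round(se_ipw, 3))
```

With this seed, gamma_hat lands near the true value −1 and the weighted prevalence near 0.3, even though diseased subjects are over-represented among the verified.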

A second option is to assume that *E* (*X|Y, V, R* = 1) = Pr (*X* = 1|*R* = 1, *Y, V*) follows a parametric model

logit Pr (*X* = 1|*R* = 1, *Y, V*) = *m* (*Y, V*; *μ*^{0}),   (2.11)

where *m* (*Y, V*; *μ*) is a known function, smooth in *μ*, and *μ*^{0} is an unknown *p*_{m} × 1 parameter vector. Assumption (2.11) is a model for the disease probabilities among the verified subjects. The model defined by (2.3) with specified *q* (*Y, V*) and (2.11) is a semiparametric model for the law of the observed data, which we denote model *C* (*q*). The semi-parametric maximum likelihood (SPML) estimator of *θ*(*c*) under model *C* (*q*) is obtained by replacing each *X*_{i} in (2.1) with *R*_{i} *X*_{i} + (1 − *R*_{i}) Pr (*X* = 1|*R* = 0, *Y*_{i}, *V*_{i}; μ̂), where Pr (*X* = 1|*R* = 0, *Y, V*; *μ*) is obtained from the identity (2.3) evaluated at Pr (*X* = 1|*R* = 1, *Y, V*; *μ*) = exp {*m* (*Y, V*; *μ*)} / [1 + exp {*m* (*Y, V*; *μ*)}], and μ̂ solves the score equations

Σ_{i=1}^{n} *R*_{i} {∂*m* (*Y*_{i}, *V*_{i}; *μ*) / ∂*μ*} {*X*_{i} − Pr (*X* = 1|*R* = 1, *Y*_{i}, *V*_{i}; *μ*)} = 0.
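The SPML mechanics can be sketched in a deliberately crude toy setting (our own construction: an intercept-only working model *m* (*Y, V*; *μ*) = *μ*, used only to show the computations, and a specified *q* = −1; a real analysis would let *m* depend on (*Y, V*)):

```python
import math, random

# Deliberately crude toy of the SPML mechanics (our own construction):
# intercept-only working model m(Y,V;mu) = mu for the verified subjects and a
# specified selection-bias value q = -1. The point is only to show how mu_hat
# and the imputed values for the unverified subjects are built.

random.seed(2)
n, q = 4000, -1.0
X = [1 if random.random() < 0.3 else 0 for _ in range(n)]
Y = [random.gauss(0.37 * xi, 0.5) for xi in X]
# non-ignorable selection: with q = -1, diseased subjects have e times larger
# odds of being verified than non-diseased subjects
R = [1 if random.random() < 1 / (1 + math.exp(1.0 + q * xi)) else 0 for xi in X]

# for an intercept-only logistic model, the score equation just matches the
# verified-subject disease fraction, so mu_hat = logit(p1_hat)
ver_x = [xi for xi, ri in zip(X, R) if ri == 1]
p1 = sum(ver_x) / len(ver_x)                # estimate of Pr(X=1 | R=1)
mu_hat = math.log(p1 / (1 - p1))

# disease probability among the unverified implied by the specified q:
# multiply the verified-subject disease odds by exp(q)
p0 = math.exp(q) * p1 / (1 - p1 + math.exp(q) * p1)

# replace each X_i in the empirical estimator by R_i X_i + (1 - R_i) p0
xhat = [ri * xi + (1 - ri) * p0 for xi, ri in zip(X, R)]
c = 0.2
se = sum(xh for xh, yi in zip(xhat, Y) if yi > c) / sum(xhat)
sp = sum(1 - xh for xh, yi in zip(xhat, Y) if yi <= c) / sum(1 - xh for xh in xhat)
print(round(mu_hat, 2), round(p0, 3), round(se, 2), round(sp, 2))
```

Because the working model here ignores (*Y, V*), it is misspecified for this data-generating mechanism; the sketch shows the mechanics, not a recommended fit.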

Neither option is entirely satisfactory: option 1 results in inconsistent estimation if model (2.8) is incorrect, and option 2 results in inconsistent estimation if model (2.11) is misspecified. There is yet a third option, which provides an estimator consistent for *θ*(*c*) under either model *B* (*q*) or model *C* (*q*), but not necessarily both. Specifically, one constructs for each subject a pseudo-value *X̂*_{i} that depends on both the estimated selection probabilities *π* (*Y*_{i}, *V*_{i}; γ̂) and the estimated disease probabilities Pr (*X* = 1|*R* = 1, *Y*_{i}, *V*_{i}; μ̂), and defines the estimator *θ̂*_{DR} of *θ* by replacing each *X*_{i} in (2.1) with *X̂*_{i}. The estimator *θ̂*_{DR} is said to be double-robust because, as stated in Theorem 1 below, it is consistent and asymptotically normal for *θ* so long as model *A* (*q*), condition (2.7) and at least one of the models (2.11) or (2.8) hold, but not necessarily both. The key to the consistency of *θ̂*_{DR} lies in the fact that if model *A* (*q*) holds, the pseudo-values retain the correct limiting mean structure whenever either model (2.11) or model (2.8) additionally holds. In contrast to the first two estimators, IPW and SPML, the double-robust estimator *θ̂*_{DR} gives the analyst two chances, instead of only one, to obtain nearly correct inference. Of course, there can be an efficiency price for using the DR estimator. If the disease regression model (2.11) is correct, both the DR estimator and the SPML estimator are consistent, but the DR estimator will generally have larger (asymptotic) variance. Clearly, these efficiency gains of the SPML estimator over the DR estimator are obtained at the cost of potentially severe bias.
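In the MAR special case *q* = 0, the double-robust pseudo-value reduces to the familiar augmented inverse-probability-weighted form *X̂*_{i} = *R*_{i}*X*_{i}/π(*Y*_{i}) + {1 − *R*_{i}/π(*Y*_{i})} *m* (*Y*_{i}). The following toy sketch (our own construction and parameter values, not the paper's code) illustrates the two-chances property for the disease prevalence *κ* = *E* (*X*): the estimate stays near the truth when either working model is wrong, so long as the other is right.

```python
import math, random

# Double-robustness demo in the MAR special case q = 0, where the pseudo-value
# is Xhat_i = R_i X_i / pi(Y_i) + {1 - R_i / pi(Y_i)} m(Y_i).
# All model forms and numbers below are our own toy choices.
random.seed(3)

def expit(z):
    return 1 / (1 + math.exp(-z))

n = 5000
X = [1 if random.random() < 0.3 else 0 for _ in range(n)]
Y = [random.gauss(0.37 * xi, 0.5) for xi in X]
pi_true = [expit(-0.5 + 1.5 * yi) for yi in Y]        # true MAR selection
R = [1 if random.random() < p else 0 for p in pi_true]

def m_true(y):
    # Pr(X=1|Y=y) by Bayes' rule for the two-normal marker model above
    return expit(math.log(0.3 / 0.7) + (0.37 * y - 0.37 ** 2 / 2) / 0.25)

def dr_prevalence(pi, m):
    xhat = [r * x / p + (1 - r / p) * mv
            for x, r, p, mv in zip(X, R, pi, m)]
    return sum(xhat) / n

# (a) correct selection model, badly wrong disease model   -> still ~0.30
# (b) wrong (constant) selection model, correct disease model -> still ~0.30
a = dr_prevalence(pi_true, [0.5] * n)
b = dr_prevalence([0.6] * n, [m_true(yi) for yi in Y])
naive = sum(x for x, r in zip(X, R) if r) / sum(R)    # verified-only mean
print(round(a, 3), round(b, 3), round(naive, 3))
```

The verified-only mean is biased upward here because subjects with larger markers, who are more often diseased, are verified more often; either DR evaluation corrects it.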

An interesting remark is that, invoking the theoretical results of Rotnitzky and Robins (1997), it can be shown that if we use *u*_{opt} (*Y, V*) to compute γ̂, then the double-robust estimator that uses this γ̂ is locally semiparametric efficient under model *B* (*q*) at the local model *C* (*q*). This means that, if both models *B* (*q*) and *C* (*q*) are correct, *θ̂*_{DR} has asymptotic variance that attains the semiparametric variance bound for estimators of *θ*(*c*) under model *B* (*q*). This estimator is unfeasible because to compute *u*_{opt} (*Y, V*) we need to know *γ*^{0} and, when *q* (*Y, V*) ≠ 0, we also need the conditional probabilities *P* (*X* = 1|*R* = 1, *Y, V*). However, a feasible double-robust estimator that retains the local efficiency properties of the unfeasible one can be obtained with the following two-stage algorithm. At the first stage, we compute μ̂ and a preliminary estimator γ̃ that solves (2.9) using any *u* (*Y, V*). We then calculate the function û_{opt} (*Y, V*), defined like *u*_{opt} (*Y, V*) in (2.10) but with γ̃ replacing *γ*^{0} and with the expectations in the numerator and denominator computed under *P* (*X* = 1|*R* = 1, *Y, V*; μ̂). At the second stage, we compute γ̂ solving (2.9) with *u* replaced by û_{opt}, and finally compute *θ̂*_{DR} using this γ̂. This remark raises the interesting question of how much efficiency is gained by the two-stage locally efficient estimator compared with the single-stage estimator computed using the preliminary γ̃. Our simulation study in Section 4 explores this issue using the natural, easy to compute, choice *u* (*Y, V*) = ∂*h* (*Y, V*; *γ*) / ∂*γ* for the preliminary estimator γ̃. For this choice, the theoretical results of Rotnitzky and Robins (1997) imply that the single-stage and the two-stage estimators are equally asymptotically efficient if *q* (*Y, V*) = 0, but the two-stage estimator has strictly smaller asymptotic variance if *q* (*Y, V*) ≠ 0. For ease of reference, we call the single-stage estimator that uses *u* (*Y, V*) = ∂*h* (*Y, V*; *γ*) / ∂*γ* the DR1 estimator, and we call the two-stage estimator that uses this γ̃ as the preliminary estimator of *γ* the DR2 estimator.

Alonzo and Pepe (2005) (AP) derived a locally-efficient double-robust estimator that, except for a minor difference in the computation of γ̂, is computed identically to our DR2 estimator in the case *q* = 0. Note that when *q* = 0, the function *u*_{opt} (*Y, V*) reduces to *π* (*Y, V*; *γ*^{0}) ∂*h* (*Y, V*; *γ*^{0}) / ∂*γ*. The distinction in the computation of γ̂ between our procedure and that of AP is that our γ̂ solves (2.9) with *u* (*Y, V*) replaced by the estimated û_{opt} (*Y, V*), while AP's γ̂ solves (2.9) with *u* (*Y, V*) replaced by the function *u* (*Y, V*; *γ*) = *π* (*Y, V*; *γ*) × ∂*h* (*Y, V*; *γ*) / ∂*γ*, which depends on the unknown *γ*. It is easy to show that both estimators of *γ* are asymptotically equivalent and efficient. Consequently, when *q* = 0, our DR2 estimator is asymptotically equivalent to AP's estimator of *θ*.

The following theorem, whose proof is sketched in the Appendix, establishes the asymptotic properties of the estimator *θ̂*_{DR} computed using any fixed function *u* (*Y, V*). In particular, it establishes the double-robustness property of *θ̂*_{DR}. It also provides a consistent variance estimator when model *A* (*q*), condition (2.7) and either model (2.11) or model (2.8) hold. Further details can be found in Fluss (2006). To state the theorem, define *θ*_{1} = *Se* (*c*), *θ*_{2} = *Sp* (*c*) and *κ* = *E* (*X*), and let *γ** and *μ** denote the probability limits of γ̂ and μ̂. Note that under standard regularity conditions for Z-estimators (see, for example, van der Vaart (1998)), *γ** = *γ*^{0} when model (2.8) holds and *μ** = *μ*^{0} when model (2.11) holds.

**Theorem 1** *Suppose that model A* (*q*) *and* (2.7) *hold. Under standard regularity conditions for Z-estimators, if models* (2.8) *and* (2.7) *hold or if model* (2.11) *holds, then θ̂*_{DR} *is asymptotically normal with asymptotic variance* Ω = Cov (*M*)*, and* Ω *can be consistently estimated by its sample counterpart.*
2.1 Illustrative Example

To illustrate our double-robust estimator of the ROC curve, we used simulated data following the model used by Rotnitzky *et al.* (2006) in their simulation study. We first generated, for each of *n* = 200 subjects, a binary disease indicator *X* ~ *Bernoulli*(0.3), so that approximately 30% of subjects were diseased. We further simulated a continuous marker *Y* from the model *Y*|*X* = *x* ~ *N* (0.37*x*, 0.5^{2}), a binary covariate *V* = *I* (*X* − 0.3 + *ε* > 0) (and hence conditionally independent of *Y*), where *ε* ~ *N* (0, 0.3^{2}), and a response indicator *R* following the selection-for-verification model (2.8) with *h* (*Y, V*) = *γ*_{0} + *γ*_{1}*Y* + *γ*_{2}*V* and *q* (*Y, V*) = *β*, with (*γ*_{0}, *γ*_{1}, *γ*_{2}) = (3, −2, −3) and *β* = −1. The values of (*γ*_{0}, *γ*_{1}, *γ*_{2}, *β*) were chosen so that roughly 70% of the *X*_{i}'s would be missing. The covariate *V* was purposely chosen to have a large correlation (roughly 0.75) with *X*, so that the bias of the estimator based on the verified patients only would be large. The empirical DR1 ROC curves were obtained by estimating *Se* (*c*) and *Sp* (*c*) as described above for every *c* ∈ {*y*_{1}, ..., *y*_{n}}, where the *y*_{i}'s are the 200 simulated values of *Y*. The figure illustrates several estimated ROC curves from one simulated sample, along with the true (and in practice unknown) ROC curve (smooth line). For comparison we added the biased estimator (NAIVE) using formula (2.1) based on verified subjects only, and the (unfeasible) complete-data estimator (COMPLETE) using the entire sample. We also added the estimator assuming missing at random (MAR), obtained with *q* = 0. Notice the large bias of the NAIVE curve, whereas the DR1 curve is close to both the complete-data estimator and the true ROC curve. DR2 is very similar to DR1 and is not shown. The MAR curve is farther from the true ROC curve than DR1.
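The data-generating design just described can be sketched as follows (our own paraphrase for a single sample; the DR fitting and ROC plotting are omitted):

```python
import math, random

# Sketch of the simulation design described above (n = 200; DR estimation and
# plotting omitted). With these parameter values, verification is strongly
# related to disease, so the verified-only (NAIVE) estimator is biased.
random.seed(4)
n = 200
g0, g1, g2, beta = 3.0, -2.0, -3.0, -1.0

X = [1 if random.random() < 0.3 else 0 for _ in range(n)]
Y = [random.gauss(0.37 * xi, 0.5) for xi in X]
V = [1 if xi - 0.3 + random.gauss(0, 0.3) > 0 else 0 for xi in X]
# Pr(R=1|X,Y,V) = 1/(1 + exp{h(Y,V) + qX}) with h = g0 + g1*Y + g2*V, q = beta
R = [1 if random.random() < 1 / (1 + math.exp(g0 + g1 * yi + g2 * vi + beta * xi))
     else 0
     for xi, yi, vi in zip(X, Y, V)]

def se_hat(pairs, c):
    # empirical sensitivity at cutoff c from (status, marker) pairs
    num = sum(1 for xi, yi in pairs if xi == 1 and yi > c)
    den = sum(1 for xi, yi in pairs if xi == 1)
    return num / den

c = 0.2
complete = se_hat(list(zip(X, Y)), c)                          # unfeasible
naive = se_hat([(x, y) for x, y, r in zip(X, Y, R) if r], c)   # verified only
print(round(complete, 2), round(naive, 2), sum(R) / n)
```

Even in one sample of *n* = 200, the verified-only estimate can differ noticeably from the complete-data one; that gap is the bias the DR estimators are designed to remove.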