Suppose that on each subject i of a random sample of n patients, we measure a diagnostic test result Yi which can be continuous or ordinal, and a covariate vector Vi of health and demographic variables. Let Xi be the indicator of true disease status of patient i, Xi = 1 if patient i has the disease of concern, and Xi = 0 otherwise. Suppose that Xi is missing in a subset of the study participants because not all subjects undergo the definitive assessment of disease. Let Ri = 1 if patient i is sent to verification (i.e. if Xi is observed) and Ri = 0 otherwise. The vectors (Xi, Yi, Vi, Ri), i = 1, ..., n are independent identically distributed (i.i.d) copies of a random vector (X, Y, V, R). Under our model the observed data are comprised of n i.i.d copies Oi of the random vector O = (R, c (R, X, Y, V)) where c (1, X, Y, V) = (X, Y, V) and c (0, X, Y, V) = (Y, V).
Suppose that a subject
i were to be classified as diseased if
Ti =
I (
Yi > c) is equal to 1 and as non-diseased if
Ti = 0, for a given constant
c. The sensitivity,
Se(
c) = Pr(
Y > c|X = 1), and specificity,
Sp(
c) = Pr(
Y < c|X = 0), would then be the probability of diagnosing a diseased and a healthy person correctly, respectively. For notational convenience, in what follows the
c will on occasions be dropped. The theoretical ROC curve is the plot of sensitivity against 1-specificity for all possible values of c. Consequently estimation of the ROC curve follows immediately from estimation of the
θ(
c) = (
Se(
c),
Sp(
c)) pairs. Supposing that all patients are verified (i.e.
Ri = 1), the empirical estimator
is consistent for
θ(
c). If not all subjects undergo verification, i.e. if
Ri is not 1 for all
i, but the process of selection to verification is completely random, i.e. if
Ri and
Xi are independent, then

calculated from verified subjects only remains consistent for
θ(
c). However, if
Ri and
Xi are correlated, then this naive estimator will lead to inconsistent estimates of
θ.
The key issue that complicates the analysis is that both
Se(
c) and
Sp(
c) are unidentified from the observed data
Oi, i = 1, ...,
n. That is, there exist many curves
Se(
c) and
Sp(
c),
c
R, compatible with the distribution of
Oi, so that even in the ideal situation in which the distribution of the observed data were perfectly known, there would remain uncertainty about the true ROC curve. To identify the ROC curve and associated summary measures one needs to make untestable assumptions, i.e. assumptions that cannot be rejected by any statistical test no matter how large
n is.
Rotnitzky et al. (2006) argued that the area under the ROC curve is identified under the following untestable assumption,
where
q (
Y, V) is an arbitrary specified (i.e. known) function and
h (
Y, V) is an arbitrary but unknown function. Indeed, this assumption suffices also to identify
θ(
c) for any fixed
c.
Application of Bayes rule shows that (
2.2) holds for some
h (
Y, V) if and only if
where
c (
Y, V) =
E {exp (
q (
Y, V)
X) |
R = 1,
Y, V} (
Rotnitzky et al., 2006). The function
q (
Y, V) measures the degree of the residual association between
R and
X within levels of
Y and
V. For example, the choice
q (
Y, V) = −1 indicates that the residual association is constant within levels of (
Y, V) and such that among subjects with identical values of (
Y, V), the odds of being sent to verification is exp (1) = 2.71 times larger for diseased subjects than for non-diseased subjects. The choice
q (
Y, V) = 0 for all
V and
Y corresponds to the assumption that the missing disease status is missing at random. In general, when
q (
Y, V) ≠ 0, the selection process is said to be non-ignorable.
We refer to the model that assumes just (
2.2) with specified
q (
Y, V) and
h (
Y, V) unknown (or equivalently
equation (2.3) with specified
q (
Y, V)) as model

. Arguing as in
Scharfstein et al. (1999), it can be shown that imposing a given choice of
q does not restrict the observed data distribution. That is, model

is a non-parametric model for the distribution of
O. Since all values of the selection bias parameter determine the same model for the observed data distribution, the selection bias parameter is untestable, because any
q will perfectly fit the observed data. Following
Robins et al. (2000) we therefore recommend conducting inference about
θ(
c) by repeating the estimation under different plausible choices for the selection bias function
q as a form of sensitivity analysis to different degrees of selection bias. This raises the question of how to choose the selection bias functions in practice. We suggest that one chooses a collection of simple selection bias functions indexed by one or two parameters
β that are to be varied in a sensitivity analysis, with values of the parameters equal to zero corresponding to the function
q = 0. For example, the choice
q (
Y, V) =
β is tantamount to assuming that the odds of verification of diseased vs non-diseased subjects is constant across all values of the marker and the auxiliary
V and equal to exp (–
β). This is the choice used in our simulation study in Section 4. One could consider the choice
q (
Y, V) =
β0 +
β1Y if the investigator believed that the odds of verification of diseased vs non-diseased subjects is modified by the values of
Y, this modification being monotone in
Y.
The identity (
2.3) implies that
Consequently, writing
Se(
c) as
E {
E (
X|Y, V, R)
T} /
E {
E (
X|Y, V, R)} and
Sp(
c) as
E {[1 –
E (
X|Y, V, R)] (1 –
T)} /
E {[1 –
E (
X|Y, V, R)]} we obtain that under model
and
Likewise, letting
π (
Y, V) denote the quantity (1 + exp {
h (
Y, V) +
q (
Y, V)
X})
−1, the identity (
2.2) implies that
P (
R = 1|
X, Y, V) =
π (
Y, V). Consequently, under modelg

,
Se(
c) and
Sp(
c) can be expressed also as
Model

implicitly assumes that all subjects have a positive probability of being selected for verification regardless of the values of their test results and covariates. See
Rotnitzky et al. (2006) for a discussion of the possibility of relaxing this condition. For technical reasons related to the finiteness of the variance of the estimators that we will propose later, we further need to assume that
Unfortunately, though sufficient for identification, model

is insufficient for estimation of
θ(
c) due to the curse of dimensionality. Specifically, expressions (
2.4), (
2.5) and (
2.6) imply that if one were to estimate
θ(
c) under model

, one would need to non-parametrically estimate either the function
h (
Y, V) or the conditional expectation
E (
X|Y, V, R = 1) using smoothing techniques, a practically unfeasible task when the dimension of
V is large. Nevertheless, the expressions (
2.4), (
2.5) and (
2.6) suggest two alternative dimension reducing strategies for estimating
θ(
c).
The first option is to assume that the unknown function
h (
Y, V) follows a parametric model
where
h (
Y, V;
γ) is smooth in
γ and
γ0 is an unknown
ph × 1 finite dimensional parameter. Assumption (
2.8) and assumption (
2.2) with specified
q (
Y, V) define a semiparametric model for the observed data which for ease of reference we refer to as

. Note that this model imposes just a, possibly non-linear, logistic regression model on the selection probabilities with offset
q (
Y, V)
X. Expressions (
2.6) imply that one can obtain a consistent and asymptotically normal estimator estimator of
θ(
c), throughout referred to as inverse probability weighted (IPW) estimator, under model
B (
q) by replacing
Xi in (
2.1) with
Xi, where
π (
Y, V;
γ) = (1 + exp {
h (
Y, V;
γ) +
q (
Y, V)
X})
−1 and

is a consistent estimator of
γ0. Unfortunately, even though under model

,
R follows a logistic regression on
Y, X and
V, we cannot find

from a standard logistic regression fit with offset because the offset
q (
Y, V)
X is not observed when
R = 0. Instead, following
Rotnitzky et al. (2006) we can compute

as the solution to
where

and
u (
Y, V) is an arbitrary, user specified, column vector function of the same dimension as
γ. The choice of
u impacts the efficiency with which we estimate
γ. Using the theoretical results in
Rotnitzky and Robins (1997), it can be shown that if
u (
Y, V) is equal to
then the solution

of (
2.9) has asymptotic variance equal to the semiparametric variance bound for estimators of
γ under model

. Note, in particular, that for the choice
q (
Y, V) = 0,
B (
γ;
u) reduces to

which, at
γ0, is equal to the score, i.e. the derivative of the log-likelihood for
γ, in the logistic regression model logit
P (
R = 1|
Y, V) = –
h (
Y, V;
γ).
A second option is to assume that
E (
X|Y, V, R = 1) = Pr (
X = 1|
R = 1,
Y, V) follows a parametric model
where
m (
Y, V;
μ) is a known function, smooth in
μ, and
μ0 is an unknown
pm × 1 parameter vector. Assumption (
2.11) is a model for the disease probabilities among the verified subjects. The model defined by (
2.3) with specified
q (
Y, V) and (
2.11) is a semiparametric model for the law of the observed data which we will denote with

. The Semi-Parametric Maximum Likelihood (SPML) estimator of
θ(
c) under model

is obtained by replacing each
Xi in (
2.1) with

, where
with Pr (
X = 1|
R = 1,
Y, V;
μ) = exp {
m (
Y, V;
μ)} / [1 + exp {
m (
Y, V;
μ)}], and

solves the score equations
with

.
Neither option is entirely satisfactory because option 1 results in inconsistent estimation if model

is incorrect and option 2 results in inconsistent estimation if model

is misspecified. There is yet a third option which provides an estimator consistent for
θ(
c) under either model

or model

but not necessarily both. Specifically, define
where
Let

be the estimator of
θ obtained by replacing each
Xi in (
2.1) with

. That is,
The estimator

is said to be double-robust because, as stated in Theorem 1 below, it is consistent and asymptotically normal for
θ so long as model

, condition (
2.7) and one of the models (
2.11) or (
2.8) holds, but not necessarily both. The key to the consistency of

lies in the fact that if model

holds then, if model (
2.11) additionally holds,
or if model (
2.8) additionally holds,
In contrast to the first two estimators IPW and SPML, the double-robust estimator

gives the analyst two chances, instead of only one, to get nearly correct inference. Of course, there can be an efficiency price to using the DR estimator. If the disease regression model (
2.11) is correct both the DR estimator and the SPML estimator will be consistent but the DR estimator will generally have larger (asymptotic) variance. Clearly, the efficiency gains over

are obtained at the cost of potential severe bias.
An interesting remark is that, invoking the theoretical results of
Rotnitzky and Robins, 1997, it can be shown that if we use
uopt (
Y, V) to compute

then the

that uses this

is locally semiparametric efficient under model

at the local model

. This means that, if both models

and

are correct,

has asymptotic variance that attains the semiparametric variance bound for estimators bof
θ(
c) under model

. This

is unfeasible because to compute
uopt (
Y, V) we need to know
γ0 and, when
qB(
Y, V) ≠ 0, we also need the conditional probabilities
P (
X = 1|
R = 1,
Y, V). However, a feasible double-robust estimator that retains the local efficiency properties of the unfeasible

can be obtained with the following two-stage algorithm. At the first stage we compute

and a preliminary estimator

that solves (
2.9) using any
u (
Y, V). We then calculate the function

(
Y, V) defined like
uopt (
Y, V) in (
2.10) but with

replacing
γ0 and with the bexpectations in the numerator and denominator computed under
P (
X = 1|
R = 1,
Y, V;

). At the second stage, we compute

solving (
2.9) with
u replaced by

and finally compute

using this

. This remark raises the interesting point of how much efficiency is gained by the two-stage locally efficient

compared to the single stage

computed using

and the preliminary

. Our simulation study in Section 4 explores this issue using the natural, easy to compute, choice of
u (
Y, V) =
h (
Y, V;
γ) /
γ for the preliminary estimator

. For this choice, the theoretical results of
Rotnitzky and Robins, 1997, imply that the single stage and the two-stage estimators are equally asymptotically efficient if
q (
Y, V) = 0 but the two-stage estimator has strictly smaller asymptotic variance if
q (
Y, V) ≠ 0. For ease of reference, we call the single stage estimator

that uses

with
u (
Y, V) =
h (
Y, V;
γ) /
γ the DR1 estimator and we call the two-stage estimator

using this

as the preliminary estimator of
γ, the DR2 estimator.
Alonzo and Pepe (AP), 2005, derived a locally-efficient double-robust estimator that, except for a minor difference in the computation of

, is computed identically to our
DR2 estimator in the case
q = 0. Note that when
q = 0, the function
uopt (
Y, V) reduces to

. The distinction in the computation of

between our procedure and that of AP is that our

solves (
2.9) with
u (
Y, V) replaced by

while AP's

solves (
2.9) but with
u (
Y, V) replaced by the function
u (
Y, V; γ) =
π (
Y, V; γ) ×
h (
Y, V; γ) /
γ, depending on the unknown
γ. It is easy to show that both estimators of
γ are asymptotically equivalent and efficient. Consequently, when
q = 0, our DR2 estimator is asymptotically equivalent to AP's estimator of
θ.
The following Theorem whose proof is sketched in the
Appendix establishes the asymptotic properties of the estimator

computed using any fixed function
u (
Y, V) . In particular, it establishes the double-robustness property of

. It also provides a consistent variance estimator when model

, condition (
2.7) and either model (
2.11) or model (
2.8) holds. Further details can be found in
Fluss (2006). To state the theorem define
θ1 =
Se (
c) and
θ2 =
Sp (
c),
κ =
E (
X),
where
γ* and
μ* are the probability limits of

and

. Note that under standard regularity conditions for L-estimators (see for example,
van der Vaart (1998)),
γ* =
γ0 when model (
2.8) holds and
μ* =
μ0 when model (
2.11) holds.
Theorem 1 Suppose that model

and (
2.7) hold. Under standard regularity conditions for L-estimators, if model (
2.8) and (
2.7) hold or if model (
2.11) holds,

, where Ω = Cov(M). Ω can be consistently estimated with
where
2.1 Illustrative Example
To illustrate our double-robust estimator of the ROC curve we used simulated data following the model used by
Rotnitzky et al. (2006) in their simulation study. We first generated for each of n=200 subjects a binary disease indicator
X ~
Bernoulli(0.3), such that approximately 30% of subjects were diseased. We further simulated a continuous marker
Y from the model
Y|
X =
x ~
N (0.37
x, 0.5
2) , a binary covariate
V =
I (
X – 0.3 +
ε > 0) (and hence conditionally independent of
Y) where
ε ~
N (0, 0.3
2) , and a response indicator
R following the selection for verification model (
2.8) with
h (
Y, V) =
γ0 +
γ1Y +
γ2V and
q (
Y, V) =
β with (
γ0,
γ1,
γ2) = (3, −2, −3) and
β = −1. The values of (
γ0,
γ1,
γ2,
β) were chosen so that roughly 70% of the
X′
s would be missing. The covariate
V was purposely chosen to have a large correlation (roughly 0.75) with
X so that the bias of the estimator based on the verified patients only, be large. The empirical DR1 ROC curves were obtained by estimating
Se (
c) and
Sp (
c) as described above for every
c ![[sm epsilon]](/corehtml/pmc/pmcents/x220A.gif)
{
y1, ...,
yn} where the
yi's represent the 200 simulated values of
Y. illustrates several estimated ROC curves using one simulated sample along with the true and unknown ROC curve (smooth line). For comparison we added the biased estimator (NAIVE) using formula (
2.1) based on verified subjects only and the (unfeasible) complete data estimator (COMPLETE) using the entire sample. We also added the estimator assuming missing at random (MAR) using
q = 0. Notice the large bias of the NAIVE curve whereas the DR1 is closer to both the complete data estimator and true ROC. DR2 is very similar to DR1 and is not exhibited. The MAR curve is farther from the true ROC than DR1.