Home | About | Journals | Submit | Contact Us | Français |

**|**HHS Author Manuscripts**|**PMC3475507

Formats

Article sections

Authors

Related links

Comput Stat Data Anal. Author manuscript; available in PMC 2012 October 18.

Published in final edited form as:

PMCID: PMC3475507

NIHMSID: NIHMS82944

The publisher's final edited version of this article is available at Comput Stat Data Anal

See other articles in PMC that cite the published article.

We consider the estimation of the parameters indexing a parametric model for the conditional distribution of a diagnostic marker given covariates and disease status. Such models are useful for the evaluation of whether and to what extent a marker’s ability to accurately detect or discard disease depends on patient characteristics. A frequent problem that complicates the estimation of the model parameters is that estimation must be conducted from observational studies. Often, in such studies not all patients undergo the gold standard assessment of disease. Furthermore, the decision as to whether a patient undergoes verification is not controlled by study design. In such scenarios, maximum likelihood estimators based on subjects with observed disease status are generally biased. In this paper, we propose estimators for the model parameters that adjust for selection to verification that may depend on measured patient characteristics and additonally adjust for an assumed degree of residual association. Such estimators may be used as part of a sensitivity analysis for plausible degrees of residual association. We describe a doubly robust estimator that has the attractive feature of being consistent if either a model for the probability of selection to verification or a model for the probability of disease among the verified subjects (but not necessarily both) is correct.

In evaluating the usefulness of a diagnostic marker an important question is whether and to what extent a marker’s diagnostic ability depends on the patient’s characteristics. To that extent, models for the distribution of the marker conditional on covariates and disease status play an important role in that they help elucidate this dependence. For instance, these models can be used to compute covariate specific measures of diagnostic accuracy such as, specificity, sensitivity, the receiver operating characteristic curve and the area under the curve (AUC) or the partial area, the Youden index and the likelihood ratio.

Estimation of diagnostic marker accuracy often relies on observational databases of medical records. Because disease verification is often expensive and/or invasive, doctors send their patients for the definite assessment of disease only if, based on the marker results and patient’s risk factors, they judge it necessary to do so. Consequently, verified subjects often differ from non-verified subjects in prognostic factors for disease. As such, estimators of diagnostic accuracy measures based on verified subjects only are generally biased.

A number of articles have proposed methods to correct the estimation of diagnostic test accuracy measures under verification bias, most of which are reviewed in the book of Zhou, Obuchowski, and McClish, 2002 and in Zhou, 1998a. The articles of Rodenberg and Zhou, 2000, Gray, Begg and Greenes, 1984, Zhou, 1998b, and Zhou and Riggs, 2000, Toledano and Gatsonis 1996, 1999, Zhou, 1996 considered only discrete covariates and assumed ignorable verification, i.e. that the probability of verification is conditionally independent of true disease status given the test result and covariates. However, ignorable verification is often an unrealistic assumption because the doctor’s decision to send a patient to verification is typically based on his/her detailed information on the patient’s health status, which cannot be accurately summarized by measured covariates, much less by a discrete covariate. Acknowledging this, Baker, 1995, Kosinski and Barnhart, 2003, and Zhou, 1993, 1994, Zhou and Rodenberg, 1998 and Zhou and Castelluccio, 2003 considered estimation under non-ignorable verification models. An important limitation of these papers is that they assumed that both the covariates and marker are discrete variables. Recently, Rotnitzky, Faraggi, and Schisterman, 2005, described estimators of the AUC under ignorable or non-ignorable verification bias, which contemplate the possibility that the marker is continuous and/or that the covariates are continuous and high dimensional.

In this paper we derive estimators of the parameters indexing a parametric model for the conditional distribution of a continuous or ordinal marker given disease status and, a possibly high dimensional vector of, covariates when the disease status is observed only on a subset of the sample. Our estimators adjust for selection to verification that may depend on measured patient characteristics and marker values, and additionally adjust for an assumed degree of residual association between selection to verification and true disease status due to unmeasured patient variables.

Our research was motivated by the investigation of the diagnostic properties of a cardiology device called electron beam computed tomography (EBCT). EBCT is a device routinely used to measure the degree of calcification in the arteries. It is unknown, however, if EBCT can also help screen for myocardial ischemia (reduced blood flow due to obstruction in the vessels) and if so, to what extent its screening ability depends on the patient’s risk factors. These are important questions because the other available noninvasive instrument for screening myocardial ischemia is the Dual Isotope Myocardial Perfusion Single Photon Emission Computed Tomography (SPECT) which is substantially more expensive than EBCT and contrary to EBCT, requires the ingestion of radioactive tracing material. To investigate this question we had available data collected at the Radiology Clinic of the Nuclear Imaging Group at Cedars Sinai Medical Center. Asymptomatic patients with high risk factors for coronary artery disease were referred by their doctors to have EBCT performed (Braun et al, 1996) at the clinic. However, of the 5130 males age 36–88 on whom EBCT was performed, only 379 patients were sent by the clinic doctors for assessment of myocardial ischemia. Verification status was significantly associated with daily aspirin use (p value < 0.01). Indeed, we can reasonably suspect that a doctor’s decision to send a patient to verification was based on the patient clinical records and not just on the data recorded at the radiology center, i.e. not on the data available to us for analysis. Thus, non-ignorable verification is most likely the case in this problem.

This paper is organized as follows. Section 2 formulates the problem, section 3 discusses identification issues and motivates the assumptions made in this paper to identify the model parameters. Section 4 describes the estimators and Section 5 describes an example.

Suppose that on each subject *i* independently sampled from a population of interest, *i* = 1…*n*, we measure a continuous or ordinal diagnostic test result (also referred throughout as the “marker”) *Y _{i}*, and a vector of additional health and demographic variables

Under our setting, the observed data are comprised of n iid copies *O _{i}* = (

$$f(Y\mid V,X)=f(Y\mid V,X;{\beta}_{0})$$

(1)

where for each *β*, *f* (*Y* |*V*, *X; β*) is a known conditional density and *β*_{0} is a unknown *p* ×1 vector.

To help motivate our estimation method and the mathematical results supporting the sensitivity analysis that we advocate in this paper, in this section we will consider nonparametric estimation of *f* (*Y* |*V*, *X*) under a non-parametric full-data model, that is, under a model that places no restrictions on the joint distribution of (*Y*, *X*, *V*).

The distribution *f* (*Y* |*V*, *X*) is not identified by the observed data. This can be easily seen by writing

$$f(Y\mid V,X=x)=\frac{\left[{\sum}_{r=0}^{1}Pr(X=x\mid Y,V,R=r)Pr(R=r\mid Y,V)\right]f(Y\mid V)}{\int {\sum}_{r=0}^{1}\{Pr(X=x\mid y,V,R=r)Pr(R=r\mid y,V)\}f(y\mid V)dy}$$

The expression Pr (*X* = *x*|*Y*, *V*, *R* = 0) in the right hand side is not identified by the observed data because it is the probability of disease among non-verified subjects with a given value of the marker and the covariates. It therefore follows that *f* (*Y* |*V*, *X* = *x*) is also not identified.

To identify *f* (*Y* |*V*, *X*) we will assume a pattern-mixture model (Little and Rubin, 2002, Little, 1994, Robins, Rotnitzky, and Scharfstein, 2000) that specifies the connection between distributions Pr (*X* = 1|*Y*, *V*, *R* = 1) and Pr (*X* = 1|*Y*, *V*, *R* = 0). Specifically, we will assume that

$$Pr(X=1\mid Y,V,R=0)=\frac{Pr(X=1\mid Y,V,R=1){e}^{q(Y,V)}}{c(Y,V)}$$

(2)

where *q*(*y*, *v*) is an arbitrary, specified, function and

$$c(Y,V)\equiv E\left[{e}^{q(Y,V)X}\mid Y,V,R=1\right]=1-Pr(X=1\mid Y,V,R=1)\left[1-{e}^{q(Y,V)}\right]$$

is a normalizing factor.

The density *f* (*Y* |*V*, *X*) is identified under model 2. To see this note that since Pr (*X* = 1|*Y*, *V*, *R* = 1) is a functional of the observed data distribution and the right hand side of (2) depends only on Pr (*X* = 1|*Y*, *V*, *R* = 1), then Pr (*X* = 1|*Y*, *V*, *R* = 0) is identified under (2).

It is clear that since any function *q* (*y*, *v*) can be chosen, then *q* (*y*, *v*) is not identified by the observed data. Since the data provide no information to guide the choice of the function *q*(*y*, *v*) then one reasonable analytical strategy, the one advocated in this paper, is to specify a range of plausible functions and repeat the estimation of *β*_{0}, each time regarding a different function *q* (*y*, *v*) fixed and known as a form of sensitivity analysis.

Models that make explicit assumptions about the verification process are referred to as selection models (Little and Rubin, 2002, Little, 1994, Robins, Rotnitzky, and Scharfstein, 2000). The pattern mixture model (2) is indeed also a selection model. Specifically, by Bayes’ rule, assumption (2) is equivalent to

$$log\left\{\frac{Pr(R=0\mid Y,X,V)}{Pr(R=1\mid Y,X,V)}\right\}=h(Y,V)+q(Y,V)X$$

(3)

where

$${e}^{h(Y,V)}=\frac{Pr(R=0\mid Y,V)}{Pr(R=1\mid Y,V)}\times \frac{1}{c(Y,V)}$$

(4)

This alternative expression provides yet another interpretation for *q*(*Y*, *V*). Specifically, *q*(*Y*, *V*) quantifies the dependence on the unobserved disease status of the logarithm of the odds of non-verification after adjustment for the marker and the observed covariates. Thus *q*(*Y*, *V*) is a measure of the degree of the residual association between *X* and the probability of verification. It reflects the degree of belief by the investigator about the extent of verification bias after accounting for (*Y*, *V*). The choice *q*(*Y*, *V*) = 0 for all *V* and *Y* corresponds to the assumption that the missing disease status is missing at random, that is

$$Pr(R=1\mid Y,X,V)=Pr(R=1\mid Y,V)$$

(5)

The identifiability considerations discussed so far assumed that the model for the full data did not place restrictions on the distribution of (*Y*, *V*, *X*). However, in section 2 we indicated that our goal was to estimate *f* (*Y* |*V*, *X*) under a parametric model for it. Because of this parametric restriction, *q* (*Y*, *V*) may be identified. One could therefore consider estimating *q* (*Y*, *V*), rather than regarding it as fixed and known. The methods described in this paper do not require that *q* (*Y*, *V*) be known. However, we prefer to continue to regard *q*(*Y*, *V*) as fixed and known and vary it in a sensitivity analysis. Our reason for this is the following. We consider estimation *f* (*Y* |*V*, *X*) under a parametric model only because with the sample sizes found in practice, estimation of *f* (*Y* |*V*, *X*) is unfeasible if *V* is high dimensional due to the curse of dimensionality. However, we would like these dimension reducing considerations to have as little impact as possible on the validity of inferences about *f* (*Y* |*V*, *X*). Estimation of *q* (*Y*, *V*) under a parametric model for *f* (*Y* |*V*, *X*) has to necessarily be highly sensitive to misspecification of this parametric model because the identifiability of *q* (*Y*, *V*) depends solely on the distributional shape assumptions about *f* (*Y* |*V*, *X*). Thus, in order to reduce the impact of model misspecification, we prefer to take the point of view that *q* (*Y*, *V*) should not be estimated, but instead varied in a sensitivity analysis. In any case, for completeness, in section 4.2.1 we explain how our proposed estimators can be extended to accommodate models in which *q* (*Y*, *V*) is assumed unknown and parametrically modelled. We shall denote with
(*q*) the model defined by the restrictions (1) and (2) with *q* fixed and known.

Note that assumption (3) implicitly assumes that all subjects have a positive probability of being selected for verification, regardless of the marker values and covariates. That is, the assumption implies that

$$Pr(R=1\mid Y,X,V)>0\phantom{\rule{0.38889em}{0ex}}\text{with}\phantom{\rule{0.16667em}{0ex}}\text{probability}\phantom{\rule{0.16667em}{0ex}}1.$$

(6)

This assumption may not always be realistic, especially when extreme values of the marker or covariates provide convincing evidence in favor or against the presence of disease so that clinicians may never send such patients for verification.

Though *β*_{0} is identified under model
(*q*), unfortunately its estimation is unfeasible due to the curse of dimensionality. That this is the case follows from the results stated in Theorem 1 below. Specifically, this Theorem indicates that if a regular asymptotically linear estimator of *β*_{0} existed, its influence function would have to be, up to a multiplicative constant matrix, of the form

$$U({\beta}_{0},h,\varphi ,\phi )={\mathrm{\Delta}}_{\varphi}(Y,V,X)+\left[\frac{R}{\pi (h)}-1\right]\phantom{\rule{0.16667em}{0ex}}[{\mathrm{\Delta}}_{\varphi}(Y,V,X)-\phi (Y,V)]$$

(7)

for some *p* × 1 vector functions *ϕ* (*Y*, *V*, *X*) and (*Y*, *V*) where

$${\mathrm{\Delta}}_{\varphi}(Y,V,X)\equiv \varphi (Y,V,X)-{E}_{{\beta}_{\mathbf{0}}}[\varphi (Y,V,X)\mid V,X].$$

(8)

and *π* (*h*) {1 + exp [*h* (*Y*, *V*) + *q* (*Y*, *V*) *X*]}^{−1}. Since by construction Δ* _{ϕ}* (

$$E\left\{\left[\frac{R}{\pi (h)}-1\right][{\mathrm{\Delta}}_{\varphi}(Y,V,X)-\phi (Y,V)]\right\}=0$$

However, in section 4.3 we show that if *h*^{†} ≠ *h* then *U* (*β*_{0}, *h*^{†}, *ϕ*, ) does not have mean zero under *f* (*Y* |*X*, *V ; β*_{0}) unless (*Y*, *V*) = * _{DR}* (

$${\phi}_{DR}(Y,V)\equiv \frac{{\sum}_{x=0}^{1}{\mathrm{\Delta}}_{\varphi}(Y,V,x)exp\{xq(Y,V)\}Pr(X=x\mid R=1,Y,V)}{{\sum}_{x=0}^{1}exp\{xq(Y,V)\}Pr(X=x\mid R=1,Y,V)}$$

(9)

This implies that estimation of *β*_{0} under model
(*q*) requires preliminary consistent estimation of either * _{DR}* (

In this subsection we will consider estimation of *β*_{0} under a model
(*q*) defined like
(*q*) but with the additional restriction

$$h(y,v)=h(y,v;{\gamma}_{0})$$

(10)

where for each (*y*, *v*), *h*(*y*, *v; γ*) is a known smooth function of *γ*, and *γ*_{0} is a *r* × 1 vector of unknown parameters.

Model
(*q*) assumes that Pr(*R* = 1|*Y*, *X*, *V*) follows the (parametric) logistic regression model

$$log\left\{\frac{Pr(R=0\mid Y,X,V)}{Pr(R=1\mid Y,X,V)}\right\}=h(Y,V;{\gamma}_{0})+q(Y,V)X$$

(11)

It is important to note that model
(*q*) is not a parametric model for the joint distribution of (*Y*, *X*, *V*) since it leaves the distributions *f*(*X*|*V*) and *f* (*V*) unrestricted. Inference under model
(*q*), as opposed to likelihood based inference under a model that additionally imposes parametric restrictions on *f*(*X*|*V*) and *f* (*V*), has the advantage of being valid regardless of what the true distributions *f*(*X*|*V*) and *f* (*V*) are. In contrast, likelihood based inference under a fully parametric model is generally sensitive to misspecification of the model for *f*(*X*|*V*), thus violating the essential property of regression estimators with full data, of being consistent regardless of what the distribution of the covariates is. Specifically, suppose that models *f*(*V; η*_{1}) and *f*(*X*|*V; η*_{2}) for *f* (*V*) and *f* (*X*|*V*) indexed by finite dimensional parameters *η*_{1} and *η*_{2} were assumed. In such case, the likelihood contribution of each patient would be

$$L(\beta ,\gamma ,\eta )=f(V;{\eta}_{1}){\left\{\sum _{x=0}^{1}[1-Pr(R=1\mid Y,X=x,V;\gamma )]f(Y\mid X=x,V;\beta )Pr(X=x\mid V;{\eta}_{2})\right\}}^{1-R}\times {\{Pr(R=1\mid Y,X,V;\gamma )f(Y\mid X,V;\beta )Pr(X\mid V;{\eta}_{2})\}}^{R}$$

(12)

The term *f*(*V*; η_{1}) factors out of the likelihood, but Pr(*X*|*V*; η_{2}) doesn’t. Thus, consistent maximum likelihood estimation of *β*_{0} would require correctly specified models for *f*(*X*|*V; η*_{2}) in addition to correct specification of the distribution Pr(*R* = 1|*Y*, *X*, *V; γ*). Maximum likelihood about *β* is therefore sensitive to misspecification of both models.

Semiparametric model
(*q*) is defined by a semiparametric model for the law of the full data (*Y*, *X*, *V*) and a parametric non-ignorable model for the missingness probabilities. A general theory for inference in these models has been derived by Rotnitzky and Robins, 1997. In particular, these authors derived a general representation for the set of all influence functions for regular asymptotically linear estimators of finite dimensional parameters (*β*, *γ*) where, as in our problem, *β* is a parameter that depends on the full-data distribution and *γ* indexes a parametric model for the missingness probabilities. The following Theorem, specializes the results in Rotnitzky and Robins to obtain the representation of the set orthogonal to the nuisance tangent space for (*β*, *γ*) in model
(*q*). We subsequently use this representation to derive a class of estimating equations whose solutions are, up to asymptotic equivalence, all regular asymptotically linear estimators of (*β*_{0}, *γ*_{0}).

in model (q) defined like model (q) but with the additional restriction

$$Pr(R=1\mid Y,X,V)>\sigma >0$$

(13)

*for some σ* > 0, *the orthogonal complement to the nuisance tangent space for* (*β*, *γ*) *is equal to*

$${\mathrm{\Lambda}}_{\mathit{nuis}}^{\perp}=\left\{U({\beta}_{0},{\gamma}_{0};\varphi ,\phi )\equiv \frac{R}{\pi ({\gamma}_{0})}{\mathrm{\Delta}}_{\varphi}(Y,V,X)-A({\gamma}_{0};\phi ):E\left[\varphi {(Y,V,X)}^{2}\right]<\infty ,E\left[\phi {(Y,V)}^{2}\right]<\infty \right\}$$

*where for any γ*

$$\begin{array}{l}\pi (\gamma )\equiv Pr(R=1\mid Y,X,V;\gamma )\equiv {[1+exp\{h(Y,V;\gamma )+q(Y,V)X\}]}^{-1}\\ A(\gamma ;\phi )\equiv \left[\frac{R}{\pi (\gamma )}-1\right]\phi (Y,V)\end{array}$$

(14)

*and* Δ* _{ϕ}* (

It is a well known result from semiparametric theory that the influence functions of RAL estimators of (*β*_{0}, *γ*_{0}) are elements of the orthogonal complement to the nuisance tangent space. It follows from the previous theorem that if a RAL estimator of (*β*_{0}, *γ*_{0}) exists, then it must be asymptotically equivalent to a solution ((*ϕ*, ), (*ϕ*, )) to an estimating equation of the form

$${E}_{n}[U(\beta ,\gamma ;\varphi ,\phi )]=0$$

(15)

where *ϕ* (*Y*, *V*, *X*) and (*Y*, *V*) are now random vectors with dimension (*p* + *r*) × 1 and throughout we use the notational convention that for arbitrary random vectors
${B}_{1},\dots ,{B}_{n},{E}_{n}(B)\equiv {n}^{-1}{\sum}_{i=1}^{n}{B}_{i}$. This is so because, under regularity conditions, using standard Taylor expansion arguments it can be shown that

$$\sqrt{n}\left\{{\left(\widehat{\beta}(\varphi ,\phi ),\widehat{\gamma}(\varphi ,\phi )\right)}^{\prime}-{({\beta}_{0},{\gamma}_{0})}^{\prime}\right\}=\frac{1}{\sqrt{n}}\mathrm{\Gamma}{({\beta}_{0},{\gamma}_{0};\varphi ,\phi )}^{-1}\sum _{i=1}^{n}{U}_{i}({\beta}_{0},{\gamma}_{0};\varphi ,\phi )+{o}_{p}(1)$$

where

$${\mathrm{\Gamma}({\beta}_{0},{\gamma}_{0};\varphi ,\phi )=\frac{\partial}{\partial {(\beta ,\gamma )}^{\prime}}E[U(\beta ,\gamma ;\varphi ,\phi )]|}_{(\beta ,\gamma )=({\beta}_{0},{\gamma}_{0})}$$

A heuristic motivation for the class of estimating equations (15) is as follows. Consider first estimation of *β*_{0} when *X* is always observed. Any RAL estimator of *β*_{0} has to be asymptotically equivalent to an estimating equation of the form

$${E}_{n}\{{\mathrm{\Delta}}_{\varphi}(Y,V,X;\beta )\}=0$$

(16)

where

$${\mathrm{\Delta}}_{\varphi}(Y,V,X;\beta )\equiv \varphi (Y,X,V)-\int \varphi (y,X,V)f(y\mid X,V;\beta )dy$$

That this must be the case follows from the fact, that any mean zero function of (*Y*, *X*, *V*) that is orthogonal to nuisance scores in the full data model must be of the form Δ* _{ϕ}* (

When, as in our problem, *X* is not observed for some subjects and the response probabilities depend on *Y* we can no longer use the estimating equation (16) since subjects with observed *X* are a biased sample, and for them Δ* _{ϕ}* (

If the response probabilities *π*(*γ*_{0}) were known, we could estimate *β* solving the inverse probability weighted estimating equation

$${E}_{n}\left\{\frac{R}{\pi ({\gamma}_{0})}{\mathrm{\Delta}}_{\varphi}(Y,V,X;\beta )\right\}=0$$

This equation is unbiased because

$$E\left\{\frac{R}{\pi ({\gamma}_{0})}{\mathrm{\Delta}}_{\varphi}(Y,V,X;{\beta}_{0})\right\}=E\left\{\frac{E(R\mid Y,V,X)}{\pi ({\gamma}_{0})}{\mathrm{\Delta}}_{\varphi}(Y,V,X;{\beta}_{0})\right\}=0$$

Since *γ*_{0} is unknown we must replace it by a consistent estimator of it. However, we can’t estimate *γ*_{0} fitting a logistic regression model of *R* on *X*, *V* and *Y* with offset *q* (*Y*, *V*) because *X* is missing when *R* = 0. Instead, we can solve the estimating equation

$${E}_{n}[A(\gamma ,{\phi}_{\gamma})]=0$$

(17)

where * _{γ}* is an arbitrary, user specified,

$${E}_{n}\left\{\frac{R}{\pi (\widehat{\gamma})}{\mathrm{\Delta}}_{\varphi}(Y,V,X;\beta )\right\}=0$$

(18)

will have a solution * _{w}* that will be consistent and asymptotically normal for estimating

The solution * _{w}* is not entirely satisfactory because it may be very inefficient. Specifically, data from subjects with

Suppose that instead of regarding *q* (*Y*, *V*) as known, we assume a parametric model

$$q(Y,V)=q(Y,V;{\xi}_{0})$$

where for each (*y*, *v*), *q* (*y*, *v; ξ*) is a smooth function of *ξ* and *ξ*_{0} is an unknown *t* × 1 parameter vector. Theorem 1 remains true with the following modifications:

$${\mathrm{\Lambda}}_{\mathit{nuis}}^{\perp}=\left\{U({\beta}_{0},{\gamma}_{0},{\xi}_{0};\varphi ,\phi )\equiv \frac{R{\mathrm{\Delta}}_{\varphi}(Y,V,X)}{\pi ({\gamma}_{0},{\xi}_{0})}-A({\gamma}_{0},{\xi}_{0};\phi ):E\left[\varphi {(Y,V,X)}^{2}\right]<\infty ,E\left[\phi {(Y,V)}^{2}\right]<\infty \right\}$$

where for any *γ*, *ξ*,

$$\begin{array}{l}\pi (\gamma ,\xi )\equiv Pr(R=1\mid Y,X,V;\gamma ,\xi )\equiv {[1+exp\{h(Y,V;\gamma )+q(Y,V;\xi )X\}]}^{-1}\\ A(\gamma ,\xi ;\phi )\equiv \left[\frac{R}{\pi (\gamma ,\xi )}-1\right]\phi (Y,V)\end{array}$$

Consequently, if a RAL estimator of (*β*, *γ*, *ξ*) exists, it will be asymptotically equivalent to a solution ((*ϕ*, ), (*ϕ*, ), (*ϕ*, ) to an estimating equation of the form

$${E}_{n}[U(\beta ,\gamma ,\xi ;\varphi ,\phi )]=0$$

where *ϕ*(*Y*, *V*, *X*) and (*Y*, *V*) are now random vectors with dimension (*p* + *r* + *t*) × 1.

The estimators (*ϕ*, ) of the previous subsection can be severely biased if the model (10) is misspecified because some verified subjects may receive inappropriate weight. Interestingly, there is a specific choice of (*Y*, *V*) that gives some protection to misspecification of (10).

Specifically, consider the function * _{DR}* (

$${\phi}_{DR}(Y,V)=\frac{E[{e}^{q(Y,V)X}{\mathrm{\Delta}}_{\varphi}(Y,X,V)\mid R=1,Y,V]}{E[{e}^{q(Y,V)X}\mid R=1,Y,V]}$$

(19)

$$=E[{\mathrm{\Delta}}_{\varphi}(Y,X,V)\mid R=0,Y,V]$$

(20)

where the last equality is true under (2). Remarkably, it can be shown that *U*(*β*_{0}, *γ**; *ϕ*, * _{DR}*) has mean zero even if

$$E[U({\beta}_{0},{\gamma}^{\ast};\varphi ,{\phi}_{DR})]=0$$

(21)

Equation (21) implies that if * _{DR}*(

$${E}_{n}\left\{\frac{R}{\pi (\widehat{\gamma})}{\mathrm{\Delta}}_{\varphi}(Y,V,X)-\left[\frac{R}{\pi (\widehat{\gamma})}-1\right]{\widehat{\phi}}_{DR}(Y,V))\right\}=0$$

where *ϕ* (*Y*, *V*, *X*) is an arbitrary *p* × 1 function of (*Y*, *V*, *X*) and solves

$${E}_{n}\{A(\gamma ,{\phi}_{\gamma})\}=0$$

for an arbitrary *r* × 1 function * _{γ}* (

$$Pr(X=1\mid R=1,Y,V)=m(Y,V;{\mu}_{0})$$

(22)

where for each *y*, *v*, *m* (*y*, *v; μ*) is a smooth function of *μ* and *μ*_{0} is an unknown *l* × 1 parameter vector.

A consistent estimator of *μ*_{0} under model (22) can be obtained by solving unbiased estimating equations for binary regression (based on the subjects with observed *X’s*) of the form

$${E}_{n}\{H(\mu ;d)\}=0$$

(23)

where *H* (*μ; d*) *Rd* (*Y*, *V*) {*X* – *m* (*Y*, *V ; μ*)} and *d* (*Y*, *V*) is a user specified *l* ×1 function of *Y* and *V*.

Define * _{DR}* (

$$M(\beta ,\gamma ,\mu ,;\varphi ,{\phi}_{\gamma},d)\equiv \left[\begin{array}{c}U(\beta ,\gamma ,\mu ,\varphi )\\ A(\gamma ;{\phi}_{\gamma})\\ H(\mu ;d)\end{array}\right]$$

where *ϕ*(*Y*, *V*, *X*) and * _{γ}*(

The following Theorem establishes the asymptotic distribution of an estimator * _{DR}* (

$${E}_{n}\{M(\beta ,\gamma ,\mu ,;\varphi ,{\phi}_{\gamma},d)\}=0$$

(24)

The Theorem establishes that the estimator * _{DR}* (

$$\begin{array}{l}\mathrm{\Upsilon}\equiv E\left[{\frac{\partial U({\beta}_{0},\gamma ,{\mu}^{\ast};\varphi )}{\partial \gamma}|}_{\gamma ={\gamma}^{\ast}}\right]{\left\{E\left[{\frac{\partial A(\gamma ;{\varphi}_{\gamma})}{\partial {\gamma}^{\prime}}|}_{\gamma ={\gamma}^{\ast}}\right]\right\}}^{-1}\times A({\gamma}^{\ast};{\phi}_{\gamma})\\ \mathrm{\Psi}\equiv E\left[{\frac{\partial U({\beta}_{0},{\gamma}^{\ast},\mu ;\varphi )}{\partial \mu}|}_{\mu ={\mu}^{\ast}}\right]{\left\{E\left[{\frac{\partial H(\mu ;d)}{\partial {\mu}^{\prime}}|}_{\mu ={\mu}^{\ast}}\right]\right\}}^{-1}\times H({\mu}^{\ast};d)\end{array}$$

and

$$\mathrm{\sum}=E\left[{\frac{\partial U(\beta ,{\gamma}^{\ast},{\mu}^{\ast};\varphi )}{\partial {\beta}^{\prime}}|}_{\beta ={\beta}_{0}}\right]$$

*Suppose model*
(*q*) *holds. Then under regularity conditions, if either* (10) *and* (13) *hold or* (22) *holds*,

*the estimating equation*(24)*has a solution*((_{DR}*ϕ*,,_{γ}*d*), , ) which satisfies$$\sqrt{n}\left\{{\widehat{\beta}}_{DR}(\varphi ,{\phi}_{\gamma},d)-{\beta}_{0}\right\}={\mathrm{\sum}}^{-1}\sqrt{n}{E}_{n}\{U-\mathrm{\Upsilon}-\mathrm{\Psi}\}+{o}_{p}(1)$$*Thus*, $\sqrt{n}\left\{{\widehat{\beta}}_{DR}(\varphi ,{\phi}_{\gamma},d)-{\beta}_{0}\right\}\to N\left(0,{\mathrm{\sum}}^{-1}\mathrm{\Gamma}{\mathrm{\sum}}^{-1\prime}\right)$*where*Γ =*var*(*U*– – Ψ).*A consistent estimator of*Σ^{−1}ΓΣ^{−1′}*is given by*^{−1}^{−1′}*where*$\widehat{\mathrm{\sum}}={E}_{n}\left[{\frac{\partial U(\beta ,\widehat{\gamma},\widehat{\mu};\varphi )}{\partial {\beta}^{\prime}}|}_{\beta ={\widehat{\beta}}_{DR}}\right],\phantom{\rule{0.16667em}{0ex}}\widehat{\mathrm{\Gamma}}={E}_{n}\left(\left(\widehat{U}-\widehat{\mathrm{\Upsilon}}-\widehat{\mathrm{\Psi}}\right)\phantom{\rule{0.16667em}{0ex}}{\left(\widehat{U}-\widehat{\mathrm{\Upsilon}}-\widehat{\mathrm{\Psi}}\right)}^{\prime}\right)$*with*(_{DR}(_{DR}*ϕ*,,_{γ}*d*),*Û**U*(, , ,_{DR}*ϕ*),$$\widehat{\mathrm{\Upsilon}}\equiv {E}_{n}\left[{\frac{\partial U({\widehat{\beta}}_{DR},\gamma ,\widehat{\mu};\varphi )}{\partial \gamma}|}_{\gamma =\widehat{\gamma}}\right]{\left\{{E}_{n}\left[{\frac{\partial A(\gamma ;{\phi}_{\gamma})}{\partial {\gamma}^{\prime}}|}_{\gamma =\widehat{\gamma}}\right]\right\}}^{-1}\times A(\widehat{\gamma};{\phi}_{\gamma})$$*and*$$\widehat{\mathrm{\Psi}}\equiv {E}_{n}\left[{\frac{\partial U({\widehat{\beta}}_{DR},\widehat{\gamma},\mu ;\varphi )}{\partial \mu}|}_{\mu =\widehat{\mu}}\right]{\left\{{E}_{n}\left[{\frac{\partial H(\mu ;d)}{\partial {\mu}^{\prime}}|}_{\mu =\widehat{\mu}}\right]\right\}}^{-1}\times H(\widehat{\mu};d)$$

The asymptotic variance of * _{DR}* (

Our doubly-robust estimator of *β*_{0} requires that we specify parametric models (22) and (11) for Pr(*X* = 1|*R* = 1, *Y*, *V*) and Pr(*R* = 1|*Y*, *X*, *V*) respectively. However, these models may be incompatible in the sense that the restrictions they imply on the distribution *f*(*Y* |*X*, *V*) may never hold simultaneously. This apparent contradiction does not pose any practical difficulty. By posing two models, (22) and (11), we certainly obtain better protection against model misspecification than posing just model (11). The fact that (22) and (11) cannot hold simultaneously is inconsequential since our doubly-robust estimators are consistent and asymptotically normal so long as one, but not necessarily both models is correct.

On occasion, one might have available a vector containing a large number of variables *V* thought to be simultaneous predictors of verification and disease status. However, one might want to estimate a model for the distribution of the marker conditional only on a function *Z* = *b* (*V*) of *V*. As an example, in the application illustrated in section 5, we had available records on age and whether or not the patient was using aspirin daily. However, we were not interested in a refined assessment of how the predictive ability of the marker changed with age, primarily because we had insufficient a-priori information as to how to model this association adequately. Instead, we wanted to evaluate if, quite generally, the marker’s predictive accuracy was better in the elderly. As such, we considered *V* = (*V*_{1}, *V*_{2}), where *V*_{1} = age and *V*_{2} = indicator of aspirin use, and *Z* = (*I* (*V*_{1} ≤ 55), *V*_{2}). The results in this paper, including the double-robustness of * _{DR}* (

We applied the above methodology to the dataset described in the introduction. Electron-beam computarized tomography (EBCT) of the coronary arteries is a method for measuring coronary artery calcium, and is an established marker of atherosclerotic plaque. It is unknown however, whether EBCT is an effective screening tool for myocardial ischemia. This is an important question because the other available noninvasive instrument for screening myocardial ischemia is the Dual Isotope Myocardial Perfusion Single Photon Emission Computed Tomography (SPECT) which is substantially more expensive than EBCT and contrary to EBCT, requires the ingestion of radioactive tracing material.

Of the 5130 males age 36–88 on whom EBCT was performed at the radiology clinic, 379 patients were sent by the clinic doctors for assessment of myocardial ischemia with SPECT, and of them 28 were found positive. EBCT histograms stratified by verification and disease status (figures not shown), suggest that compared to the EBCT distribution of the non-verified subjects, the distributions of verified subjects are markedly skewed to the right indicating that subjects with lower marker values are less likely than others to be verified.

Here we consider estimation of the distribution of *Y* = log (*EBCT* + 1) conditional on the prognostic factors age and use of daily aspirin. We had available records on the variable *V* = (*V*_{1}, *V*_{2}) with *V*_{1} age and *V*_{2} the binary indicator of daily use of aspirin, but for reasons indicated in section 4.3.3 we were interested in fitting a model for *f* (*Y* |*Z*, *X*) where *Z* = (*Z*_{1}, *Z*_{2}), *Z*_{1} = *I* (*V*_{1} ≤ 55), *Z*_{2} = *V*_{2} and *X* is the indicator of coronary disease as determined by SPECT. Our model (1) for the conditional distribution of *Y* given covariates *Z* and disease status was

$$f(Y\mid X,Z;\beta )=p{(X,Z;{\beta}_{P})}^{1(Y=0)}{\left[\frac{1-p(X,Z;{\beta}_{P})}{\sqrt{2\pi {\sigma}^{2}(X,Z;{\beta}_{v})}}{e}^{\frac{{(Y-\omega (X,Z;{\beta}_{m}))}^{2}}{2{\sigma}^{2}(X,Z;{\beta}_{v})}}\right]}^{1(Y\ne 0)}$$

where *p* (*X*, *Z; β _{P}*) = (1 −

Our estimating equations used

$$\varphi (Y,Z,X;\beta )=\left[\begin{array}{c}W(Y-{W}^{\prime}{\beta}_{m}){e}^{-{B}^{\prime}{\beta}_{v}}I(Y>0)\\ B\left\{{e}^{-{B}^{\prime}{\beta}_{v}}{(Y-{W}^{\prime}{\beta}_{m})}^{2}-1\right\}I(Y>0)\\ DI(Y=0)I(X=0))\end{array}\right]$$

In our analysis we estimated *β* = (*β _{m}*,

Figures 1–4 display the estimated log-likelihood ratio function log {*f* (*y*|*Z* = *z*, *X* = 1; ) / *f*(*y*|*Z* = *z*, *X* = 0; )}for values of *z* representing the four sub-groups defined by age and aspirin use status and for values of *y* in the range of observed values of *Y*. The figure displays the estimated log-likelihood ratio using the doubly-robust estimates of *β* for the choices *q* = −3, *q* = 0 and *q* = 1. Overall, lower values of the marker appear effective at ruling out myocardial ischemia, however higher values of the marker do not appear particularly useful at ruling in disease. The marker has noticeably better ability for ruling out myocardial ischemia in the elderly, but it appears otherwise similar in the two aspirin groups. The markers ability for ruling in ischemia at higher values of the marker is similarly poor in the four groups.

For the most part the log-likelihood ratio estimated using only verified subjects is lower across all values of *y* than those based on adjustment for selection bias regardless of the value of *q*, indicating that an analysis based on verified only subjects will overestimate specificity but underestimate sensitivity. Interestingly, among those who were users of daily aspirin aged greater than 55 (the group most likely to have had their disease status verified) all the estimators resulted in estimates for the log-likelihood ratio that were not very different from each other. The results suggest that in settings with very severe residual selection bias in favour of diseased subjects (*q* = −3), the estimators assuming no residual selection bias (i.e. *q* = 0) would in general overestimate the likelihood ratio at lower values of the marker.

John Page was supported by grant # 0475016N of the American Heart Association and Andrea Rotnitzky was supported by NIH grants R29 GM48704 and R01 AI32475.

**Publisher's Disclaimer: **This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

1. Baker SG. Evaluating multiple diagnostic tests with partial verification. Biometrics. 1995;51:330–337. [PubMed]

2. Braun J, Oldendorf M, et al. Electron beam computed tomography in the evaluation of cardiac calcifications in chronic dialysis patients. Am J Kidney Dis. 1996;27:394–401. [PubMed]

3. Gray R„, Begg CB, Greenes RA. Construction of the receiver operating characteristic curves when disease verification is subject to selection bias. Medical Decision Making. 1984;4:151–164. [PubMed]

4. Hansen LP. Large sample properties of generalized methods of moments estimators. Econometrica. 1982;50:1029–1054.

5. Kosinski AS, Barnhart HX. Accounting for Non-ignorable Verification Bias in Assessment of Diagnostic Test. Biometrics. 2003;59(2) [PubMed]

6. Little R. A class of pattern-mixture models for normal missing data. Biometrika. 1994;81:471–483.

7. Little R, Rubin D. Statistical Analysis with Missing Data. J. Wiley; N. York: 2002.

8. Page J, Rotnitzky A. Technical Report. Harvard School of Public Health, Dep. of Biostatistics; 2008. Proofs of the assertions in the paper Estimation of the disease-specific diagnostic marker distribution under verification bias, by Page and Rotnitzky.

9. Qu A, Lindsay B, Li B. Improving generalized estimating equations using quadratic inference functions. Biometrika. 2000;87:823–836.

10. Robins JM, Rotnitzky A, Scharfstein DO. Sensitivity analysis for selection bias and unmeasured confounding in missing data and causal inference models. In: Halloran E, Berry D, editors. Statistical Models for Epidemiology, the Environment, and Clinical Trials. 2000. pp. 1–95.

11. Rodenberg C, Zhou XH. ROC curve estimation when covariates affect the verification process. Biometrics. 2000;56:131–136. [PubMed]

12. Rotnitzky A, Robins J. Analysis of semi-parametric regression models with non-ignorable non-response. Statistics in Medicine. 1997;16:81–102. [PubMed]

13. Rotnitzky A, Faraggi D, Schisterman E. Doubly robust estimation of the area under the receiver-operating characteristic curve in the presence of verification bias. To appear. Journal of the American Statistical Association Theory and Methods 2005

14. Rubin D. Inference and missing data. Biometrika. 1976;72:581–92.

15. Toledano A, Gatsonis C. Ordinal regression methodology for ROC curves derived from correlated data. Statistics in Medicine. 1996;15:1807–26. [PubMed]

16. Toledano A, Gatsonis C. Generalized estimating equations for ordinal categorical data: arbitrary patterns of missing responses and missingness in a key covariate. Biometrics. 1999;55:488–96. [PubMed]

17. Zhou X. Maximum likelihood estimators of sensitivity and specificity corrected for verification bias. Communications in Statistics - Theory and Methods. 1993;54:124–135.

18. Zhou XH. Effect of verification bias on positive and negative predictive values. Statistics in Medicine. 1994;13:1737–1745. [PubMed]

19. Zhou XH. Nonparametric ML estimate of a ROC area corrected for verification bias. Biometrics. 1996;52:310–316.

20. Zhou XH. Correcting for verification bias in studies of a diagnostic test’s accuracy. Statistical Methods in Medical Research. 1998a;7:337–353. [PubMed]

21. Zhou XH. Comparing the correlated areas under the ROC curves of two diagnostic tests in the presence of verification bias. Biometrics. 1998b;54:453–470. [PubMed]

22. Zhou XH, Rodenberg C. Estimating the ROC curve in the presence of nonignorable verification bias. Commun in Statistics, Theory and Methods. 1998;27:635–57.

23. Zhou XH, Higgs R. Assessing the relative accuracies of two screening tests in the presence of verification bias. Statistics in Medicine. 2000;19:1697–1705. [PubMed]

24. Zhou XH, Obuchowski NA, McClish . Statistical Methods in Diagnostic Medicine. Wiley; New York: 2002.

25. Zhou XH, Castelluccio P. Nonparametric analysis for the ROC areas of two diagnostic tests in the presence of nonignorable verification bias. Journal of Statistical Planning and Inference. 2003;115:193–213.

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |