We consider the logistic regression of disease status *Y* on covariates (*X*, **Z**), where *X* is unobservable and can only be measured with error, and the goal is to understand its effect on *Y*. In doing so, we assume that the logit of the disease probability is a linear function of **Z** and a possibly nonlinear function of *X* whose exact form is unknown, thus yielding a partially linear logistic model. Although we work with a logistic model, the method is applicable to any distributional model of partially linear form.

The commonly used classical measurement error model assumes that instead of observing *X*, for every subject one observes, possibly after data transformation, a surrogate *W* that is unbiased for *X* and involves purely random error with constant variance (Carroll et al., 2006). However, in large nutritional epidemiological studies, dietary intake for all individuals is usually measured by a food frequency questionnaire (FFQ), which we denote here by *Q*. It has become well appreciated in the literature that FFQs have substantial measurement error, both random and systematic, and therefore violate the classical measurement error assumptions. Our motivating example is the NIH-AARP Diet and Health Study (http://dietandhealth.cancer.gov/), the details of which can be found in Schatzkin et al. (2001). It is important to note that any case-control study of diet-cancer associations is subject to differential recall bias between cases and controls. Moreover, a homogeneous population with a narrow range of fat intake usually fails to reveal any association between fat intake and breast cancer. To circumvent these issues, the NIH-AARP Diet and Health Study targeted a large and diverse population in which diet was assessed prior to diagnosis. In this study, the initial cohort of 617,119 men and women who responded to an FFQ in 1995–1996 has been followed for the evaluation of possible diet-cancer associations. The latest database, which runs through 2003, contains information on 27 types of cancer; one of them, breast cancer, is the response variable in this paper. To adjust for FFQ measurement error in estimated relationships, the NIH-AARP study includes a so-called calibration sub-study with approximately 2000 men and women who, in addition to the FFQ, were administered two non-consecutive 24-hour dietary recalls (24hr), denoted by *W*. Following the study design, we assume that *W* follows the classical measurement error model. Thus the study design consists of the binary response *Y* (occurrence or non-occurrence of invasive breast cancer during follow-up), covariates without error **Z**, the true unobserved main exposure *X*, and the observed exposure *Q* that measures *X* with both bias and random error. In addition, a calibration sub-study includes both *Q* and *W* along with **Z**.

To handle the substantial random and systematic measurement error in the FFQ in a semiparametric fashion, we assume that conditional on (*X*, **Z**), the surrogate variable *Q* follows a partially linear model, where the effect of **Z** is linear and the effect of *X* on *Q* is unknown but assumed to be a smooth and monotone function. In fact, for our data example, we found that a linear regression between *Q* and the average of *W* is not sufficient to explain the association between these two variables, which indicates a possible nonlinear association between *Q* and *X*. Although a simple logistic regression of the occurrence of invasive breast cancer (*Y*) on the percentage of non-alcohol energy from total fat measured via FFQ (*Q*) revealed a significant association, we want to measure how the risk changes with the true value of the percentage of non-alcohol energy from total fat (*X*), taking into account that *X* is unobserved and *Q* contains substantial measurement error. Therefore, we model the effect of *X* on the logit of the success probability of *Y* via splines, and the effect of *X* on *Q* via monotone splines. Because we develop likelihood-based inference for our models, we need to specify the distribution of the latent variable *X*. Since the histogram of the average of *W* obtained from the calibration data and that of *Q* from the cohort study (see Web Figure 1) do not show strong evidence for the normal distribution assumption for *X*, we model the distribution of *X* nonparametrically via an infinite mixture of normal distributions. Indeed, the pattern of food intake is likely to vary, as the cohort members have diverse backgrounds and come from 6 different US cities and two metropolitan areas with large minority populations.
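The data structure described above can be made concrete with a small simulation. The following is only an illustrative sketch, not the model fitted in this paper: the function name `simulate_study`, the two-component mixture for *X*, the quadratic effect of *X* on the logit, the monotone function `x + 0.5 * tanh(x)` in the FFQ model, the normal error terms, and all numerical constants are hypothetical choices made for the example.

```python
import numpy as np


def simulate_study(n=20_000, n_calib=2_000, seed=1):
    """Simulate (Y, Q, Z) for a cohort and two recalls W for a calibration subset."""
    rng = np.random.default_rng(seed)

    # Latent exposure X: a two-component normal mixture, used here as a
    # finite stand-in for a nonnormal exposure distribution (assumption).
    first = rng.random(n) < 0.6
    x = np.where(first, rng.normal(-0.5, 0.6, n), rng.normal(1.0, 0.4, n))

    # One error-free covariate Z (assumption: a single standard normal).
    z = rng.standard_normal(n)

    # Disease model: the logit is linear in Z and nonlinear in X.
    # The quadratic form and the coefficients are hypothetical.
    logit = -2.0 + 0.3 * z + 0.5 * x**2
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

    # FFQ model: linear in Z, smooth and increasing in X (hypothetical m),
    # plus random error, so Q measures X with both bias and noise.
    q = 0.2 * z + (x + 0.5 * np.tanh(x)) + rng.normal(0.0, 0.5, n)

    # Calibration sub-study: two recalls per subject under the classical
    # model W_j = X + U_j with constant error variance (normal U assumed).
    calib = np.arange(n_calib)
    w = x[calib, None] + rng.normal(0.0, 0.7, (n_calib, 2))

    return {"x": x, "z": z, "y": y, "q": q, "w": w, "calib": calib}
```

In such simulated data, the average of the two recalls is centered on *X*, whereas *Q* is not, which mirrors the distinction between the classical error model for *W* and the systematically biased FFQ. In the real study *X* is never observed; it is returned here only so that these properties can be checked.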

Before describing the novelty of our disease model and calibration model and our approach to handling them, we first point out that the regression calibration approach is not applicable in our context. In the parametric linear logistic model for the disease probability, i.e.,

$$\mathrm{pr}(Y = 1 \mid X, \mathbf{Z}) = H(\beta_0 + \beta_1 X + \mathbf{Z}^{\mathrm{T}} \boldsymbol{\beta}_2), \tag{1}$$

where *H*(*u*) = 1/{1 + exp(−*u*)} is the logistic distribution function, it has become common to apply regression calibration (RC) adjustment for measurement error, where one regresses *W* on *Q* in the calibration sub-study, then replaces *X* in (1) by the predictions from this regression. However, this method is well known to be undesirable in semiparametric models such as the partially linear logistic model, and indeed in our simulations it performs quite poorly. The reason is that RC assumes that in the induced observed data model pr(*Y* = 1|*W*, **Z**), the effect of *W* is conferred only through *E*(*X*|*W*, **Z**). However, if the effect of *X* in the logit of pr(*Y* = 1|*X*, **Z**) is nonlinear, the actual observed data model may be quite far from the assumed induced observed data model. An example of this is known in ordinary nonparametric regression, where if
*E*(*Y*|*X*) = sin(2*X*), *X*, *U* ~ Normal(0, 1), and *W* = *X* + *U*, then the approximation *E*(*Y*|*W*) ≈ sin{2*E*(*X*|*W*)} is systematically biased for the true regression function.
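The size of this bias can be checked numerically. The sketch below assumes, in addition to what is stated above, that *X* and *U* are independent and that *Y* equals sin(2*X*) plus a small independent normal noise term; the function name `rc_vs_truth`, the noise level, and the binning scheme are choices made for this illustration. Under these assumptions, *X* given *W* is Normal(*W*/2, 1/2), so the RC-type approximation is sin(*W*), while the exact conditional mean works out to exp(−1) sin(*W*).

```python
import numpy as np


def rc_vs_truth(n=400_000, seed=0):
    """Compare binned means of Y given W with the RC-type approximation."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)            # true covariate X
    u = rng.standard_normal(n)            # measurement error U, independent of X
    w = x + u                             # observed surrogate W
    y = np.sin(2.0 * x) + 0.1 * rng.standard_normal(n)  # noise level is an assumption

    # Empirical E(Y | W): average Y within eight equal-width bins of W on [-2, 2].
    edges = np.linspace(-2.0, 2.0, 9)
    centers = 0.5 * (edges[:-1] + edges[1:])
    idx = np.digitize(w, edges) - 1       # values outside [-2, 2) fall outside 0..7
    empirical = np.array([y[idx == k].mean() for k in range(len(centers))])

    # RC-type approximation sin{2 E(X | W)}, using E(X | W) = W / 2.
    rc = np.sin(centers)
    # Closed form of E{sin(2X) | W} when X | W ~ Normal(W / 2, 1 / 2).
    exact = np.exp(-1.0) * np.sin(centers)
    return centers, empirical, rc, exact
```

Calling `rc_vs_truth()` and comparing the returned arrays shows the binned means tracking `exact` closely while `rc` overstates the amplitude of the regression function by a factor of about e, so substituting *E*(*X*|*W*) for *X* does not recover the observed-data regression when the effect of *X* is nonlinear.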

Carroll and Hall (1988), Fan and Truong (1993) and Delaigle and Hall (2008) considered deconvolution methods to deal with classical measurement error in *X* for a nonparametric regression of *Y* on *X*. However, such methods do not accommodate a systematically biased surrogate such as the FFQ, nor were they designed to handle the partially linear logistic model.

There are some related Bayesian methodologies, but none of them handles the generality of the problem we confront. Berry et al. (2002) used smoothing splines and regression splines for the classical measurement error problem in a linear model setup, but not in the important case of binary data. Carroll et al. (2004) used Bayesian spline-based regression when an instrument is available for all study participants. In addition, both papers assumed that the unknown *X* is normally distributed. Mallick and Gelfand (1996) considered covariate measurement error in the generalized linear model with unknown link function, where the distribution of *X* was modeled via a multivariate normal distribution. Müller and Roeder (1997) considered a multivariate normal mixture of Dirichlet process prior for handling covariate measurement error in case-control studies. Bayesian nonparametric regression approaches for binary data without measurement error in the covariate have been considered by Wood and Kohn (1998), Wood et al. (2002) and Holmes and Mallick (2003), among others. In summary, all of the above-mentioned papers considered either (a) nonparametric regression without any measurement error or (b) measurement error in covariates with a parametrically specified regression model. None of these papers allowed for a partially linear model for the response variable *Y*, nor did they address a partially linear calibration model to handle systematic bias in the FFQ.

Johnson et al. (2007) considered a problem similar to ours, although the data structures are slightly different, since in their example the FFQ is replicated. The important differences are that in place of the semiparametric risk model for *Y* given (*X*, **Z**), they used the parametric model (1), and in place of the partially linear calibration model for *Q* given (*X*, **Z**), they used a linear model. The important similarity is that they, like us, used a mixture of Dirichlet process (DP) model for the distribution of the latent variable.

In summary, the three novel features of our approach are the following. First, we consider a semiparametric logistic model with a nonparametric component subject to measurement error. Second, we allow for the fact that in actual epidemiological practice, the vast majority of the data only have a systematically biased measure of the true risk covariate, and we handle this feature via a semiparametric model with a monotone nonparametric component, the monotonicity being natural in the scientific context. Although the idea of systematic and random bias in *Q* has appeared in many papers in nutritional epidemiology (Kipnis et al., 2001, 2003), our consideration of a semiparametric model for the systematic component of the bias is new. Third, we model the distribution of the unobserved covariate nonparametrically via a Dirichlet process mixture of normal distributions. While the use of such models in general is not new (Johnson et al., 2007), using the idea in a semiparametric context has not previously been investigated. To the best of our knowledge, then, this is the first work in the semiparametric measurement error field where two smooth nonparametric functions are estimated simultaneously, one in the exposure-response association and the other in the association between the surrogate variable and the true exposure, while at the same time treating the distribution of the true exposure variable essentially nonparametrically. Moreover, the estimation of two nonparametric functions, where one depends on the other, is not an easy task, especially for binary regression when the covariate distribution is unknown. The simulation study and data analysis show the importance of such flexible models.

An outline of the paper is as follows. The model and assumptions are described in Section 2, while the method of estimation is described in Section 3. Section 4 contains the data analysis of the NIH-AARP Study. A simulation study is described in Section 5. Section 6 contains concluding remarks.