Search tips
Search criteria 


Logo of nihpaAbout Author manuscriptsSubmit a manuscriptNIH Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
Biometrics. Author manuscript; available in PMC Jun 1, 2011.
Published in final edited form as:
PMCID: PMC2888615
Semiparametric Bayesian Analysis of Nutritional Epidemiology Data in the Presence of Measurement Error
Samiran Sinha, Bani K. Mallick, Victor Kipnis, and Raymond J. Carroll
Samiran Sinha, Department of Statistics, Texas A&M University, College Station, Texas 77843-3143, sinha/at/;
We propose a semiparametric Bayesian method for handling measurement error in nutritional epidemiological data. Our goal is to estimate nonparametrically the form of association between a disease and exposure variable while the true values of the exposure are never observed. Motivated by nutritional epidemiological data we consider the setting where a surrogate covariate is recorded in the primary data, and a calibration data set contains information on the surrogate variable and repeated measurements of an unbiased instrumental variable of the true exposure. We develop a flexible Bayesian method where not only is the relationship between the disease and exposure variable treated semiparametrically, but also the relationship between the surrogate and the true exposure is modeled semiparametrically. The two nonparametric functions are modeled simultaneously via B-splines. In addition, we model the distribution of the exposure variable as a Dirichlet process mixture of normal distributions, thus making its modeling essentially nonparametric and placing this work into the context of functional measurement error modeling. We apply our method to the NIH-AARP Diet and Health Study and examine its performance in a simulation study.
Keywords: B-splines, Dirichlet process prior, Gibbs sampling, Measurement error, Metropolis-Hastings algorithm, Partly linear model
We consider the logistic regression of disease status Y on covariates (X, Z), where X is unobservable and can only be measured with error and the goal is to understand its effect on Y . In doing so we assume that the logit of the disease probability is a linear function of Z and possibly a nonlinear function of X but its exact form is unknown, thus yielding a partially linear logistic model. Although we work with a logistic model, the method is applicable to any distributional model of partially linear form.
The commonly used classical measurement error model assumes that instead of observing X, for every subject one observes, possibly after data transformation, a surrogate W which is unbiased for X involving purely random error with constant variance (Carroll et al., 2006). However, in large nutritional epidemiological studies, dietary intake for all individuals is usually measured by a food frequency questionnaire (FFQ), which we denote here by Q. It has become well appreciated in the literature that FFQs have substantial measurement error, both random and systematic, therefore violating the classical measurement error assumptions. Our motivating example is the NIH-AARP Diet and Health Study (, the details of which can be found in Schatzkin et al. (2001). Important to note that any case-control study for finding diet-cancer association is subject to differential recall bias between cases and controls. Secondly a homogeneous population with a narrow range of fat intake usually fails to find any association between fat intake and breast cancer. To circumvent these issues, in the NIH-AARP Diet and Health Study a large and diverse population was targeted where diet was assessed prior to diagnosis. In this study, the initial cohort of 617,119 men and women who responded to a FFQ in 1995–1996 has been followed for the evaluation of possible diet-cancer associations. The latest database till the year 2003 contained information on 27 types of cancer, and breast cancer is one of them, which is considered as the response variable in this paper. To adjust for FFQ measurement error in estimated relationships, the NIH-AARP study includes a so-called calibration sub-study with approximately 2000 men and women who, in addition to the FFQ, were administered two non-consecutive 24-hour dietary recalls (24hr) denoted by W. Following the study design, we assume W follows the classical measurement error model. Thus the study design consists of the binary response Y (occurrence or non-occurrence of invasive breast cancer during follow up), covariates without error Z, true unobserved main exposure X, observed exposure Q that measures X with both bias and random error. In addition, a calibration sub-study includes both Q and W along with Z.
To handle the substantial random and systematic measurement error in the FFQ in a semi-parametric fashion, we assume that conditional on (X, Z), the surrogate variable Q has a partially linear model, where the effect of Z is linear and the effect of X on Q is still unknown, but assumed to be a smooth and monotone function. In fact for our data example, we found that a linear regression between Q and the average of W is not sufficient to explain the nature of association between these two variables, which is an indication of a possible nonlinear association between Q and X. Although a simple logistic regression of the occurrence of invasive breast cancer (Y) on the percentage of non-alcohol energy from total fat measured via FFQ (Q) revealed a significant association, we want to measure how the risk changes with the true value of the percentage of non-alcohol energy from total fat (X) taking into account that X is unobserved and Q contains substantial measurement error. Therefore, we will model the effect of X on the logit of the success probability of Y via splines, and the effect of X on Q via monotone splines. Since we will develop a likelihood based inference for our models, we need to specify the distribution of the latent variable X. Since the histogram of the average of W obtained from the calibration data and that of Q from the cohort study (see Web Figure 1) do not show strong evidence for the normal distribution assumption for X, we model the distribution of X nonparametrically via an infinite mixture of normal distributions. In fact the pattern of food intake is likely to vary as the cohort members have a diverse background and are from 6 different US cities and two metropolitan areas with large minority populations.
Before describing the novelty of our disease model and the calibration model and our approach to handling them, we first point out that the regression calibration approach is not applicable in our context. In parametric linear logistic model for the disease probability i.e.,
equation M1
where H(u) = 1/{1 + exp(−u)} is the logistic distribution function, it has become common to apply regression calibration (RC) adjustment for measurement error, where one regresses W on Q in the calibration sub-study, then replaces X in (1) by the predictions from this regression. However, this method is well-known to be undesirable in semiparametric models such as the partially linear logistic model, and indeed in our simulations it performs quite poorly. The reason is that RC assumes that in the induced observed data model pr(Y = 1|W, Z) the effect of W is conferred only through E(X|W, Z). However, if the effect of X in the logit of pr(Y = 1|X, Z) is nonlinear, the actual observed data model may be quite far from the assumed induced observed data model. An example of this is known in ordinary nonparametric regression, where if E(Y |X) = sin(2X), X, U ~ Normal(0, 1), and W = X + U, then the approximation E(Y |W) ≈ sin{2E(X|W)} is systematically biased and out of phase with the true regression function.
Carroll and Hall (1988), Fan and Truong (1993) and Delaigle and Hall (2008) considered deconvolution method to deal with classical measurement error in X for a nonparametric regression of Y on X. However, such methods did not consider systematically biased surrogate such as FFQ. Also the methods were not designed to handle the partially linear logistic model.
There are some related Bayesian methodologies, but none of them handle the generality of the problem we confront. Berry et al. (2002) used smoothing splines and regression splines in the classical measurement error problem to a linear model set up, but not to the important case of binary data. Carroll et al. (2004) used Bayesian spline-based regression when an instrument is available for all study participants. In addition, both papers assumed that the unknown X is normally distributed. Mallick and Gelfand (1996) considered covariate measurement error in the generalized linear model with unknown link function, where the distribution of X was modeled via a multivariate normal distribution. Müller and Roeder (1997) considered the multivariate normal mixture of Dirichlet process prior for handling covariate measurement error in case-control studies. Bayesian nonparametric regression approaches without measurement error in the covariate for binary data have been considered by Wood and Kohn (1998), Wood et al. (2002) and Holmes and Mallick (2003), among others. In summary, all the above mentioned papers considered either (a) nonparametric regression without any measurement error or (b) measurement error in covariates while the regression model is parametrically specified. None of these papers allowed for the partially linear model for the response variable Y, nor did they even begin to address partially linear calibration model to handle systematic bias in FFQ. Johnson et al. (2007) considered a problem similar to ours, although the data structures are slightly different, since in their example the FFQ is replicated. The important differences are that in place of the semiparametric risk model for Y given (X, Z), they used the parametric model (1), and in place of the partially linear calibration model for Q given (X, Z), they used a linear model. The important similarity is that they, like us, used a mixture of Dirichlet process (DP) model for the distribution of the latent variable.
In summary the three novel features of our approach are the following. First, we consider a semiparametric logistic model with a nonparametric component subject to measurement error. Second, we allow for the fact that in actual epidemiological practice, the vast majority of the data only have a systematically biased measure of the true risk covariate, and we handle this feature via a semiparametric model with a monotone nonparametric component, the monotonicity being natural in the scientific context. Although the idea of systematic and random bias in Q has appeared in many papers in nutritional epidemiology (Kipnis et al., 2001, 2003), our consideration of a semiparametric model for the systematic component of the bias is new. Third, we model the distribution of the unobserved covariate nonparametrically via a Dirichlet process mixture of normal distributions. While the use of such models in general is not new (Johnson et al., 2007), using the idea in a semiparametric context has not previously been investigated. To the best of our knowledge then, this is the first work in the semiparametric measurement error field where two smooth nonparametric functions are estimated simultaneously, one in the exposure-response association, and the other in the association between the surrogate variable and the true exposure, while at the same time treating the distribution of the true exposure variable essentially nonparametrically. Moreover, the estimation of two nonparametric functions, where one is dependent on the other, is not an easy task, especially for binary regression when the covariate distribution is unknown. The simulation study and data analysis show the importance of such flexible models.
An outline of the paper is as follows. The model and assumptions are described in Sections 2, while the method of estimation is described in Section 3. Section 4 contains the data analysis of the NIH-AARP Study. A simulation study is described in Section 5. Section 6 contains concluding remarks.
2.1 Outline
In this section, we give the model structure for the observed data (Y, Q, Z, W) given the latent long term diet X(Section 2.2), the model structure for X (Section 2.3), and the models for the semiparametric functions that are in the model (Section 2.4).
2.2 Model Structure for Observed Data
Let Y be the binary response variable, X the covariate of interest, Q the surrogate variable for X, and Z a vector of error-free covariates. The primary data consist of (Yi, Qi, Zi) for i = 1, (...), n. In order to obtain information about the relationship of Qi and Xi we assume that there is an external calibration data where we observe Q, Z, and repeated measurements of some unbiased surrogate variable W. Therefore, the calibration data are (Qi, Zi, Wij) for j = 1, (...), J, and i = n + 1, (...),N, where N = n + m. The basic risk model is
equation M2
where θrisk(·) is an unknown function, and Q is related to (X, Z) via
equation M3
where θcal(·) is an unknown smooth monotone function. We further assume that the within-person random measurement error UQi is nondifferential, and we make the standard assumption that UQi follows a normal distribution with mean zero and variance equation M4. Let Δi be an indicator variable corresponding to subject i, which takes on value 1 if the ith subject belongs to the calibration study and 0 otherwise. Therefore, when Δi = 1 we observe (Qi, Zi, Wij, j = 1, (...), J), and when Δi = 0 we observe (Yi, Qi, Zi) for the ith subject. For identifiability, we require that the distribution of Q given (X, Z) is the same in both primary and calibration data, e.g. the latter is a randomly chosen subset of the primary data.
For the unbiased surrogate variable, there are j = 1, (...), J replicates and we assume Wij = Xi+UWij. Furthermore, we assume that conditional on X, UWij are independent and identically distributed Normalequation M5, and conditional on (X, Z), UQi is assumed to be independent of UWij . Note that the assumptions of normality and independence for the errors UQi and UWij can be relaxed to any bivariate parametric model.
2.3 Distribution of the Unobserved Covariate X
A likelihood based approach requires a distribution for the unobserved covariate X. Since the misspecification of this distribution may result in biased estimates of the quantities of interest, we model the distribution in a flexible semiparametric fashion. We assume that conditional on Zi, μi, and equation M6, and assume that a priori the parameters equation M7 come from a distribution G, that means, equation M8, and a priori G ~ DP0G0), where G0 is the base probability measure defined on (χ, β). Here χ = (−∞,∞) × (0, ∞) and β is the σ−algebra generated by χ. Under G0, [μ|σ2, τ ] ~ Normal(0, τσ2), σ2 ~ IG(aσ, bσ), and the hyperparameter τ ~ IG(aτ, bτ), where IG(aτ, bτ ) denotes the inverse-gamma distribution with mean 1/{bτ (aτ − 1)}. Since for any set A, G(A) has mean G0(A) and variance G0(A){1 − G0(A)}/(α0+1), the parameter α0 plays an important role in determining the concentration of the random process G around the base probability measure G0. The stick-breaking representation of the Dirichlet process indicates that G is almost surely a discrete probability measure with infinite many mass points, which in turn implies that the distribution of X is an infinite mixture of normal distributions, and the parameters of the component distribution come from the base probability measure G0. In addition, the mixing probabilities πk’s are obtained via equation M9, where γk = Beta(1, α0), and equation M10.
2.4 Semiparametric Models for θrisk(•) and θcal(•)
We wish to estimate θrisk(X) in the interval [a, b] and for this purpose we use B-splines. Let equation M11 be a set of cubic spline basis functions for krisk fixed knot points. With p = krisk + 4, our model is
equation M12
Also based on kcal fixed knot points, with q = kcal + 4, we express θcal(·) as
equation M13
where equation M14 is also a set of cubic spline basis. The use of cubic splines implies that the nonparametric functions have continuous second-derivatives. The chosen knot points for the two nonparametric functions could be the same, and consequently the spline basis functions would be the same. In order to estimate θrisk(•) properly, θcal(•) should be a monotonic nondecreasing function of its argument, which is a natural assumption regarding the conditional mean of Q given X. Because the B-spline basis functions are nonnegative, under the monotonicity constraint for θcal(X), the first derivative equation M15 is non-negative if the coefficients αj’s are nondecreasing, i.e., α1 ≤ α2(...) ≤ αq. This property of B-splines has been previously used by Leitenstorfer and Tutz (2007) among many others. Also we chose a moderately large number of knot points which will make the inference insensitive to the location of knots (Ruppert, 2002). Next we describe the method of estimation within the Bayesian paradigm.
Let βT = (β1, (...)p), αT = (α1, (...), αq), [var phi] = ([var phi]1, (...),[var phi]n+m), and define equation M16. The observed data likelihood function is
equation M17
equation M18
equation M19
Before describing prior specification and posterior inference, we would like to mention that one could adopt a complete nonparametric method by modeling the joint distribution of X, Q and Z for Y = 0 and Y = 1 using dependent Dirichlet process prior. However, that method would involve multidimensional Dirichlet process integrals which is quite challenging in terms of computation. Therefore, to reduce the dimensionality and computational complexity of the problem we model the distribution of X conditional on Z where Z is always observed without any error.
3.1 Specification of Priors
In order to avoid oversmoothing due to large number of knot points in the spline models, following Ruppert et al. (2003) we use the following prior distribution for α and β, equation M20, where δα and δβ are the two penalty parameters corresponding to α and β. Note that the priors are proper for any choice of V > 0, and we set V = 108. Also, Δ represents the first order difference operator, i.e., Δαj = αj − αj−1, and Δ2αj = αj − 2αj−1 + αj−2. We assume δα ~ Gamma(aα, bα) and δβ ~ Gamma(aβ, bβ). For equation M21, we use IG(aQ, bQ) and IG(aW, bW) priors respectively. For each component of ζcal, ζrisk, ζx we use a Normalequation M22 prior.
As mentioned earlier, for equation M23 and a priori G ~ DP0G0). The Dirichlet process prior involves two parameters G0 and α0. Note that the tuning parameter of the Dirichlet process prior α0 has a dual role. On the one hand it determines how concentrated G is around G0, and on the other hand it determines the number of clusters, i.e., the number of distinct mass points of G. Hence the choice of α0 is an important one. Following Escobar and West (1995) one may put a prior on α0. However, we will estimate α0 empirically following an existing characterization of its maximum likelihood estimator (McAuliffe et al, 2006). In summary, for our proposed method one needs to specify the following quantities: V, aα, bα, aβ, bβ, aQ, bQ, aW, bW, equation M24 and G0.
3.2 Posterior Inference
Our primary goal is to estimate ζrisk, ζcal, β, α, equation M25. Posterior means are used as the parameter estimates. As we can see the posterior means cannot be calculated analytically, we propose an efficient posterior sampling algorithm for simulating random numbers from the posterior distributions, and the details are given in the Web Appendix. At each MCMC iteration we estimate α0 empirically by solving the following equation:
equation M26
where the right hand side of (4) is obtained by taking average of k, the number of distinct [var phi]i’s, of the last 30 Gibbs sampling iterations, i.e., we take equation M27.
Note, as [var phi]1, (...),[var phi]N are independent and identically distributed observations from G, and G has a DP0G0) prior, the posterior distribution of G given [var phi]1, (...),[var phi]N is again a equation M28, Thus for a future realization equation M29. Therefore, the predictive distribution of [var phi]N+1|[var phi]1, (...),[var phi]N is equation M30.
4.1 Background
In this illustration of our methodology, we consider the association between percentage of non-alcohol energy from total fat and the risk of invasive breast cancer (Y) in the NIH-AARP cohort. Here are some details about this study.
During 1995–1996 3.5 million baseline questionnaires were mailed to current members of the AARP (formally the American Association of Retired Persons), aged 50–71 years, and 567,169 members completed them satisfactorily (Schatzkin et al, 2001). Of these, the investigators excluded an additional withdrawal (n=1), duplicate records (n=179), subjects who moved out of the eight states included in the study before returning the baseline questionnaire (n=321), or were found to have died before study entry (n=261). After these exclusions, the NIH-AARP baseline cohort included 566,407 persons that we considered in our analysis. Thiébaut et al. (2007) analyzed this cohort with non-alcohol energy from total fat as well as fatty acid as the main exposure variables, in addition to several potential confounders. In our illustration, we use non-alcohol energy from total fat as the main exposure and body mass index (BMI) measured in the metric unit kg/m2 as the error-free covariate Z. We considered only postmenopausal women who did not report a personal history of any cancer, did not develop breast cancer in situ during the follow-up and whose BMI values were not missing. Of the remaining 144, 366 subjects we excluded those women who reported extreme values (i.e., more than two interquartile range above the 75th percentile or below the 25th percentile on the logarithmic scale) for total fat, energy, or percentage of non-alcohol energy from total fat, thus leaving us with 142, 364 subjects, of whom 2, 724 women developed invasive breast cancer during the follow-up period. The average follow-up time for these subjects was 4.52 years. Based on the daily alcohol consumption in grams, total fat intake in grams, and total energy intake in Kcals, the percentage of non-alcohol energy from total fat was calculated by using the following formula
equation M31
where energy from total fat was obtained by multiplying total daily consumption of fat by 9 Kcals, and non-alcohol energy was calculated by subtracting energy from alcohol from total energy. Energy from alcohol was computed by multiplying alcohol consumption measured in grams by 7 Kcals. The percentage of non-alcohol energy from total fat measured from the FFQ was treated as the surrogate measurement Q of X.
The NIH-AARP calibration sub-study was designed to calibrate the FFQ used in the main study with two non-consecutive 24-hour recall telephone interviews as the unbiased surrogate measure. While there is concern that 24-hour recalls may not be actually unbiased for true intake, these are the data we have and their use can be justified as a so-called alloyed gold standard. Out of the 1953 subjects of both genders we considered only the 919 women who completed two 24-hour recall interviews and whose BMI values were not missing. Note that these women were all postmenopausal and cancer free at the start of the study and they were a random sample from the main study population. The percentage of non-alcohol energy from total fat obtained from the two 24-hour recalls is considered as the unbiased surrogate, W. We defined the response variable Y as one if a person develops invasive breast cancer during the follow-up period and zero otherwise. Before the analysis we transformed BMI by dividing by 10, and X was the logarithm of the percentage of non-alcohol energy from total fat. In the preliminary analysis we fit model (1) without any adjustment for measurement error using the primary data. The result showed that X was significantly positively associated with the risk of invasive breast cancer with an odds ratio 1.176 with a 95% confidence interval (1.016, 1.363) and p–value= 0.030. Also, we found that BMI was positively associated with the disease with an odds ratio 1.069 with a 95% confidence interval (1.008, 1.135) and p–value= 0.026.
4.2 Semiparametric Results
First, we used our semiparametric Bayesian (SPB) approach. We experimented with 15, 11, 9, and 6 knot points, and the results were fairly insensitive to the number of knot points. Therefore here we discuss the results for 6 knot points: (2.9, 3.0, 3.2, 3.4, 3.6, 3.80), and we used cubic B-splines, so there were p = 10 spline basis functions. We used the same knot points for θrisk(·) and θcal(·). For δα and δβ we used Gamma(0.001, 1000) priors. We used IG(25, 0.25) prior for equation M32, and Normal(0, 102) prior for ζrisk, ζcal and ζx. For the distribution G0 of the Dirichlet process prior, the priors chosen were τ ~ IG(2.1, 2.0) and σ2 ~ IG(2.1, 0.91). The prior parameters were chosen in such a way that the priors cover a wide range of values for the parameter of interest. We ran the Gibbs sampling for 20,000 iterations and we discarded the first 5000 iterations as burn-in samples. The choice of the initial parameters is an important step for proper and timely convergence of the MCMC method, and in the next section we discuss this issue in detail. The solid lines of Figure 1 show the estimated θrisk(·) along with a 95% credible interval. Overall, the method shows that the risk of having breast cancer increases with the logarithm of the percentage of non-alcohol energy from total fat. Figure 2 shows the estimated θcal(·) function along with 95% credible interval. This figure also shows the scatter plot of (Q[zeta]calZ, w) from the calibration data. Both figures clearly indicate that there is some sort of nonlinearity in both θrisk(·) and θcal(·). Table 1 presents the posterior mean, posterior standard deviation, and 95% credible intervals for ζrisk, ζcal, ζx, equation M33.
Figure 1
Figure 1
Plot of the estimated θrisk(X) along with a pointwise 95% credible interval using the semiparametric Bayesian approach (SPB, solid lines) and the parametric Bayesian approach (PRB, dashed lines) approach.
Figure 2
Figure 2
This figure shows the estimated θcal(X) using the semiparametric Bayesian (SPB) method, along with a 95% pointwise credible interval. This figure also contain the scatter plot of (Q[zeta]calZ, w) of the (more ...)
Table 1
Table 1
Results of the NIH-AARP data analysis by the semiparametric Bayesian (SPB) and parametric Bayesian (PRB) methods.
Using the Dirichlet process prior we found that the average number of mixing components is 2.11 with a posterior standard deviation 0.33, indicating that the true distribution of X given Z is not a normal distribution but a mixture of normals which is also evident from the histogram of a MCMC sample of X values (Figure 3). Figure 4 shows the histogram of α0, the tuning parameter of the Dirichlet process prior. Web Figure 2 shows a plot of the number of clusters of the Dirichlet process process prior in the MCMC samples. Following a referee’s comment we also have added a summary of cluster sizes using boxplot based on the MCMC sample (Web Figure 3) when the number of clusters are 2, 3, or 4. Web Figure 4 shows an estimated equation M34 based on the MCMC sample of [var phi]N+1.
Figure 3
Figure 3
Histogram of X obtained from the MCMC sample using our semiparametric Bayesian (SPB) method along with the density plot of a normal distribution with the same mean and variance as of X.
Figure 4
Figure 4
An empirical distribution of α0 of the Dirichlet process prior which is determined according to Equation (4) in each MCMC samples of the semiparametric Bayesian (SPB) method.
The null hypothesis of interest is that there is no risk of fat on breast cancer: under this null hypothesis θrisk should be constant and thus it is equivalent to test β1 = (...) = βp as equation M35. Suppose we consider M MCMC samples taking every 300th observation after the burn-in period. Then the test statistic for testing the null hypothesis is F = [(Mp + 1)/{(M −1)(p−1)}]T2, where equation M36. Here Dkj = βkj − βpj, βkj is the jth value of βk out of M observations, equation M37. Under the null hypothesis, for sufficiently large M, the statistic follows an F(p−1),(M−p+1) distribution. The observed value of the test statistic was 14.78 with a p-value 2.01 × 10−10. Thus, we rejected the null hypothesis at 1% level of significance, and concluded that θrisk was not a constant function. Before conducting this test we made sure that there is no significant correlation between Dkj and equation M38for any k. Note that if X changes from 3.0 to 3.1, according to our semiparametric Bayesian method, θrisk(3.0) = −4.226 and θrisk(2.9) = −4.223, and thus the risk increases by 0.35%. In contrast, θrisk(3.4) = −4.113, and θrisk(3.3) = −4.162 thus the risk increases by 5.03% for changing X from 3.3 to 3.4, which signifies that the rate of change of the relative risk is not constant in our method. In the SPB method, the estimated odds ratio for BMI is 1.057 with a 95% credible interval (1.021, 1.087) for an increase of 10kg/m2 BMI. In addition, following a referee’s comment we fitted model (1) by replacing X by X = [eta w/ hat]0 + [eta w/ hat]1Q + [eta w/ hat]3Z, where [eta w/ hat]0, [eta w/ hat]1, [eta w/ hat]3 are the MLE of the simple linear regression of the average of W on Q and Z based on the sub-study. The estimated odds ratio is 1.36 for one unit change in X, with a p-value= 0.03, or more specifically the risk increases by 3.04% for 0.1 unit increase in X for all values of X.
4.3 Parametric Calibration Model
One of the unique features of our approach is the semiparametric model for the calibration function given in (3), where θcal(·) is an unknown smooth monotone function. In addition, we model the distribution of X in a semiparametric manner. Here we compare the results to those that assume this calibration function is linear in the unmeasured X, and that the unmeasured X is normally distributed given Z. With a slight abuse of terminology we will refer this method as the Parametric Bayes (PRB) approach. In the PRB method we modeled the surrogate variable as Qi = ψ01Xi+Ziζcal+UQi, UQi ~ Normalequation M39 and assumed that Xi ~ Normalequation M40. All other components and assumptions remained the same as the SPB approach. The parameters were estimated by MCMC. For this method we used the same prior that we had used in the SPB method. For ψ0, ψ1, and μ0 we used Normal(0, 102) priors, and used IG(2.1, 0.91) prior for equation M41. The posterior mean, standard deviation and 95% credible intervals for the parameters are given in Table 1. It is obvious that the estimate of equation M42 is significantly larger in the PRB approach compared to the SPB method, which is an indicator that possibly E(Q|X, Z) is better explained by the partly linear function θcal(X) + Zζcal than a linear function of X and Z. The dashed lines in Figure 1 show the estimated θrisk for the Parametric Bayes approach along with a 95% credible interval. For this method, using the above F-test we again concluded that θrisk is not a constant function in the interval of interest.
4.4 Model Checking
It is possible to compare the SPB and PRB models, i.e., to compare whether either θcal(·) is nonlinear or the distribution of X is non-normal. In order to compare these two models we calculated the marginal likelihood under each model. Suppose ΘT = (θT,[var phi]) is the set of all parameters, then the marginal likelihood under the Semiparametric Bayes model is equation M43, where the expressions for f(Yi, Qi|Zi; Θ) and f(Qi, Wij |Zi; Θ) are given in Section 3. Following Newton and Raftery (1994) we estimate the marginal likelihood via
equation M44
where Θ(1), (...)(M) are M draws from the posterior distribution of Θ. Note that f(Yi, Qi|Zi; Θ(m)) = ∫ f(Yi, Qi|Xi, Zi(m))f(Xi|Zi(m))dXi and f(Qi, Wij |Zi; Θ(m))=∫ f(Qi, Wij |Xi, Zi; Θ(m)) f(Xi|Zi, Θ(m))dXi: these integrals were calculated by numerical quadrature. In a similar manner we estimated mPRB via mPRB. Finally, we obtain mBSP/ mPRB = 111.21, which suggest that the SPB fits the data far better than does the PRB model.
To test our methodology, we performed a simulation study that has some of the complexities of the NIH-AARP study, in terms of design and nonlinearity. First we created a cohort of size n = 10, 000 with Y, X, Q, W1 and W2. We considered two different cases. In case I we simulated X from the Normal(γ0 = 3.5, σX = 0.5) distribution and in case II X was simulated from (1/2)Gamma(65, 0.0455) + (1/2)Normal(3.8, 0.12). The second case clearly indicates that the distribution of X is not normal. For each case, after simulating X, we generated the unbiased variable W as Wij = Xi + UWij, where equation M45, j = 1, 2. We took two different values for σW, 0.2 and 0.5, where the value 0.5 mimics the large measurement error situation common in nutritional epidemiology. The surrogate variable Q was generated as Qi = θcal(Xi) + UQi, equation M46 and we used two different values for σQ, 0.1 and 0.2. We considered a situation where θcal(X) = 1.25 + 0.6X, a linear function of X and another situation where θcal(X) = 2 + 3 exp{4(X − 3.5)}/[1 + exp{4(X − 3.5)}]. Conditional on X, the binary disease variable Y was generated with success probability Hrisk(X)}, where θrisk(X) = −2.8 + 1.5 exp{10(X − 3.5)}/[1 + exp{10(X − 3.5)}]. On average the above probability resulted in 10% − 12% subjects with Y = 1. To create a calibration data from the same cohort we randomly drew m = 1000 subjects and obtained the variables Q, and the unbiased surrogate variables W1 and W2. Note that for the primary data we only considered (Yi, Qi), i = 1, (...), 10, 000 whereas for the calibration data we considered (Qj, Wj1, Wj2), j = 1, (...), 1000.
Under each scenario we generated R = 200 data sets, and each simulated data set was analyzed by the SPB approach and by the PRB approach. In both the SPB and PRB approaches, the nonparametric effect of X on the disease risk θrisk was modeled via B-splines with 9 knot points, and the knots were placed at every 10th percentile of the distribution of the average values of W starting with the 10th percentile. One should note the two main differences between the SPB and PRB approach. In the PRB approach X was always modeled as a normal random variable whereas in the SPB method X was modeled as a Dirichlet process mixture of normal random variable. In addition, in the PRB approach, θcal(X) was assumed to be linear in X, whereas in the SPB method we do not assume any specific form for θcal, rather we simply impose the restriction that θcal(X) is a smooth and a monotone nondecreasing function of X. One can of course construct many other methods to be compared with the SPB method. We have chosen to use the PRB approach to investigate the extent to which less flexible models in terms of the distribution of the latent covariate X and the calibration model affect results.
For both the SPB and PRB methods we used a Gamma(0.001, 1000) distribution for δβ. In the analysis, we used IG(25, 0.25) prior for equation M47. The PRB approach involves ψ0 and ψ1, and we put a Normal(0, 102) prior on each of them. Also, in the PRB approach we used Normal(0, 102) prior for μ0 and IG(2.1, 0.91) prior for equation M48. In the SPB method we need to estimate θcal(X) using B-splines, which involves αj, j = 1, (...), q, and we used a Gamma(0.001, 1000) prior for δα. The previously mentioned knot points are also used for estimating θcal(X). In the Dirichlet process prior we used IG(2.1, 2) and IG(2.1, 0.91) priors for τ and σ2 respectively.
For the initial values of X and α, we first regress (W1 + W2)/2 on Q in the calibration study. Let [eta w/ hat]0, [eta w/ hat]1 and equation M49 be the estimated intercept, slope parameter, and the residual variance of the regression obtained using the calibration study. For the true exposure variable we set the initial value of Xi = [eta w/ hat]0 + η1Qi, for i = 1, (...), n and set Xi = (Wi1 + Wi2)/2 for i = (n + 1), (...), (n + m). The initial value of σ was obtained by fitting penalized nonparametric regression through {Qi, (Wi1 + Wi2)/2} using the calibration data. We set the initial penalty parameters δα = 0.005 and δβ = 0.005.
The performance of the methods were assessed via the following two statistics. The major concern of the paper is how well we estimate the risk function θrisk(X). We considered a set of grid points from 2.8 to 4.2 with 0.01 increment. Letting gj be the jth grid point, for the sth data set, we estimated θrisk(gj) by [theta w/ hat]s,risk(gj) and let equation M50. We computed integrated square bias as equation M51 and integrated mean square error as equation M52. In addition to the above methods, each data set was analyzed by the naive method where we estimated θrisk(X) using the data on (Yi, Qi) and assuming Xi = Qi. For this method we modeled θrisk(X) via B-splines and estimated β and δβ using the logistic likelihood in a Bayesian framework.
The results of the simulation study are presented in Table 2. From the results one can quickly make the following important observations. For both the SPB and PRB methods, bias increases with σQ and σW. The value of integrated squared bias due to PRB method is significantly higher than the SPB approach when the underlying model assumptions of the PRB approach are violated. When E(Q|X) is a linear function of X, and X follows a normal distribution, the performance of the SPB and PRB methods are equivalent for moderate values of σW. Note that the integrated squared bias of the naive method is significantly larger than that for the SPB method. Also we found that except the situation when θcal(X) is a nonlinear function of its argument and X is generated from a mixture distribution, the bias of the PRB approach is smaller than that for the naive method. Generally the variance of the SPB method is higher than the PRB approach, and the variance of the naive approach is smaller among all three approaches.
Table 2
Table 2
Results of the simulation study based on 200 simulations.
Along with the proposed method each data set was analyzed by the regression calibration technique assuming X was a linear function of Q. As expected from Carroll, et al. (2006), the method behaved very badly in comparison to the SPB and PRB methods, and we do not present the results here.
In this paper, we considered the logistic regression when an error-prone covariate X is modeled nonparametrically, while exactly measured covariates Z are modeled parametrically, the result being a partially linear model. The logistic framework could be readily generalized to any distributional model.
There are two additional important and novel components to our approach. First, we recognize that in current practice, the surrogate Q for the unobserved X is not unbiased for X, and hence we have a situation where the classical measurement error model does not hold. We model the relationship of Q to (X, Z) also in a partially linear fashion, while forcing the nonparametric function θcal(X) for X to be monotone.
Finally, in a small subset of individuals, a so-called calibration study, we observed replicated unbiased versions W of X. We then model the distribution of the unobserved X nonparametrically through a Dirichlet process. As stated in the introduction, this appears to be the first paper in semiparametric measurement error models where the two functions, one of them monotone, are estimated jointly nonparametrically, while the distribution of X is modeled nonparametrically.
The simulation study shows that our method generally vastly outperforms the regression calibration method. In terms of integrated mean square error our SPB method shows much better performance than the less flexible PRB approach. The simulation study also indicates that the prices of no model assumption (naive method) and wrong model assumption (PRB method) could be significant. Instead of a single covariate with nonlinear effect, our method could also be used for the generalized additive model with multiple covariates.
One of the challenging parts of the method is the computation which requires appropriate choice of the initial values of the parameters and proposal distributions of the Metropolis-Hastings algorithm for the Gibbs sampling which are discussed in the appendix. The computation were done using Fortran and R, and the computer code is available at ~sinha/research.html.
The research of Mallick and Carroll was supported by grants from the National Cancer Institute (CA57030, CA104620) and in part by Award Number KUS-CI-016-04, made by King Abdullah University of Science and Technology (KAUST).
Supplementary Materials
Web Appendix, and Figures referenced in Sections 1, 3, and 4 are available under the Paper Information link at the Biometrics website
Contributor Information
Samiran Sinha, Department of Statistics, Texas A&M University, College Station, Texas 77843-3143, sinha/at/
Bani K. Mallick, Department of Statistics, Texas A&M University, College Station, Texas 77843-3143, bmallick/at/
Victor Kipnis, Biometry Research Group, Division of Cancer Prevention and Control, National Cancer Institute, Bethesda, Maryland, kipnisv/at/
Raymond J. Carroll, Department of Statistics, Texas A&M University, College Station, Texas 77843-3143, carroll/at/
  • Berry SM, Carroll RJ, Ruppert D. Bayesian smoothing and regression splines for measurement error problems. Journal of the American Statistical Association. 2002;97:160–169.
  • Carroll RJ, Ruppert D, Tosteson TD, Crainiceanu C, Karagas MR. Nonparametric regression and instrumental variables. Journal of the American Statistical Association. 2004;99:736–750.
  • Carroll RJ, Ruppert D, Stepanski LA, Crainiceanu C. Measurement Error in Nonlinear Models: A Modern Perspective. Second Edition. New York: Chapman and Hall; 2006.
  • Carroll RJ, Hall P. Optimal rates of convergence for deconvolving a density. Journal of the American Statistical Association. 1988;83:1184–1186.
  • Delaigle A, Hall P. Using SIMEX for smoothing parameter choices in errors in variables problems. Journal of the American Statistical Association. 2008;103:280–287.
  • Escobar MD, West M. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association. 1995;90:577–588.
  • Fan J, Truong YK. Nonparametric regression with errors in variables. Annals of Statistics. 1993;21:1900–1925.
  • Gilks WR, Wild P. Adaptive rejection sampling for Gibbs sampling. Applied Statistics. 1992;41:337–348.
  • Holmes CC, Mallick BK. Generalized nonlinear modeling with multivariate free-knot regression splines. Journal of the American Statistical Association. 2003;98:352–368.
  • Johnson BA, Herring AH, Ibrahim JG, Siega-Riz AM. Structured measurement error in nutritional epidemiology: applications in the pregnancy, infection, and nutrition (PIN) study. Journal of the American Statistical Association. 2007;102:856–866. [PMC free article] [PubMed]
  • Kipnis V, Midthune D, Freedman LS, Bingham S, Schatzkin A, Subar A, Carroll RJ. Empirical evidence of correlated biases in dietary assessment instruments and its implications. American Journal of Epidemiology. 2001;153:394–403. [PubMed]
  • Kipnis V, Subar AF, Midthune D, Freedman LS, Ballard-Barbash R, Troiano R, Bingham S, Schoeller DA, Schatzkin A, Carroll RJ. The structure of dietary measurement error: results of the OPEN biomarker study. American Journal of Epidemiology. 2003;158:14–21. [PubMed]
  • Leitenstorfer F, Tutz G. Generalized monotonic regression based on B-splines with an application to air pollution data. Biostatistics. 2007;8:654–673. [PubMed]
  • Mallick BK, Gelfand AE. Semiparametric errors-in-variables models: A Bayesian approach. Journal of Statistical Planning and Inference. 1996;52:307–321.
  • McAuliffe JD, Blei DM, Jordan MI. Nonparametric empirical Bayes for the Dirichlet process mixture model. Statistics and Computing. 2006;16:5–14.
  • Müller P, Roeder K. A Bayesian semiparametric model for case-control studies with errors in variables. Biometrika. 1997;84:523–537.
  • Newton MA, Raftery AE. Approximate Bayesian inference with the weighted likelihood bootstrap. Journal of the Royal Statistical Society, Series B. 1994;56:3–26.
  • Ruppert D. Selecting the number of knots for penalized splines. Journal of Computational and Graphical Statistics. 2002;11:735–757.
  • Ruppert D, Wand MP, Carroll RJ. Semiparametric Regression. Cambridge University Press; 2003.
  • Schatzkin A, Subar AF, Thompson FE, Harlan LC, Tangrea J, Hollenbeck AR, Hurwitz PE, Coyle LSNMDS, Freedman LS, Brown CC, Midthune D, Kipnis V. Design and serendipity in establishing a large cohort with wide dietary intake distributions. American Journal of Epidemiology. 2001;154:1119–1125. [PubMed]
  • Thiébaut ACM, Kipnis V, Chang S-C, Subar AF, Thompson FE, Rosenberg PS, Hollenbeck AR, Leitzmann M, Schatzkin A. Dietary fat and postmenopausal invasive breast cancer in the National Institutes of Health – AARP diet and health study cohort. Journal of the National Cancer Institute. 2007;99:451–462. [PubMed]
  • Wood S, Kohn R. A Bayesian approach to robust binary nonparametric regression. Journal of the American Statistical Association. 1998;93:203–213.
  • Wood S, Kohn R, Shivley T, Jiang W. Model selection in spline nonparametric regression. Journal of the Royal Statistical Society, Series B. 2002;64:119–139.