We performed a series of simulation experiments to investigate the performance of the proposed procedure in various settings. We consider the case in which the disease status D is binary (i.e., K = 1). Note that
Thus, the parameters (β0, β, Θ, η) are sufficient to identify pr(D), i.e., κ = β1 is identified from (β0, β, Θ, η). This means that using (5) directly as a likelihood function will be unstable because of over-parametrization. To overcome this, we may re-parametrize in terms of pr(D = 1) through (11) and, in addition, let κ be a function of pr(D = 1). This straightforward re-parametrization resolves the over-parametrization problem.
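The idea of tying the intercept to pr(D = 1) can be illustrated with a minimal sketch. Equation (11) is not reproduced in this section, so the sketch assumes it equates pr(D = 1) with the population average of the logistic risk. The function names, the single additive marker, and the allele frequency of 0.3 are hypothetical choices made only for illustration.

```python
import math

def expit(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def marginal_risk(beta0, linear_terms, weights):
    """Population-average disease probability for a given intercept."""
    return sum(w * expit(beta0 + m) for m, w in zip(linear_terms, weights))

def solve_intercept(pi_d, linear_terms, weights, lo=-30.0, hi=30.0, tol=1e-12):
    """Bisection for the intercept at which the marginal risk equals pi_d.

    Works because the marginal risk is increasing in the intercept.
    """
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if marginal_risk(mid, linear_terms, weights) < pi_d:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

# Hypothetical illustration: one marker coded 0/1/2 under HWE, no environment.
q = 0.3                                            # assumed allele frequency
weights = [(1 - q) ** 2, 2 * q * (1 - q), q ** 2]  # HWE genotype frequencies
linear_terms = [math.log(1.5) * g for g in (0, 1, 2)]
beta0 = solve_intercept(0.01, linear_terms, weights)
```

With the intercept determined by pr(D = 1) in this way, it is no longer a free parameter, which is the sense in which the re-parametrization removes the redundancy.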
The genotype Gi was simulated under Hardy-Weinberg equilibrium (HWE) for i = 1, 2, 3. Given the values of (G, X), we generated a binary disease outcome D using two logistic models, corresponding to the genotype effect model (GEM) and the additive effect model (AEM). For the GEM, covariates are related to the disease via the link function
and the corresponding AEM was obtained by setting the coefficients βDi to 0. Here we omit the subscript k in the regression parameters βs, since there is only one disease level (cases versus normal controls).
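A minimal sketch of this data-generating step follows. The link function itself is not reproduced in this section, so the form of the linear predictor, the additive coding (minor-allele count), the dominance coding (heterozygote indicator), the allele frequency of 0.3, and the intercept of -4 are all assumptions. The sketch draws a population sample and does not implement case-control sampling.

```python
import math
import random

def draw_genotype(maf, rng):
    """Minor-allele count (0, 1 or 2) under HWE: two independent allele draws."""
    return int(rng.random() < maf) + int(rng.random() < maf)

def disease_prob(g, x, beta0, b_a, b_d, b_ax, b_dx, b_x=0.0):
    """Logistic risk; the codings of the genetic components are assumptions."""
    a = g                     # additive component: number of minor alleles
    d = 1 if g == 1 else 0    # dominance component: heterozygote indicator
    eta = beta0 + b_x * x + (b_a + b_ax * x) * a + (b_d + b_dx * x) * d
    return 1.0 / (1.0 + math.exp(-eta))

def simulate(n, maf, beta0, coefs, rng, additive_only=False):
    """Draw (genotype, exposure, disease) triples for one marker."""
    b_a, b_d, b_ax, b_dx = coefs
    if additive_only:         # AEM: dominance coefficients set to 0
        b_d, b_dx = 0.0, 0.0
    rows = []
    for _ in range(n):
        g = draw_genotype(maf, rng)
        x = int(rng.random() < 0.5)          # binary exposure, pr(X = 1) = 0.5
        p = disease_prob(g, x, beta0, b_a, b_d, b_ax, b_dx)
        rows.append((g, x, int(rng.random() < p)))
    return rows

rng = random.Random(1)
# Coefficients of the first setting: (additive, dominance, additive x X, dominance x X)
setting1 = (math.log(1.5), math.log(1.3), math.log(2.5), math.log(3.0))
gem = simulate(20000, 0.3, -4.0, setting1, rng)
aem = simulate(20000, 0.3, -4.0, setting1, rng, additive_only=True)
```

Passing `additive_only=True` mirrors the way the AEM is obtained from the GEM, by zeroing the dominance coefficients while leaving everything else unchanged.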
We considered three settings of the disease risk function. In the first setting, only one marker is involved in the disease. This marker has weak additive and dominance effects (βA1 = log(1.5) and βD1 = log(1.3)), while its interaction with the environment is strong for both the additive and dominance components (βAX1 = log(2.5) and βDX1 = log(3)). In the second setting, in addition to the marker described above, we added a marker with a stronger additive component (βA2 = log(2.2)) and almost no dominance effect (βD2 = log(1.1)); its interaction effects are strong for both components (βAX2 = log(2.2) and βDX2 = log(2.5)). Finally, in the third setting, we added one more marker with strong additive and dominance components as well as strong interaction effects: βA3 = log(2), βAX3 = logO), βD3 = log(2), βDX3 = log(3).
THE DISCRETE CASE
Consider the case in which the disease status and the environmental variables (X, W) are binary. Let pr(X = 1) = 0.5. We simulated the observed environmental variable W using the following misclassification probabilities: pr(W = 1) = 0.25 for the exposed participants and pr(W = 0) = 0.20 for the non-exposed. We also performed a simulation sub-study in which the probability of disease is unknown and is estimated by a grid search. The values of πd are placed on the interval (0.001, 0.04) with step 0.005, and the resulting estimate is the value that maximizes the pseudo-likelihood function. The parameter β0 is estimated by solving equation (11). We simulated 500 samples, each containing 1,000 cases and 1,000 controls. To illustrate the performance and advantages of the proposed method, we report biases and root mean-squared errors (RMSE).
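The grid search and the two summary measures can be sketched as follows. The pseudo-likelihood is not reproduced here, so `toy_objective` is a hypothetical stand-in, and treating the interval as starting at its lower endpoint with equally spaced points is an assumption about how the grid was laid out.

```python
import math

def make_grid(lo, hi, step):
    """Equally spaced candidate values lo, lo + step, ... not exceeding hi."""
    n = int(math.floor((hi - lo) / step + 1e-9)) + 1
    return [round(lo + k * step, 6) for k in range(n)]

def grid_search(objective, grid):
    """Return the grid value at which the objective is largest."""
    return max(grid, key=objective)

def bias(estimates, truth):
    """Mean of the replicate estimates minus the true value."""
    return sum(estimates) / len(estimates) - truth

def rmse(estimates, truth):
    """Root mean-squared error of the replicate estimates."""
    return math.sqrt(sum((e - truth) ** 2 for e in estimates) / len(estimates))

grid = make_grid(0.001, 0.04, 0.005)

def toy_objective(pi_d):
    # Hypothetical stand-in for the profiled pseudo-likelihood, peaked at 0.012.
    return -(pi_d - 0.012) ** 2

pi_hat = grid_search(toy_objective, grid)

# Hypothetical replicate estimates of a coefficient whose true value is 1.0.
replicates = [0.9, 1.1, 1.0, 1.2]
summary = (bias(replicates, 1.0), rmse(replicates, 1.0))
```

In the simulation study the replicate list would hold one estimate per simulated sample (500 in total), and the objective would be the pseudo-likelihood maximized over the remaining parameters at each fixed grid value.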
The table shows simulation results for the case in which three independent genetic markers are observed. The results illustrate that the proposed methodology produces parameter estimates that are nearly unbiased and have small variability. The naive approach, which ignores the measurement error and treats the environment as observed exactly, yields biased estimates whose variability is larger than that of the proposed approach. The simulation results also show that the RMSEs of the coefficients βDi and βDXi are generally larger than those of βAi and βAXi, suggesting that the dominance effect should be included only when the data present strong evidence for it. In the Web Tables, we present the results for the two-marker case under the additive and genotype effect models, respectively. The findings in the Web Tables are similar to those in the main table.
Biases and RMSEs of risk parameters for the naive approach that ignores existence of measurement error and the proposed method in the case when pr(D = 1) is known and when it is estimated
Biases and RMSEs of risk parameters for the naive approach that ignores existence of the LD and the proposed method
The setup described above simulates markers that are in linkage equilibrium. To evaluate the performance of the proposed method when genetic markers are in linkage disequilibrium (LD), we considered the following simulation setup. We simulated the observed genotype according to the following frequencies:
We further compared the performance of the proposed approach with that of the approach that ignores the LD. We found that when the LD is small (e.g., Δ = 0.005), the parameter estimates based on model (5) are nearly unbiased and have small variability (Web Table), because the coefficients capture enough of the LD information. To test the proposed models when a moderate amount of LD is present (Δ = 0.01, 0.02, 0.05), and to compare them with the procedure that ignores the LD, we simulated the observed data using the setup described above with two markers in LD under the AEM. The results illustrate that the naive approach yields parameter estimates that are biased and highly variable, while the proposed method eliminates the bias and substantially reduces the variability of the estimates.
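One common way to generate two markers with a prescribed amount of LD is sketched below. The frequencies actually used in the study are not reproduced in this section, so the sketch assumes the standard two-locus parametrization, in which Δ shifts the haplotype frequencies away from the product of the allele frequencies, together with random pairing of haplotypes. The allele frequencies 0.3 and 0.4 are hypothetical.

```python
import random

def haplotype_freqs(p, q, delta):
    """Haplotype frequencies for two biallelic markers with LD coefficient delta.

    Keys are (allele at marker 1, allele at marker 2), 1 = minor allele.
    """
    freqs = {
        (1, 1): p * q + delta,
        (1, 0): p * (1 - q) - delta,
        (0, 1): (1 - p) * q - delta,
        (0, 0): (1 - p) * (1 - q) + delta,
    }
    if min(freqs.values()) < 0:
        raise ValueError("delta is outside the range allowed by p and q")
    return freqs

def draw_two_marker_genotype(freqs, rng):
    """Pair two independently drawn haplotypes and count minor alleles."""
    haps = list(freqs)
    h1, h2 = rng.choices(haps, weights=[freqs[h] for h in haps], k=2)
    return h1[0] + h2[0], h1[1] + h2[1]

freqs = haplotype_freqs(0.3, 0.4, 0.05)   # assumed allele frequencies
rng = random.Random(3)
sample = [draw_two_marker_genotype(freqs, rng) for _ in range(5)]
```

Setting `delta` to 0 recovers linkage equilibrium, so the same routine covers both the independent-marker and the LD setups under these assumptions.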
Biases and RMSEs of risk parameters for the naive approach that ignores existence of measurement error and the proposed method
In this simulation experiment, we considered a continuous environmental variable. We simulated the true environmental covariate X from a normal distribution with zero mean and variance 0.1. To simulate the observed environmental variable, we used an additive model of the form W = X + U, where U is generated from a normal distribution with zero mean and variance ξ = 0.25. Note that we are simulating a case of large measurement error, to mimic a situation that occurs in practice when measuring diet. To estimate the probability of disease, we used a grid search on the interval (0.001, 0.051) with step 0.005: the pseudo-likelihood function was maximized with the probability of disease fixed at each grid value, and the grid value yielding the largest maximized pseudo-likelihood was taken as the estimate.
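To make the size of the measurement error concrete, the sketch below generates (X, W) pairs from the additive model and computes the noise-to-signal ratio and the reliability ratio implied by the two variances. The function name and the seed are illustrative. The reliability ratio is the attenuation factor of a naive slope in linear regression under classical additive error; for logistic models it is only a rough guide.

```python
import math
import random
import statistics

VAR_X = 0.1    # variance of the true covariate X
VAR_U = 0.25   # variance of the measurement error U

def simulate_error_prone(n, rng):
    """Draw true covariates X and their error-prone versions W = X + U."""
    xs = [rng.gauss(0.0, math.sqrt(VAR_X)) for _ in range(n)]
    ws = [x + rng.gauss(0.0, math.sqrt(VAR_U)) for x in xs]
    return xs, ws

noise_to_signal = VAR_U / VAR_X          # about 2.5
reliability = VAR_X / (VAR_X + VAR_U)    # about 0.29

rng = random.Random(7)
xs, ws = simulate_error_prone(50000, rng)
var_x_hat = statistics.pvariance(xs)     # should be close to 0.1
var_w_hat = statistics.pvariance(ws)     # should be close to 0.1 + 0.25
```

A reliability of roughly 0.29 means that less than a third of the variance of the observed W comes from the true exposure, which is why this setup is a demanding one.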
Within this simulation setup, we suppose that the measurement error distribution is known. The table presents the results for the three-marker case under the additive effect model. We found that our method shows no noticeable bias in the parameter estimates, whereas the naive approach that ignores the measurement error results in substantial bias. The RMSEs of the coefficients βAi are reasonable. However, the RMSEs of βAXi are generally larger, because they are based on a continuous covariate whose noise is 2.5 times more variable than the signal. This is a deliberately stringent test of our method for a continuous environmental covariate, chosen because measurement error is often very large in practice. In the Web Tables, we present the results of the one-marker case for the additive effect model and the genotype effect model, respectively, and of the two-marker case for the additive effect model. The three Web Tables give results similar to those of the three-marker case.
Standard errors (SE) of risk parameters for the proposed approach
Estimates and 95% Wald confidence intervals of parameters for the colorectal adenoma study
To investigate the accuracy of the proposed variance estimator, we performed an additional experiment and report the results in the accompanying table. The results suggest that the mean estimated standard error is nearly unbiased. However, the variability of the parameter estimates is elevated. This phenomenon is well known in the measurement error literature and was noted in our previous work [Lobach et al., 2008]. When the measurement noise is large, as it is in our setting, the sampling distribution of the parameter estimates can be skewed. Hence, we report 5%-trimmed standard errors of the parameter estimates, which are close to the true and mean estimated standard errors.
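A trimmed standard error of this kind can be computed along the following lines. The exact trimming rule is not stated in this section, so the sketch assumes that the lowest and highest 5% of the replicate estimates are discarded before the standard deviation is taken. The function name and the example data are hypothetical.

```python
import math
import statistics

def trimmed_sd(estimates, trim=0.05):
    """Sample standard deviation after dropping a fraction `trim` from each tail."""
    ordered = sorted(estimates)
    k = int(len(ordered) * trim)              # number dropped per tail
    kept = ordered[k:len(ordered) - k]
    return statistics.stdev(kept)

# Hypothetical replicate estimates with a few extreme values in the upper tail.
replicates = [0.1 * i for i in range(97)] + [50.0, 60.0, 70.0]
plain = statistics.stdev(replicates)
trimmed = trimmed_sd(replicates, trim=0.05)
```

With a skewed sampling distribution, a handful of extreme replicates can dominate the untrimmed standard deviation, so the trimmed version gives a more stable picture of the typical spread.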