PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of nihpaAbout Author manuscriptsSubmit a manuscriptNIH Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
 
Genet Epidemiol. Author manuscript; available in PMC Dec 1, 2011.
Published in final edited form as:
PMCID: PMC3021130
NIHMSID: NIHMS262474
Genotype-Based Association Mapping of Complex Diseases: Gene-Environment Interactions with Multiple Genetic Markers and Measurement Error in Environmental Exposures
Irvna Lobach,1 Ruzone Fan,2,3,4* and Raymond T. Carroll2
1Division of Biostatistics, New York University, School of Medicine, New York, New York
2Department of Statistics, Texas A&M University, College Station, Texas
3Department of Epidemiology, MD Anderson Cancer Center, University of Texas, Houston, Texas
4Division of Cancer Control and Population Sciences, Surveillance Research Program, National Cancer Institute, Rockville, Maryland
*Correspondence to: Ruzong Fan, Department of Statistics, Texas A&M University, 447 Blocker, College Station, TX 77843-3143. rfan/at/stat.tamu.edu
With the advent of dense single nucleotide polymorphism genotyping, population-based association studies have become the major tools for identifying human disease genes and for fine gene mapping of complex traits. We develop a genotype-based approach for association analysis of case-control studies of gene-environment interactions in the case when environmental factors are measured with error and genotype data are available on multiple genetic markers. To directly use the observed genotype data, we propose two genotype-based models: genotype effect and additive effect models. Our approach offers several advantages. First, the proposed risk functions can directly incorporate the observed genotype data while modeling the linkage disequihbrium information in the regression coefficients, thus eliminating the need to infer haplotype phase. Compared with the haplotype-based approach, an estimating procedure based on the proposed methods can be much simpler and significantly faster. In addition, there is no potential risk due to haplotype phase estimation. Further, by fitting the proposed models, it is possible to analyze the risk alleles/variants of complex diseases, including their dominant or additive effects. To model measurement error, we adopt the pseudo-likelihood method by Lobach et al. [2008]. Performance of the proposed method is examined using simulation experiments. An application of our method is illustrated using a population-based case-control study of association between calcium intake with the risk of colorectal adenoma development.
Keywords: gene-environment interactions, EM-algorithm, errors in variables, linkage disequilibrium, pseudo-likelihood, semi-parametric methods
Case-control studies are widely used to detect gene-environment and gene-gene interactions in the etiology of complex diseases, such as cancer, hypertension, and diabetes. Many variables of interest to biomedical researchers are very difficult to measure on the individual level and oftentimes uncertainty associated with the observed values cannot be avoided in practice. Measurement error causes bias in gene-environment parameter estimates, thus masking key features of data and leading to loss of power and spurious/masked associations [Lobach et al., 2008]. Loss of power prevents the ability to detect important relationships among variables [Carroll et al., 2006]. Nutrition—defined broadly to indicate diet, body size, physical activity—is likely to be causally related to cancer [Schatzkin et al., 2009]. Nevertheless, nutritional epidemiology of cancer remains problematic, largely because of persistent concerns that standard instruments measure diet and physical activity with too much error. For example, in large epidemiologic studies of impact of diet on development of a disease, nutrient intake is commonly measured using the food frequency questionnaire (FFQ). It is well known that the FFQ as a measure of long-term diet is subject both to biases and random errors [Subar et al., 2003].
With the advent of dense single nucleotide polymorphism (SNP) genotyping, population-based linkage disequilibrium (LD) mapping or case-control association studies have become the major tools for identifying human disease genes and for the fine gene mapping of complex traits [Hinds et al., 2005; The International HapMap Consortium, 2003, 2005, 2007; The International SNP Map Working Group, 2001]. LD information reflects structure of the genome and hence provides valuable opportunity for mapping genetic variants responsible for complex diseases. The availability of millions of SNPs greatly facilitates detection of genetic variants of complex diseases using case-control association studies and provides a unique opportunity to increase resolution of the fine genotype mapping. In the meantime, association studies are challenging because of high false-positive rate and computational demands needed to handle massive number of SNPs. Moreover, it is believed that multiple genetic variants and environmental factors interactively cause complex traits [Chatterjee and Carroll, 2005; Lobach et al., 2008; Mukherjee and Chatterjee, 2008; Spinka et aL, 2005]. We are interested in building statistical models that relate genetic variations to complex phenotypes and permit detection of interaction between genetic variants and environmental factors in the difficult but common situation when the environmental factors are measured with uncertainty.
The following important issues are likely to arise in the analysis of population-based case-control association studies: (1) Genetic markers are usually typed locus by locus and oftentimes moderate to high LD exists between the observed markers. The available genetic data are in the form of unphased genotypes and hence the haplotype-based approach requires haplotype-phase estimation; (2) Some of the environmental factors that are likely to interact with the genetic traits cannot be measured directly, and instead only surrogates are available; i.e. measurement error exists in the environmental factors. To be concrete, let us describe the association study of colorectal adenoma that motivated development of our models. The Colorectal Adenoma Study was designed to investigate the interaction between the dietary calcium intake and genetic variants in the calcium-sensing receptor (CaSR) region [Peters et al., 2004]. In this study, the dietary calcium intake is not measured directly. Instead, it is estimated from a baseline FFQ. In addition, genotype information is available on three non-synonymous SNPs in the CaSR region. To detect the interaction of dietary calcium intake and the CaSR genes, two challenges exist: (1) to utilize the FFQ to measure dietary calcium intake is prone to both bias and random error, what needs to be corrected by building appropriate models and (2) to effectively analyze the observed genotype data taking into account LD information between the genetic markers.
The genotype-based and haplotype-based models are the two major approaches widely used in association studies in the presence of LD. The haplotype-based method offers an advantage of modeling LD through the construction of haplotype blocks at the price of computational expense. Lobach et al. [2008] proposed a pseudo-likelihood. method in the case when the environmental factors are prone to error. The haplotype-based approach can be computationally intensive in the case when the number of genetic markers is large and the LD is moderate. Hence, in these cases, the uncertainty associated with the haplotype-based methods can lead to loss of accuracy in risk coefficients estimation [Lin et al., 2002; Marchini et al., 2006; Qin et al, 2002; Stephens et al., 2001; Stephans and Donnelly, 2003].
We propose to develop a genotype-based approach for analysis of case-control studies of gene-environment interactions. A special feature of the proposed method is that the observed genetic information enters the model directly and the LD structure is captured in the regression coefficients. As the basis for estimation and inference, we will use the pseudo-likelihood function developed by Lobach et al. [2008]. The form of this pseudo-likelihood function offers several advantages. One is that it allows to incorporate information about the probability of disease. In epidemiologic studies, a good bound on the probability of disease in a population is generally available. Further, the formulation of the pseudo-likelihood function does not require specification of the distribution of environmental variables measured exactly. These variables include age, ethnicity, BMI and other demographic and clinical measurements. Thus gains in efficiency can be achieved by not having to model a distribution of a multivariate vector of these measurements. The pseudo-likelihood function exploits the gene-environment independence assumption. If the gene-environment independence is not valid in a setting, then a distribution of genotype can be specified within strata defined by the environmental covariate. We propose a risk model that be incorporated in this pseudo-likelihood function that captures the LD structure in the regression coefficients. Hence, the haplotype phase needs not to be estimated, thus reducing computational burden and consequently reducing risk caused by potential bias dueto haplotype-phase estimation.
The organization of the paper is as follows. First, we describe the problem, our methodology and related theoretical results. We then present the results of simulation studies, and we analyze the example discussed above. In Discussion Section, we give concluding remarks. All technical derivations are given in the Appendix A and in the Web Appendix.
MODEL AND NOTATION
Let D be the categorical indicator of disease status. We allow D to have K+1 levels with the possibility of K≥1 to accommodate different subtypes and stages of a disease. Let D = 0 denote the disease-free (control) subjects and D=k, k≥1 denote the diseased (case) subjects of the kth subtype. Suppose the genetic region of interest is spanned by I loci. Let (X,Z) denote all of the environmental (non-genetic) covariates of interest, where X are the factors susceptible to measurement error and Z are additional environmental factors measured without error. Given the environmental covariates X and Z and genotype data G = (G1,G2,…,GI), the risk of the disease in the underlying population is given by the polytomous logistic regression model
equation M1
(1)
Here, mk(·) is a known function parameterizing the joint risk of the disease from G, X, and Z in terms of the odds-ratio parameters β. Assume that all markers are di-allelic, e.g. SNPs. Under the Hardy-Weinberg equihbrium (HWE) assumption, the distribution of the marker genotype can be specified in a parametric form pr(G) = pr(G; Θ), where Θ = (PMi,i = 1,2,…,I) are the allele frequencies. The formulation of our model is general enough to account for deviations from the HWE. For the ith marker, denote the two alleles by Mi and mi, with frequencies PMi and Pmi. Define the following dummy variables
equation M2
(2)
Notice that Ai+1 is the number of allele Mi at the ith marker, and hence Ai can be used to model the allele or additive effect of Mi. In the following, we provide two examples of function mk(·) using the genotype information G = (G1,G2,…,GI). Denote A = (A1,…,AI) and B = (B1,…,BI)
Genotype effect model (GEM)
The following specification of the risk function incorporates both additive and dominance effects of genotype, as well as the multiplicative gene-environment.
equation M3
(3)
In this formulation, the regression coefficients βkAi and βkAi model risk due to the additive and dominance effect, respectively [Fan et al., 2006; Fan and Xiong, 2002]. The remaining terms capture the multiplicative gene-environmental interaction.
Additive effect model (AEM)
Suppose that the dominance effect is not significantly present in the model (3). In this situation, the risk function takes the following form.
equation M4
(4)
The difference between “genotype effect model” (3) and “additive effect model” (4) is that dominance effect and related gene-environmental interactions are not modeled in (4). The number of parameters in function (3) can be significantly larger than that of function (4). In practice, additive effect model (4) can be advantageous over the genotype effect model (3) because of the smaller number of parameters in (4). This situation may occur when the dominance effect is not significantly present or the dominance effect cannot compensate for the increase in the number of parameters in (3).
The model (1) cannot be used directly for analysis since the covariate X is measured with error. Let W denote the error-prone version of X. We assume a parametric model of the form fmem(w|D, G, X, Z; ξ) for the conditional distribution of W given disease-status D, marker genotype G, the true exposure X, and additional environmental factors Z. Measurement error can be modeled both as differential and non-differential. If measurement error can be assumed to be non-differential by disease status, then one can simplify the model as fmem(w|D, G, X, Z;ξ) = fmem(w|G, X, Z; ξ), what does not depend on D. We assume that the joint distribution of the environmental factors in the underlying population can be specified according to a semi-parametric model of the form fX,Z,(x, z) = fX(x|z;η)fZ(Z), where fZ(Z) is left completely unspecified.
Theoretical justification provided in the Appendix proves that the risk functions (3) and (4) are valid for analysis of case-control association studies in the case when genetic markers are in the LD. Briefly we illustrated that (1) the LD is being modeled in the regression coefficients, and (2) if there is no association between observed genotype and trait locus, then all regression coefficients of Ai and Bi are zeros and so the regression does not depend on the markers.
SEMI-PARAMETRIC INFERENCE BASED ON A PSEUDO-LIKELIHOOD
Let n0 be the number of control subjects; and for k≥1, denote by nk the number of subjects in the sample with disease at a stage k. Let n = n0+n1+…nK be the total number of subjects in the sample. In addition, let us denote πk = pr(D = k), k = 0,1, 2,…,K. Consider a sampling scenario where each subject from the underlying population is selected into the case-control study using a Bernoulli sampling scheme, where the selection probability for a subject given his/her disease status D = k is proportional to μk = nkk. In addition, assume that the sampling only depends on the disease status, and so the selection of a subject is independent of the subject's marker information and environmental covariates.
Let R = 1 denote the indicator of whether a subject is selected in the sample. For the ith subject, let us denote by (Di, Gi, Xi, Wi, Zi Ri) the observed values of variables D, G, X, W, Z and R. Let us denote κk = βk0+log(nk/n0)−log(πk0) and κ̃ = (κ1,…,κK)T. In addition, let beta0 = (β10,…,κK0)T, equation M5, B = (ΩTT)T and ν = (ηTT)T. Define
equation M6
We assume that G and (X, Z) are independently distributed in the underlying population. Only changes in notation are needed to model genotype, and environment within strata thus relaxing gene-environment independence assumption. We suppose that the type of genetic covariate measured does not depend on the individual's true genetic covariate, given disease status, environmental covariates, and the measured genetic information. Further, we suppose that the observed genetic variable does not contain any additional information on disease status and true environmental covariate given the genetic variable of interest.
Similarly to Lobach et al. [2008], we propose to use the following pseudo-likelihood function in place of the likelihood function to estimate the parameters. In the Web Appendix A.1, the pseudo-probability is calculated as
equation M7
(5)
where G is the set of all possible genotypes in the population. Lobach et al. [2008] proved that the maximization of LPseudo, although not the actual retrospective-likelihood for case-control data, leads to consistent and asymptotically normal parameter estimates. Observe that conditioning on Z in LPseudo allows it to be free of the nonparametric density function fZ(Z), thus avoiding the difficulty of estimating potentially high-dimensional nuisance parameters.
ASYMPTOTIC THEORY IN CASE WHEN GENETIC MARKERS ARE INDEPENDENT
In this section, we will provide asymptotic results along the lines of Lobach et al. [2008]. These results are readily applicable to the proposed model in the case when no LD is present between the genetic markers.
Estimation With Known Measurement Error Distribution
Assume that the parameter ξ controlling the distribution of the measurement error is known. We show that maximization of LPseudo leads to consistent and asymptotically normal parameter estimates. Let ψ(k, g, w, z, Ω, η, ξ) be the derivative of log{LPseudo(k, g, w, z, Ω, η, ξ)} with respect to B. Then define
equation M8
where all expectations are taken with respect to the case-control sampling design. The estimation [B with circumflex] = (OT,[eta w/ hat]T)T of B are the solution to
equation M9
(6)
In the Web Appendix A.2, we show the following limiting properties of B.
Theorem 1
The estimating function Ln(Ω, η, ξ) is unbiased, i.e. has mean zero when evaluated at the true parameter values. In addition, under suitable regulatory conditions, there is a consistent sequence of solutions to (6), with the property that
equation M10
Remark 1
An EM algorithm for the estimating the parameters, based along the lines of Lobach et al. [2008] and Spinka et al. [2005], is given in the Web Appendix A.3.
Estimated Measurement Error Distribution
In practice, the parameter ξ controlling the measurement error distribution will be unknown, and typically additional data are necessary to estimate it. Here, we consider the case of additive mean-zero measurement error with replications of W. Our convention is that there are at most J replications of the W for any individual. Let Wi denote this ensemble of the J replicates, and let ti be the number of replicates we actually observe. Let fmem(w|k, g, x, z, j; ξ) be the joint density of the first j replicates for j = 1,…,J; ψ(D, G, W, Z; Ω, η, J), Ij and Λj be the matrices defined above for the case with exactly j replicates for each individual. Assume that ji is independent of (Di, Gi, Xi, Wi, Zi) and that probability to observe j replicates is p(j). Further, define equation M11. The estimating function for B = (ΩT, ηT, ξ)T can be written [Lobach et al., 2008] in the form
equation M12
(7)
The parameter estimates have the following asymptotic properties (Web Appendix A.4).
Theorem 2
The estimating function (7) is unbiased, i.e., has mean zero when evaluated at the true parameter values. In addition, under suitable regulatory conditions, there is a consistent sequence of solutions to (7), with the property that
equation M13
(8)
ASYMPTOTICS IN CASE WHEN GENETIC MARKERS ARE IN LINKAGE DISEQUILIBRIUM
Recall that in the case when genetic markers modeled in the risk function are in the LD, the regression coefficients capture both the association signal and the LD information. In practice, if the LD is present in the genetic material, we suggest to construct and investigate the model based on genetic markers that are known to be associated with disease. The association information can be obtained either based on a priori knowledge (e.g. previously reported studies, biological interpretation), or can be inferred using the observed data (e.g. model selection procedure).
The distribution of the genotype in the population for a pair of markers that are in LD can be written as follows:
equation M14
Define
equation M15
(9)
We propose to estimate parameters (Ω, η, ξ, Δ) based on the pseudo-likelihood function that is of the same form as (5), but is based on the S(k, g, x, z; Ω, Δ) function in the form (9), specifically,
equation M16
(10)
Justification of the risk model and regression coefficients presented in the first section of the Appendix A suggests that the LD information between the observed genetic markers and the trait locus is captured in the regression coefficients.
We performed a series of simulation experiments to investigate the performance of the proposed procedure in various settings. We consider a case when the disease status D is binary (i.e. K=1). Note that
equation M17
(11)
Thus, the parameters (β0 = β10, β, Θ, η) are sufficient to identify pr(D = 1), i.e., κ = β1 is identified from (β0, β, Θ, η). This means that simply using (5) as a likelihood function directly will be unstable because of over-parametrization. To overcome this, we may re-parametrize in terms of pr(D = 1) through (11). In addition, let κ be a function of pr(D = 1). This obvious solution can solve the over-parametrization problem.
The genotype G was simulated under HWE for I = 1,2,3. Given the values of (G, X), we generated a binary disease outcome D using two logistic models, corresponding to the GEM and AEM. For the GEM, covariates are related to a disease via link function
equation M18
and the corresponding AEM was obtained by setting coefficients βDi and βDXi to be 0. Here we omit the subscription k in the regression parameters βs since we have one level disease cases and normal controls.
We considered three settings of the disease risk function. In the first setting, only one marker is involved in a disease. This marker has weak additive and dominance effects (βA1 = log(1.5) and βD1 = log(1.3)), while the interaction effect with the environment is strong for both additive and dominance components (βXA1 = log(2.5) and βXD1 = log(3)). In the second setting, in addition to the marker described above, we added one with stronger additive component (βA2 = log(2.2)) and almost no dominance βD2 = log(1.1). Interaction effects of both additive and dominance components are strong (βAX2 = log(2.2) and βDX2 = log(2.5)). Finally, in the third setting, we added one additional marker with strong additive and dominance components in addition to the strong interaction effects βA3 = log(2), βAX3 = logO), βD3 = log(2), βDX3 = log(3).
THE DISCRETE CASE
Consider the case when disease status and environmental variables (X, W) are binary. Let pr(X = 1) = 0.5. We simulated the observed environmental variable W using the following misclassification probabilities. pr(W = 0|X = 1) = 0.25 for the exposed participants and pr(W = 1|X = 0) = 0.20 for the non-exposed. We performed a simulation sub-study when probability of disease is not known and it is estimated via grid-search method. The values of πd are set to be on interval (0.001, 0.04) with step 0.005 and the resulting estimate is a value that maximizes the pseudo-likelihood function. Parameter β0 is estimated by solving equation (11). To estimate the parameters, 500 samples are simulated and each sample contains 1,000 cases and 1,000 controls. To illustrate performance and advantages of the proposed method, we presented biases and Root Mean-Squared Errors (RMSE).
Shown in the Table I are simulation results for the case when three independent genetic markers are observed. The results illustrate that the proposed methodology produced parameter estimates that are nearly unbiased and have small variability. The naive approach that ignores existence of the measurement error and pretends that the environment is observed exactly results in biased estimates with variability that is larger than that of the proposed approach. These simulation results illustrate that the RMSE of coefficients βDi; and βDXi are generally larger than those of βAi and βAXi, thus suggesting that the dominance effect should only be used in the situations when the data present strong evidence for the dominance effect. In Web Tables TablesII and andII,II, we present the results for two-marker case, additive and genotype effect models, respectively. Findings shown in the Web Tables TablesII and andIIII are similar to those of the Table I.
TABLE I
TABLE I
Biases and RMSEs of risk parameters for the naive approach that ignores existence of measurement error and the proposed method in the case when pr(D = 1) is known and when it is estimated
TABLE II
TABLE II
Biases and RMSEs of risk parameters for the naive approach that ignores existence of the LD and the proposed method
The setup described above simulates markers that are in linkage equilibrium. To evaluate performance of the proposed method in the case when genetic markers are in the LD, we considered the following simulation setup. We simulated the observed genotype according to the following frequencies:
equation M19
We further compared performance of the proposed approach and the one that ignores existence of the LD. We found that when the LD is small (e.g. 0.005), the parameter estimates based on the model (5) are nearly unbiased and have small variability (Web Table III), because the coefficients capture enough of the LD information. To test the performance of the proposed models in the case when moderate amount of LD is present (Δ = 0.01, 0.02, 0.05) and compare it to the procedure that ignores existence of the LD, we performed the following simulation experiment. We simulated the observed data using a simulation setup described above with two markers that are in LD using AEM. Results presented in the Table II illustrate that the naive approach resulted in parameter estimates that are biased and are highly variable, while the proposed method eliminated bias and substantially reduced the variability of the estimates.
TABLE III
TABLE III
Biases and RMSEs of risk parameters for the naive approach that ignores existence of measurement error and the proposed method
CONTINUOUS SIMULATIONS
In this simulation experiment, we considered a continuous environmental variables. We simulated the true environmental covariate X from a Normal distribution with zero mean and variance 0.1. To simulate observed environmental variables, we used additive model of the form W = X + U, where U is generated from the normal distribution with zero mean and variance ξ = 0.25. Note that we are simulating a case of large measurement error, to mimic a situation that occurs in practice while measuring diet. To estimate the probability of disease, we used grid-search method on the interval (0.001, 0.051) with step 0.005 by maximizing the pseudo-likelihood function for values of probability of disease fixed on a grid and then performing a grid-search to identify the value of probability of disease that maximized the likelihood.
Within this simulation setup, we suppose that the measurement error distribution is known. Table III presents the results of three-marker case under the additive effect model. We found that for our method there is no noticeable bias in parameter estimates, whereas the naive approach that ignores existence of the measurement error results in substantial bias (Table III). The RMSEs of coefficients βAi in Table III are reasonable. However, the RMSEs of βAXi in Table III are generally larger because they are based on the continuous covariate with noise that is 2.5 times more variable than the signal. We are giving a very stringent test to our method in the case when the environmental covariate is continuous because in practice the measurement error is massive. In the Web Tables TablesIVIV and andV,V, we present the results of one marker case for additive effect model and genotype model, respectively; in Web Table VI, we report the results of two-marker case for the additive effect model. The three Web Tables provide similar results as those of Table III.
TABLE IV
TABLE IV
Standard errors (SE) of risk parameters for the proposed approach
TABLE V
TABLE V
Estimates and 95% Wald confidence intervals of parameters for the colorectal adenoma study
To investigate accuracy of the proposed variance estimator, we performed an experiment and reported results in Table IV. The results suggest that the mean estimated standard error is nearly unbiased. However, the variability of the parameter estimates is elevated. This phenomena is well known in the measurement error literature and noted in our previous work [Lobach et al., 2008]. When the measurement noise is large, which is the case in our situation, the sampling distribution of the parameter estimates can be skewed. Hence, we reported 5%-timmed standard errors of the parameter estimates that are close to the true and mean estimated standard errors.
SINGLE MARKER ANALYSIS
Model
To illustrate the application of the proposed method we analyzed the colorectal adenoma study described in the introduction. To recap, there were 772 cases and 778 controls, the response D was colorectal adenoma status, the genetic data observed were three SNPs in the calcium receptor gene CaSR, the environmental variable X measured with error was log(1 +calcium intake), which was measured by W, the result of a FFQ. The variables Z measured without error were age, sex, and race, which are not significant and are not included in the final model. The two alleles at the first SNP are A and G, the two alleles at the second SNP are C and G, and the two alleles at the third SNP are G and T. Let us denote M1 = A, m1 = G, M2 = C, m2 = G, M3 = G and m3 = T; and then define dummy variables Ai accordingly by (2). Based on the observed genetic data only, we estimated the LD measures as follows: ΔM1,M3 = 0.011 for the first and third markers; ΔM2M3 = 0.012 for second and third markers; and ΔM1M2 is approximately zero. Given calcium intake X and genotype information G = (G1, G2, G3), we considered several risk model based on the following strategy. We first analyzed the three observed markers using the risk models based on one marker at a time in the following form
equation M20
and the model is denoted as AEM1, AEM2, and AEM3, respectively.
Measurement Error Modeling
Unfortunately, there is no direct information in the study to assess the measurement error properties of calcium intake W. We used a combination of outside data and sensitivity analysis instead. The outside data came from the WISH Study [Potischman et al., 2002]. There were ≈400 women in this study, which used the same FFQ as in the colorectal adenoma study and also included the results of six 24-hr recall measurements, which we denoted by Tij for the ith individual and jth replicate. The model for these data were that
equation M21
where Ui, = Normal equation M22 and Vij = Normal equation M23. Using variance components analysis, we estimated equation M24, and took these as fixed and known in the colorectal adenoma study, although we also varied equation M25. The distribution of X was taken to be Gaussian with mean linear in Z and variance ξ. We used the method of Fuller 1987, Chapter 2, 5 and found estimates equation M26. To assess sensitivity to the measurement error model specification, we considered several scenarios by imposing measurement error structure estimated using WISH data and varying it through equation M27.
Results
After fitting the three models, AEM1, AEM2, and AEM3, we found significant results for AEM1 and AEM2 based on analysis of parameter estimates and 95% Wald confidence intervals (Table V). For the AEM1, each of the three regression coefficients βA1, βX, and βAX1 was significantly different from 0; and so was each of the three regression coefficients βA2, βX and βAX2 for AEM2 (Table V). Since the estimate −0.478 of parameter βAX1 was negative in AEM1 and so the estimate −0.771 of βAX2 in AEM2, the results suggested protective effect of an interaction between the calcium intake and additive effect of allele M1 = A of the first marker and allele M2 = C of the second marker.
For AEM3, none of the three regression coefficients βA3, βX and βAX3 was significant (data not shown). In addition, we added dominance effect and the related gene-environmental interaction terms to the AEM1, AEM2, and AEM3 to fit the data. No significant result was found for the dominance effect and the related gene-environmental interactions (data not shown). Thus, the additive effect models (AEM1 and AEM2) are enough to fit the data.
MULTIPLE MARKER ANALYSIS
Motivated by the results of single marker analysis, we considered the following risk model that models the effect of a pair of markers 1 and 2
equation M28
which we denoted as AEM12. The results are presented in the Table V. All regression coefficients except βA1 in AEM12 were significantly different from 0. The results of Table V confirmed the protective effects of the allele M1 = A at the first marker and the allele M2 = C at the second marker.
In addition to AEM12 using markers 1 and 2 in analysis, we fitted two more models: (1) AEM13: markers 1 and 3; and (2) AEM23: markers 2 and 3, respectively. The results, however, suggested that the third marker does not provide significant effect on the models AEM13 and AEM23 (data not shown). Hence, markers 1 and 2 are enough to model the association with the disease trait, and marker 3 contributed no significant information in addition to marker 1 or 2.
COMPARISON WITH THE RESULTS OF HAPLOTYPE-BASED APPROACHES
In Lobach et al. [2008], two haplotypes h4 and h5 have protective effects against colorectal tumor development and a haplotype h2 has no significant effect. h4 is haplotype GCG, h5 is a combination of a common haplotype AGG plus three rare haplotypes AGT, GGG, and GCT. Compared with our results that the allele M1 = A at the first marker and the allele M2 = C at the second marker have protective effects, the protective effect of h4 = GCG may be due to the allele M2 = C at the second marker and the protective effect of h5 may be due to the allele M1 = A at the first marker. The alleles at the third marker make no significant contribution to the colorectal tumor development.
In this article, we proposed a genotype-based approach for the analysis of case-control studies of gene-environment interactions in the presence of measurement error in the environmental factors. Two types of risk functions are proposed along the lines of the previous work of the second author: genotype and additive effect models [Fan et al, 2006; Fan and Xiong, 2002]. The genotype effect models capture both the additive and dominance effects, while the additive effect model takes into account the additive effect of genetic markers. The proposed method has several unique aspects. First, the observed genetic information enters the model directly and the pairwise LD structure is captured in the regression coefficients. Compared with the haplotype-based approaches, the proposed models are simpler and the theoretical justification/inference/asymptotics can be simpler too. Subsequently, in practice, the computational burden is less demanding since there is no need to estimate the haplotype phase. Moreover, there is no risk of potential bias due to haplotype phase estimations. The second unique aspect of the proposed method efficiently models the environmental covariates are measured with substantial error. So far, there is no genotype-based methods in literature to deal with the issue and the proposed method can fill the gaps. Similary to the method investigated in Lobach et al. [2008], the estimating procedure is based on a pseudo-likelihood model that allows to efficiently estimate parameters, model environmental covariates completely non-parametrically, and incorporate information about the probability of disease. In epidemiologic studies, the vector of environmental covariates measured exactly is oftentimes high dimensional and a good estimate about probability of disease in a population is known. Thus, the use of the proposed pseudo-likelihood function offers advantages.
Simulation experiments illustrated that the proposed methods can lead to nearly unbiased parameter estimates and the variability is generally low for the additive effect terms. In the genotype effect models, the variability of the dominance effect terms can be slightly elevated. Hence, in the case when dominance effect is moderate or is not significant, the additive effect models are superior to the genotype effect models as we noted in our previous work [Fan et al., 2006; Fan and Xiong, 2002]. In comparison, the naive estimation that ignores existence of measurement error and/or LD results in parameter estimates that are largely biased and variable.
The proposed methods will prove useful when the amount of LD is small or moderate. In this case, the number of possible haplotypes that are consistent with the observed genotype is large and hence the estimating procedure can be computationally intensive. Simulation experiments illustrated that the proposed method resulted in parameter estimates that are nearly unbiased and have small variability in the cases: (1) the LD is small and the LD is only modeled in the regression coefficients; (2) the LD is moderate and the proposed method estimates that LD and risk parameters simultaneously. When the number of markers is much larger than two and the strong LD is present, haplotype-based approaches developed in Lobach et al. [2008] can be more useful. In particular, if a disease is mainly due to one or two haplotypes and the haplotype consists of multiple alleles at different markers and none of the alleles has big effect on the disease, the proposed genotype-based approach can be less powerful while the haplotype-based approach can be more powerful. On the other hand, the proposed models can be powerful when some alleles have strong effect on the disease.
As a result of application of the proposed method to the analysis of colorectal adenoma study, we found that the protective effects of the allele M1 = A at the first marker and the allele M2 = C at the second marker. In Lobach et al. [2008], two haplotypes h4 = GCG and h5 (mainly AGG) have protective effects against colorectal tumor development. Compared with our results, the protective effect of h4 may be due to the allele M2 = C at the second marker and the protective effect of h5 may be due to the allele M1 = A at the first marker. Therefore, the proposed approach has the ability to identify important alleles and markers, which have significant contribution to the colorectal tumor development.
The proposed risk model is readily expendable for the analysis of gene-gene interactions in the case when the genetic markers involved in an interaction are independent. For the colorectal adenoma study data, we fit the model by adding the product terms A1A2 of dummy variables A1 and A2 but no significant gene-gene interaction effect was found.
ACKNOWLEDGMENTS
The research of Fan was supported by the National Cancer Institute grant R01-CA133996 from MD Anderson Cancer Center (P.I. C. Amos), where Fan spent his sabbatical during the 2008–2009 academic year; and a Research and Travel Support from the Intergovernmental Personnel Act (IPA), National Institutes of Health, in 2010. The research of Carroll was supported by a grant from the National Cancer Institute (R37-CA057030).
Contract grant sponsor: National Cancer Institute; Contract grant numbers: R01-CA133996; R37-CA057030; Contract grant sponsor: National Institutes of Health.
APPENDIX A
JUSTIFICATION OF THE MODEL AND REGRESSION COEFFICIENTS
To justify that the proposed risk functions (3) and (4) are valid for analysis of case-control association studies in the case when genetic markers are in the LD. Consider a situation when genetic markers are in the HWE. Suppose that only one trait locus Q affects the disease status, and there are two alleles Q1 and Q2 at the trait locus. Let qi be the frequency of allele Qi i = 1,2. Let equation M29 be the log ratio of the disease given genotype Qi,Qj and environmental covariates (X, Z), i,j = 1,2. Denote equation M30 and equation M31. As traditional quantitative genetics, the average effect of gene substitution is αQk = αk + (q2q1) dk i.e. the difference between the average effects of the trait locus alleles, and dominance deviation is δQk = 2dk [Falconer and Mackay, 1996], Let equation M32 be the population mean effect. It can be shown that
equation M33
(A.1)
where
equation M34
Hence, the genotypic value equation M35 can be expressed by a linear combination of μ(k), AQ, and BQ. Suppose that the marker M1 coincides with the trait locus Q, and marker allele Mi is the trait allele Q1 and marker allele mi is the trait allele Q2-Then after a linear transformation, the relation (A.1) can be re-expressed by a linear combination of μ(k), A1 = AQ, and B1 = BQ (actually, it can be seen that AQ+2q1 = A1+1).
First, denote for k = 1, …,K
equation M36
(A.2)
If no covariates are considered, it can be shown that the LD information between trait locus Q and markers M1 is contained in the probabilities pr(D = k, Q1Qj, G = g). To see this, let hdip = (hi, h2) denote a phased haplotype of an individual at the markers. Notice
equation M37
and the probability equation M38 contains the LD information between trait locus Q and markers Mi. Here, the subscript hdip[set membership]G denote all haplotype phases, which are consistent with the genotype G. If the covariates are considered, then
equation M39
and the probability equation M40 contains the LD information between trait locus Q and markers Mi. Thus, Yk(G, X, Z) is a function of the LD information between trait locus Q and markers Mi, and it contains association information between the trait locus and the markers.
Denote the pairwise measure of LD between marker Mi and marker Mj by ΔMiMj = P(MiMj) – PMiPMj,i<j, i, j = 1, …, I Let the additive and dominance variance-covar-iance matrices of the indicator variables defined in (2) be (second section of the Appendix A)
equation M41
(A.3)
Define
equation M42
Given X and Z, the coefficients of model (A.2) are derived as (third section of the Appendix A)
equation M43
(A.4)
The form of Cov(Yk, A IX, Z) and Cov(Yk, BIX, Z) is given in the Appendix A.3.
Since Yk(G, X, Z) contains association information between the trait locus and the markers, Equations (A.4) show that the measures of LD are contained in the mean coefficients. Model (3) simultaneously captures the LD and the effects of the trait locus.
Given X and Z, assume that QiQj and G are independent or, equivalently, there is no association between the trait and markers. Then Yk does not depend on G since
equation M44
Therefore, Cov(Yk, Ai|X, Z) - 0 and Cov(Yk Bi|X, Z) = 0 for all markers i = l,…,I. Hence, the regression coefficients of Ai and Bi are all zero, and the function mk(·) does not depend on the marker genotype data.
In summary, we illustrated that (1) the LD is being modeled in the regression coefficients, and (2) if there is no association between observed genotype and trait locus, then all regression coefficients of Ai and Bi, are zeros and so the regression does not depend on the markers.
DERIVATION OF VARIANCE-COVARIANCE MATRICES (A.3)
Similarly to the Appendix A, Fan and Xiong [2002], the following expectations, variance and covariances can be derived accordingly: E(Ai) = PMiPMi, E(Bi) = 0, Var(Ai) = 2PMjMi, equation M45, Cov(Ai, Aj)=2ΔMiMj, equation M46, i,j = 1, …,I, Ij; in addition, Cov(Ai, Bj) = 0 for all i and j.
Define A = (A1,…, AI) and B = (B1,…, B1). Further, let VA be the I × I matrix with diagonal elements 2PMjmi and off-diagonals 2ΔMiMj. Similarly, let VD be the I×I matrix with diagonal elements equation M47 and off-diagonals equation M48. Note that VA and VD are the covariance matrices of the additive and dominance effects, respectively. Let Oi be a I × I matrix with zero elements. Then based on the expectations and covariances described above,
equation M49
PROOF OF (A.4)
Based on the definition of Yk and the form of the covariance matrices (A.3), it can be easily seen that for j = 1,…,I
equation M50
Equation (A.4) readily follows. To calculate the covariance between Yk and Ai, Bi let us denote G(Mi;Mi) = [(Gi,…, Gi, MiMi, Gi+1, …, G1): Gj[set membership](MjMj, Mjmj, mjmj), ji] and similarly we may define G(Mimi) and G(mimi). Using these notations, we may calculate
equation M51
equation M52
equation M53
Footnotes
Additional Supporting Information may be found in the online version of this article.
  • Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM. Measurement Error in Nonlinear Models. 2nd edition Chapman & Hall, CRC Press; London, Boca Raton: 2006.
  • Chatterjee N, Carroll RJ. Semiparametric maximum likelihood estimation in case-control studies of gene-environmental interactions. Biometrika. 2005;92:399–418.
  • Fan R, Xiong M. High resolution mapping of quantitative trait loci by linkage disequilibrium analysis. Eur J Hum Genet. 2002;10:607–615. [PubMed]
  • Fan R, Jung J, Jin J. High resolution association mapping of quantitative trait loci, a population based approach. Genetics. 2006;172:663–686. [PubMed]
  • Fuller WA. Measurement Error Models. Wiley; New York: 1987.
  • Hinds DA, Stuve LL, Nilsen GB, Halperin E, Eskin E, et al. Whole-genome patterns of common DNA variation in three human populations. Science. 2005;307:1072–1079. [PubMed]
  • The International HapMap Consortium The International HapMap Project. Nature. 2003;426:789–796. [PubMed]
  • The International HapMap Consortium A haplotype map of the human genome. Nature. 2005;437:1299–1320. [PMC free article] [PubMed]
  • The International HapMap Consortium A second generation human haplotype map of over 3.1 million SNPs. Nature. 2007;449:851–861. [PMC free article] [PubMed]
  • The International SNP Map Working Group A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature. 2001;409:928–933. [PubMed]
  • Lin S, Cutler DJ, Zwick ME, Chakravarti A. Haplotype inference in random population samples. Am J Hum Genet. 2002;71:1129–1137. [PubMed]
  • Lobach I, Carroll RJ, Spinka C, Gail MH, Chatterjee N. Haplotype-based regression analysis of case-control studies with unphased genotypes and measurement errors in environmental exposures. Biometrics. 2008;64:673–684. [PMC free article] [PubMed]
  • Marchini J, Cutler D, Patterson N, Stephens M, Eskin E, Halperin E, Lin S, Qin Z, Munro HM, Abecasis GR, Donnelly P., for the International HapMap Consortium A comparison of phasing algorithms for trios and unrelated individuals. Am J Hum Genet. 2006;78:437–450. [PubMed]
  • Mukherjee B, Chatterjee N. Exploiting gene-environment independence for analysis of casecontrol studies: an empirical Bayes-type shrinkage estimator to trade-off between bias and efficiency. Biometrics. 2008;64:685–694. [PubMed]
  • Peters U, Chatterjee N, Yeager M, Chanock SJ, Schoen RE, McGlynn KA, Church TR, Weissfeld JL, Schatzkin A, Hayes RB. Association of genetic variants in the calcium-sensing receptor with risk of colorectal adenoma. Cancer Epidemiol Biomarkers Prev. 2004;13:2181–2186. [PubMed]
  • Potischman N, Coates RJ, Swanson CA, Carroll RJ, Dating JR, Brogan DR, Gammon MD, Midthune D, Curtin J, Brinton LA. Increased risk of early stage breast cancer related to consumption of sweet foods among women less than age 45. Cancer Causes Control. 2002;13:937–946. [PubMed]
  • Schatzkin A, Subar AF, Moore S, Park Y, Potischman N, Thompson FE, Leitzmann M, Hollenbeck A, Morrissey KG, Kipnis V. Observational epidemiologic studies of nutrition and cancer: the next generation (with better observation) Cancer Epidemiol Biomarkers Prev. 2009;18:1026. [PMC free article] [PubMed]
  • Spinka C, Carroll RJ, Chatterjee N. Analysis of case-control studies of genetic and environmental factors with missing genetic information and haplotype-phase ambiguity. Genet Epidemiol. 2005;29:108–127. [PMC free article] [PubMed]
  • Stephans M, Donnelly P. A comparison of Bayesian methods for haplotype reconstruction from population genotype data. Am J Hum Genet. 2003;73:1162–1169. [PubMed]
  • Stephens M, Smith NJ, Donnelly P. A new statistical method for haplotype reconstruction from population data. Am J Hum Genet. 2001;68:978–989. [PubMed]
  • Subar AF, Kipnis V, Troiano RP, Midthune D, Schoeller DA, Bingham S, Sharbaugh CO, Trabulsi J, Runswick S, Ballaard-Barbash R, Sunshine J, Schatzkin A. Using intake biomarkers to evaluate the extent of dietary misreporting in a large sample of adults: the Observing Protein and Energy Nutrition (OPEN) study. Am J Epidemiol. 2003;158:1–13. [PubMed]
  • Qin ZS, Niu T, Liu JS. Partial-ligation-expectation-maximization for haplotype inference with single nucleotide polymorphisms. Am J Hum Genet. 2002;71:1242–1247. [PubMed]