PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of plosgenPLoS GeneticsSubmit to PLoSGet E-mail AlertsContact UsPublic Library of Science (PLoS)View this Article
 
PLoS Genet. Jan 2012; 8(1): e1002482.
Published online Jan 26, 2012. doi:  10.1371/journal.pgen.1002482
PMCID: PMC3266891
A Flexible Bayesian Model for Studying Gene–Environment Interaction
Kai Yu,1* Sholom Wacholder,1 William Wheeler,2 Zhaoming Wang,1,3 Neil Caporaso,1 Maria Teresa Landi,1 and Faming Liang4
1Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, Maryland, United States of America
2Information Management Services, Rockville, Maryland, United States of America
3Core Genotyping Facility, SAIC Frederick, National Cancer Institute–Frederick, Frederick, Maryland, United States of America
4Department of Statistics, Texas A&M University, College Station, Texas, United States of America
Nicholas J. Schork, Editor
University of California San Diego and The Scripps Research Institute, United States of America
* E-mail: yuka/at/mail.nih.gov
Conceived and designed the experiments: KY FL. Analyzed the data: KY WW. Contributed reagents/materials/analysis tools: KY ZW WW NC MTL FL. Wrote the paper: KY SW FL.
Received June 1, 2011; Accepted November 30, 2011.
An important follow-up step after genetic markers are found to be associated with a disease outcome is a more detailed analysis investigating how the implicated gene or chromosomal region and an established environment risk factor interact to influence the disease risk. The standard approach to this study of gene–environment interaction considers one genetic marker at a time and therefore could misrepresent and underestimate the genetic contribution to the joint effect when one or more functional loci, some of which might not be genotyped, exist in the region and interact with the environment risk factor in a complex way. We develop a more global approach based on a Bayesian model that uses a latent genetic profile variable to capture all of the genetic variation in the entire targeted region and allows the environment effect to vary across different genetic profile categories. We also propose a resampling-based test derived from the developed Bayesian model for the detection of gene–environment interaction. Using data collected in the Environment and Genetics in Lung Cancer Etiology (EAGLE) study, we apply the Bayesian model to evaluate the joint effect of smoking intensity and genetic variants in the 15q25.1 region, which contains a cluster of nicotinic acetylcholine receptor genes and has been shown to be associated with both lung cancer and smoking behavior. We find evidence for gene–environment interaction (P-value = 0.016), with the smoking effect appearing to be stronger in subjects with a genetic profile associated with a higher lung cancer risk; the conventional test of gene–environment interaction based on the single-marker approach is far from significant.
Many common diseases result from a complex interplay of genetic and environmental risk factors. It is important to study the potential genetic and environmental risk factors jointly in order to achieve a better understanding of the mechanisms underlying disease development. The standard single-marker approach that studies the environmental risk factor and one genetic marker at a time could misrepresent the gene–environment interaction, as the single genetic marker might not be an appropriate surrogate for the underlying genetic functioning polymorphisms. We propose a method to look at gene–environment interaction at the gene/region level by integrating information observed on multiple genetic markers within the selected gene/region with measures of environmental exposure. Using data collected in the Environment and Genetics in Lung Cancer Etiology (EAGLE) study, we apply the proposed model to evaluate the joint effect of smoking intensity and genetic variants in the 15q25.1 region and find evidence for gene–environment interaction (P-value = 0.016), with the smoking effect varying according to a subject's genetic profile.
Genome-wide association studies that focus on detecting the main effect from individual single nucleotide polymorphisms (SNPs) have successfully identified more than 4,000 SNPs associated with different diseases [1]. To achieve a better understanding of the mechanisms underlying disease development, it is of great interest to follow up those genetic findings with more detailed analyses investigating how the gene and environment interact in their influence on disease risk. One popular approach aims at detecting SNP-environment interaction between individual SNPs and established environmental risk factors [2], [3], [4]. One of the few successes for this approach is the interaction detected between cigarette smoking and two genetic variants, a NAT2 tagging SNP and a GSTM1 deletion, in a multi-stage genome-wide association study (GWAS) of bladder cancer [3].
The standard approach to the study of gene–environment joint effect inspects one marker at a time, assuming that a single marker is the functional unit in the gene and environment interplay. This single-marker approach could misrepresent and underestimate the genetic contribution to the joint effect when one or more functional loci, some of which might not be genotyped, exist in the region, and interact with the environment risk factor in a complex way. A more global approach that simultaneously considers all genetic markers might capture more of the genetic variation within the entire targeted region, and provides a better opportunity to reveal complicated gene–environment interactions [5]. The global approach would be more informative if it has the capability showing how an environmental effect varies according to a subject's genetic profile.
We provide a flexible Bayesian modeling framework for the study of gene–environment joint effects. We consider a case-control study with genotypes G at a set of SNPs within a given region and a measurement for the environment exposure E available for each subject. We seek to identify a latent genetic profile variable L that classifies the multilocus genotype G into different categories (clusters) such that subjects with their genotype assigned to the same genetic profile category share the same disease risk model, which is a standard logistic regression model with its own intercept term and slope. The intercept term represents the baseline log odds, common for subjects sharing the same genetic profile. The slope represents the effect (i.e., log odds ratio) of the environment risk factor for subjects with the given genetic profile. The model that we try to build and make inferences from is essentially the logistic regression model consisting of L and E as main effects and their product as an interaction term; the unusual aspect is that the definition of the latent genetic profile L is a priori unknown. To account for the uncertainty in the cluster assignment underlying the definition of L, we adopt an idea from the hidden Markov model originally developed for modeling the spatial heterogeneity of the disease event rate, observed on a predefined set of areas [6]. In this Bayesian model approach, Green and Richardson tried to allocate areas into a number of clusters and assumed a common disease rate for areas assigned to the same cluster. The mechanism for the area allocation was modeled through the Potts model [7], which favors probabilistically those allocation patterns where “neighboring” areas are assigned to the same cluster. Note that the spatial dependence assumption is generally appropriate in situations where the event rate is expected to take on similar values in neighboring areas. To draw the connection, we can think of each type of observed multi-locus genotype G as an “area”. We would like to use the Potts model to guide the cluster assignment through a certain level of “spatial” dependence, i.e., similar genotypes (nearby areas) tend to be assigned to the same cluster, as in other applications in genetics studies, including the study of haplotype association [8], [9].
We use the Markov chain Monte Carlo (MCMC) sampling method (e.g., [10], [11]) to fit the proposed model, incorporating several recent advances in the MCMC methodology. We adopt a recently developed algorithm [12] to update the regulating parameter in the Potts model, which has an intractable normalizing constant, and cannot be handled by the standard Metropolis Hastings algorithm. This algorithm allows us to consider the parameter of interest on its original continuous scale and obviates the need for a finite number of selected grids with their normalizing constants pre-calculated, a strategy taken by Green and Richardson [6]. To identify the optimal genetic profile assignment, we use an ensemble averaging method that aggregates different cluster assignments generated by the MCMC samplers into a consensual one. We find that this cluster algorithm works quite well in simulation studies. A similar idea has been used by Liang [13] and Molitor et al [14] in different contexts. We also propose a resampling-based test based on the fitted Bayesian model that can be used to formally test for the existence of gene–environment interaction.
We apply the proposed method to study the joint effect of cigarette smoking intensity and genetic variants in chromosome region 15q25.1 using data from EAGLE, a population-based case-control study conducted in Italy [15]. Cigarette smoking is an established major risk factor for lung cancer. Besides environmental exposures, recent GWAS identified a few chromosome regions (e.g., chromosomes 15q25.1, 5p15, and 6p21) harboring genetic variants underlying a susceptibility for lung cancer [15], [16], [17]. In particular, the chromosome 15q25.1 region, which includes the CHRNA5-CHRNA3-CHRNB4 cluster of cholinergic nicotinic receptor subunit genes, has been shown to be associated with both lung cancer and smoking behaviors, such as cigarette smoking intensity [18], [19], [20], [21], [22]. Although there is no evidence suggesting the existence of multiple loci in this region independently contributing to lung cancer susceptibility in populations of European ancestry [16], it does appear that there are multiple independent loci within 15q25.1 affecting smoking intensity [19]. The main goal of our analysis is to evaluate whether the effect of smoking intensity varies with genetic variants in 15q25.1. Our analysis finds evidence for gene–environment interaction, with the relative risk for smoking appearing to be stronger in subjects with a genetic profile associated with a higher lung cancer risk. The proposed resampling-based test derived from the fitted Bayesian model also detects significant gene–environment interaction (P-value = 0.016). On the other hand, the standard single-marker approach that aims at detecting the interaction between a SNP and smoking intensity fails to reveal any evidence of interaction, with the smallest observed nominal P-value being 0.021 among the 32 testing SNPs, and the adjusted P-value based on the permutation test being 0.29.
We first introduce the Bayesian model and describe the MCMC algorithm for fitting this model. Next we provide procedures for posterior inference using samples generated by the MCMC sampler, including a method for deciding the number of clusters and a method for identifying the optimal cluster assignment once the number of clusters is determined. We validate the proposed method using simulated data. We then apply the method to study the gene–environment joint effect using data generated from the EAGLE lung cancer case-control study.
The Bayesian Model Setup
Assume we have data collected from a case-control study, with An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e001.jpg cases, An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e002.jpg controls. Let An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e003.jpg be the total number of subjects in the study. For the ith subject, we denote its observation by An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e004.jpg, where An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e005.jpg for a case, 0 for a control; An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e006.jpg is the observed exposure for the environment risk factor of interest; An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e007.jpg represents measures on a set of covariates; and An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e008.jpg represents multilocus genotypes observed on a set of SNPs in a pre-specified region. In the following discussion, we use the term genotype to refer to the multilocus genotype observed on all considered SNPs within the targeted region. We intend to develop a model for the G-E joint effect that permits G-E interaction. More specifically, we assume the true underlying risk model has the following form:
A mathematical equation, expression, or formula.
 Object name is pgen.1002482.e009.jpg
(1)
where clusters 1 to K represent a partition of the genotype space; An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e010.jpg is the intercept term representing the common baseline log odds for subjects with their genotypes in cluster k; An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e011.jpg is the effect of E (in term of log odds ratio) in the disease model for cluster k, An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e012.jpg; and An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e013.jpg is the vector of coefficients for the set of covariates X and is constant regardless of a subject's genotype. Notice that if the partition of the joint genotype space is known a priori, we can derive the corresponding K-category genetic profile variable L based on the cluster assignment. The above model (1) is then essentially the standard logistic regression model consisting of L and E as main effects and their product as the interaction term, with adjustment for X, and has the following form:
A mathematical equation, expression, or formula.
 Object name is pgen.1002482.e014.jpg
Thus it is clear that there is no G-E interaction if An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e015.jpg, and the interaction exists if otherwise.
In real applications, we do not know a priori the partition of the genotype space. If G consists of just one SNP, the goal can be achieved easily by using a saturated logistic regression including both E and G (as a three-level categorical variable) as the main effects and their product as the interaction term. For situations where G consists of multiple SNPs (e.g., more than 10), as in the case of the EAGLE lung cancer study, we propose the following Bayesian model that simultaneously searches for the optimal partition of the genotype space and estimates the unknown parameters in the corresponding risk model (1).
The Bayesian model is built up in a hierarchic framework. We first describe our model by assuming K, the total number of clusters, is known. We will describe how to choose K later. Suppose there are H types of genotype configurations observed in the sample, labeled as genotype 1, 2, …, H. We define the latent genotype allocation vector An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e016.jpg, with An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e017.jpg, being the cluster assignment for genotype h, An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e018.jpg. For subject i, we denote its genotype id by An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e019.jpg. Given the allocation vector An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e020.jpg and the set of coefficients An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e021.jpg for the disease model (1), the probability of subject i having the disease outcome is
A mathematical equation, expression, or formula.
 Object name is pgen.1002482.e022.jpg
(2)
In the above model specification, we use the prospective likelihood function (2) for observed case-control data, which were collected under a retrospective sampling scheme given the disease outcome. The use of the prospective likelihood function can be partially justified by the general results from Staicu [23] and Seaman and Richardson [24]. They showed the equivalence of prospective and retrospective analysis in the Bayesian framework in the sense that both approaches could yield the identical marginal posterior distribution of the log odds ratio under analyses with properly specified priors. In model (2), the effect of E varies with G. Thus we call it the Bayesian risk model allowing for G-E interaction. As a comparison in the analysis, we also consider a model assuming a homogeneous effect from E, which is defined as
A mathematical equation, expression, or formula.
 Object name is pgen.1002482.e023.jpg
(3)
We call this model the Bayesian risk model without G-E interaction. In what follows, we will describe methods for fitting model (2), the one allowing for G-E interaction. Similar procedures can be applied to model (3).
To model the distribution of the allocation vector z, we first choose a similarity metric to define the spatial contiguity between any two genotypes. Let J denote the total number of considered SNPs within the region, with the genotype at a given SNP being coded as 0, 1, or 2 according to the number of copies of the minor allele. Let genotypes h and An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e024.jpg have the configurations An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e025.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e026.jpg, where An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e027.jpg is the genotype at the jth SNP for the multilocus genotype h. We first define the distance between them as
A mathematical equation, expression, or formula.
 Object name is pgen.1002482.e028.jpg
where An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e029.jpg is the variance for the genotype at SNP j observed in the sample, with An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e030.jpg being the genotype at SNP j for subject i, and An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e031.jpg. Then we define An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e032.jpg if An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e033.jpg is among the 4 (distinctive) genotypes closest to genotype h, and An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e034.jpg is among the 4 genotypes closest to genotype An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e035.jpg; An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e036.jpg if An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e037.jpg (or h) is among the 4 genotypes closest to genotype h (or An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e038.jpg) but this is not true in both cases; and An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e039.jpg for all other cases.
We model z with the Potts model, which has a regulating parameter An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e040.jpg governing the level of spatial dependency in the cluster assignment. The Potts model has the following form:
A mathematical equation, expression, or formula.
 Object name is pgen.1002482.e041.jpg
where An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e042.jpg, with An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e043.jpg being the indictor function, i.e., An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e044.jpg if An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e045.jpg and 0 otherwise, and where
A mathematical equation, expression, or formula.
 Object name is pgen.1002482.e046.jpg
is the log normalizing constant. Under the Potts model with An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e047.jpg, the cluster assignments are allocated independently for different genotypes. When An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e048.jpg, the cluster assignments for two neighboring genotypes h and An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e049.jpg (i.e., two genotypes with An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e050.jpg) are correlated. The level of correlation (spatial dependence) increases with An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e051.jpg. For example, under the genotype configuration observed in the EAGLE study and An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e052.jpg, the average probability that any two neighboring genotypes are allocated to the same cluster is 0.5 when An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e053.jpg. It increases to 0.83, and 0.97 for An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e054.jpg and 1.2, respectively. More discussions of the Potts model can be found in [6].
We need to specify our prior models for An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e055.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e056.jpg. In this paper, we consider the normal distribution with a mean of 0 and a variance of 4 or the uniform distribution on the interval of An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e057.jpg as the prior for each parameter in An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e058.jpg. We describe the appropriateness of those priors for the prospective likelihood model in the Discussion Section. Both priors are very uninformative and generate similar conclusions on the EAGLE study and simulated datasets. Therefore we present only results based on the normal prior in the following discussions. Following Green and Richardson [6], the prior distribution An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e059.jpg for An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e060.jpg is set to be a uniform distribution on the interval An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e061.jpg, which covers an appropriate region of An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e062.jpg such that the resulting class of Potts models are flexible enough to capture a wide range of spatial dependence. We note that An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e063.jpg cannot be too large. If An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e064.jpg is over a critical point, the corresponding Potts model would essentially force almost all elements into the same cluster, a well known phenomenon for the Potts model called phase transition property [25], and in this situation, the MCMC simulation tends to get stuck. We did some experiments to explore the setting of An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e065.jpg for the Potts model based on the neighborhood configuration observed in the EAGLE study. We found the value An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e066.jpg induces a high level of spatial dependence, with the average probability that any two neighboring genotypes are allocated to the same cluster being 0.97 at An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e067.jpg; and when An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e068.jpg, the average probability goes to 0.99, which indicates an extremely high level of spatial dependence for the Potts model. Based on these observations, we decided to set An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e069.jpg in our EAGLE study application, as well as in simulation studies that assume the same neighborhood structure as the EAGLE study. We consider only a uniform prior for An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e070.jpg since in practice we usually do not know which level of spatial dependence is more likely than the others. But the algorithm described below can certainly be used with other prior functions if necessary.
Putting all the foregoing models together, we can express the joint distribution of all variables as
A mathematical equation, expression, or formula.
 Object name is pgen.1002482.e071.jpg
where An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e072.jpg. The inference (for a fixed total number of clusters K) on An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e073.jpg, and z can be based on the following MCMC algorithm.
The MCMC Algorithm
Updating coefficients An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e074.jpg
The full conditional function for coefficients An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e075.jpg in the risk model can be written as
A mathematical equation, expression, or formula.
 Object name is pgen.1002482.e076.jpg
(4)
We can use the standard Metropolis-Hastings (MH) steps to update An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e077.jpg, conditioned on the current values of other parameters. The detailed algorithm is given in Text S1.
Updating the allocation vector An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e078.jpg
Following Green and Richardson [6], we can update the allocations z using a Gibbs kernel; that is, for the genotype h, its cluster assignment is updated by drawing from the following full conditional distribution,
A mathematical equation, expression, or formula.
 Object name is pgen.1002482.e079.jpg
(5)
where An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e080.jpg is the sum of similarity scores between the genotype h and other genotypes currently assigned to cluster k.
Since the sampling space for z is discrete, the standard Gibbs sampler can be improved by the Metropolized Gibbs sampler [26]. Thus we choose this sampler for updating the allocation vector. A summary of the algorithm is given in Text S1.
Updating the regulating parameter An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e081.jpg
The regulating parameter An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e082.jpg has the following full conditional distribution:
A mathematical equation, expression, or formula.
 Object name is pgen.1002482.e083.jpg
(6)
If the standard MH algorithm is used, updating An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e084.jpg would involve the evaluation of the normalizing constant An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e085.jpg for the Potts model, which is prohibitive when the dimension of z is large. Green and Richardson [6] chose to restrict An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e086.jpg to a pre-specified finite set of values; they used the thermodynamic integration approach [27] to estimate An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e087.jpg for a given value of K. Those estimates were then used in the MCMC sampler. The estimate of An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e088.jpg at pre-specified grid points might lead to biased Monte Carlo estimates of An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e089.jpg and other parameters.
Here we propose to use the recently developed Monte Carlo Metropolis-Hastings algorithm (MCMH) [12] to sample An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e090.jpg from An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e091.jpg. This new algorithm replaces the ratio of normalizing constants at any two values of An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e092.jpg by a Monte Carlo estimate, which is obtained through a set of m auxiliary samples, in the MCMC iterations, thus allowing us to consider An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e093.jpg on its original continuous scale instead of on a finite number of pre-specified points. As shown in [12], this algorithm ensures that the Monte Carlo estimate of the parameter will converge to its posterior mean. In our numeric experiments, we find it is appropriate to choose the number of auxiliary samples m to be between 50 and 100. A summary of the algorithm for updating An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e094.jpg is given in Text S1.
Posterior Inference
In our simulation studies and the real data application, we find the MCMC algorithm generally converges after 100.000 iterations. Below we describe a procedure for determining the number of clusters, and an ensemble averaging method for the identification of the cluster assignment based on the MCMC samples.
Determining the number of clusters
We choose to use the deviance information criterion (DIC) proposed by Spiegelhalter et al [28] for determining the number of clusters. For a given number of clusters K, define the deviance An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e095.jpg as
A mathematical equation, expression, or formula.
 Object name is pgen.1002482.e096.jpg
We can calculate the posterior expected deviance An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e097.jpg by averaging the deviance calculated at samples of An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e098.jpg generated by MCMC output. We calculate the deviance An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e099.jpg at the posterior mean of the parameters as
A mathematical equation, expression, or formula.
 Object name is pgen.1002482.e100.jpg
where An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e101.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e102.jpg are the posterior means of the coefficients assigned to subject i; An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e103.jpg is the posterior mean for An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e104.jpg. The An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e105.jpg for the model with K clusters is then calculated as
A mathematical equation, expression, or formula.
 Object name is pgen.1002482.e106.jpg
To determine the number of clusters, we run the algorithm with different values of K (e.g., An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e107.jpg, with An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e108.jpg) and compute their DIC values. The DIC criterion favors models with small DIC values. To take the Monte Carlo variation into the consideration, instead of choosing the K with the smallest DIC, we adopt the +1 standard error (SE) rule originally proposed for the tree model selection [29]. To use this rule, we run the MCMC algorithm 20 times, with different random seeds for each considered value of K, and then pick the optimal number of clusters An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e109.jpg as the smallest one such that
A mathematical equation, expression, or formula.
 Object name is pgen.1002482.e110.jpg
(7)
where An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e111.jpg is the average of the values of DIC measured at K over 20 runs, An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e112.jpg, and An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e113.jpg is the Monte Carlo standard error estimated for An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e114.jpg based on 20 runs.
Based on our numerical experiments, we found that the Monte Carlo standard error usually is less than 1 if the MCMC chain converges. So, if there is only one run for each K, we recommend picking An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e115.jpg as the smallest one such that
A mathematical equation, expression, or formula.
 Object name is pgen.1002482.e116.jpg
(8)
We use this rule, hereafter called the +1 rule, to select the optimal number of clusters in simulation studies.
Identifying the cluster assignment
After the determination of the number of clusters An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e117.jpg, it is usually helpful to identify the consensual cluster assignment rule that assigns each genotype to one of the An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e118.jpg clusters. We can also use this partition to assign each subject to one of the clusters based on his or her genotype's assignment. Here we adopt the ideas from Liang [13] and Molitor et al [14] to find such a partition. Based on the samples generated from MCMC runs under the An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e119.jpg model, we let An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e120.jpg be the proportion of times that the genotypes h and An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e121.jpg are assigned to the same cluster. We then use An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e122.jpg as the dissimilarity metric and apply the PAM (partitioning around medoids) method [30] to partition genotypes into An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e123.jpg clusters. Simulation studies presented later show this clustering algorithm works quite well in identifying the appropriate clusters.
A Resampling-Based Test for Gene–Environment Interaction
It is usually desirable to have a formal statistical test or decision rule for inference regarding the presence of an interaction. Here we propose a resampling-based test for this purpose. First we fit model (2), the Bayesian risk model allowing for G-E interaction, under various numbers of clusters. Then we use the +1 rule to identify An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e124.jpg, the optimal number of clusters that is not less than 2, and the corresponding consensual cluster assignment L. We require An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e125.jpg for this interaction test because the interaction test is not defined for An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e126.jpg. We use the maximum likelihood estimate (MLE) to establish the following logistic regression model,
A mathematical equation, expression, or formula.
 Object name is pgen.1002482.e127.jpg
(9)
where An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e128.jpg is the cluster assignment for subject An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e129.jpg, given by the consensual cluster assignment L. This model includes the main effects of L and E, as well as their interactions. We can conduct a likelihood ratio test comparing model (9) with the similar model without the interaction terms and obtain the corresponding “P-value”, denoted by An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e130.jpg, based on the Chi-squared distribution with An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e131.jpg degrees of freedom (df). Clearly, this “P-value” An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e132.jpg tends to overestimate the significance level of the interaction, as the variable An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e133.jpg is data-driven, but a small value for An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e134.jpg provides evidence against the null. We can use An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e135.jpg as the test statistic and apply the following resampling-based procedure to evaluate its significance level.
  • Apply the MCMC procedure to fit model (3), the Bayesian risk model without G-E interaction, on the observed data and identify An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e136.jpg, the optimal number of clusters, using the +1 rule, as well as the corresponding consensual cluster assignment.
  • Use MLE to fit the following logistic regression model based on the observed data,
    A mathematical equation, expression, or formula.
 Object name is pgen.1002482.e137.jpg
    (10)
    where An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e138.jpg is the cluster assignment for subject An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e139.jpg, given by the consensual cluster assignment identified in Step 1, and An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e140.jpg, An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e141.jpg, and An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e142.jpg are the estimated coefficients.
  • Use the model given by (10) to generate B sets of bootstrap null datasets. Each null dataset is a copy of the observed dataset, except the outcome for every subject is regenerated according to the probability model given by (10).
  • For the bth null dataset, An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e143.jpg, obtain the test statistic An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e144.jpg using the same procedure used above for obtaining An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e145.jpg.
  • The estimated P-value for An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e146.jpg is given by An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e147.jpg
In Steps 1 and 2 we establish the Bayesian risk model under the null hypothesis that there is no G-E interaction and the corresponding logistic regression model. We use the fitted logistic regression model (10) to generate multiple null datasets in Step 3 based on the parametric bootstrap procedure [31]. In Step 4, for the bth generated null dataset, we first apply the MCMC procedure to establish the Bayesian model given by (2) and next identify the optimal number of clusters with the +1 rule, as well as the corresponding consensual cluster assignment. Then we fit the corresponding logistic regression model with G-E interaction and obtain the test statistic An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e148.jpg from the likelihood ratio test.
The EAGLE Study
We used data generated by the lung cancer GWAS in the EAGLE study [15] with 1920 lung cancer cases and 1979 population controls as the basis for our simulation studies and real data applications. We focused on the chromosome region 15q25.1 between 76.5 Mb and 76.72 Mb, with the boundary defined by loci where the recombination rate is relatively high. This region covers all replicated loci relating to smoking behavior or lung cancer risk. We have genotypes on 32 SNPs in the region that have a minor allele frequency (MAF) larger than 4% (estimated in 1979 EAGLE control samples). We removed 17 redundant SNPs, leaving a minimal set of 15 SNPs where the pairwise An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e149.jpg was always less than 0.8. We used genotypes on these 15 tagging SNPs to represent each subject's genetic variation pattern in the region. The reason for removing redundant SNPs is to ensure that the similarity measure between any two types of multilocus genotypes is not dominated by a set of SNPs in high linkage disequilibrium. The summary of the 15 chosen tagging SNPs is given in Table 1.
Table 1
Table 1
Summary of 15 tagging SNPs chosen for the EAGLE study.
Simulation Studies: Performance of the Bayesian Model
We conducted simulation studies to evaluate the performance of the proposed method for fitting the Bayesian model allowing for G-E interaction. In the simulation study we were interested in studying the interaction between a binary environment risk factor (An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e150.jpg or 1) and genetic variants (G) within a candidate region. We used genotypes at 15 tagging SNPs (Table 1) in 15q25.1 observed in the EAGLE study to represent the joint genotype distribution for the simulated population, which consisted of 766 distinct multilocus genotypes. We chose the 2nd, 6th, and 10th SNPs listed in the Table 1 as the functional SNPs, and divided the genotype space into the following three regions according to the total number of risk alleles (assuming the minor allele to be the high-risk allele) among the 3 functional SNPs: region I, consisting of genotypes with An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e151.jpg; regions II, consisting of genotypes with An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e152.jpg; and region III, including genotypes with An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e153.jpg. We conducted a principal component (PC) analysis on subjects from the EAGLE study with genotypes at the 15 SNPs as their coordinates. Figure 1 shows how genotypes (subjects) in each of the three regions were distributed in the first 2-PC space, with regions I, II, and III in green, blue, and red, respectively.
Figure 1
Figure 1
The partition of the genotype space in the simulation study.
The disease risk models we considered had the form given by (1). Their definitions are given in Table 2. Under Model An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e154.jpg there was no genetic effect and no interaction between G and E, and thus there was no risk heterogeneity in the genotype space. Under An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e155.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e156.jpg, coefficients An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e157.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e158.jpg had the same clustering pattern. Under models An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e159.jpg, the risk heterogeneity patterns for An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e160.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e161.jpg were not matched, unlike those under model An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e162.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e163.jpg. In model An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e164.jpg, the two clusters defined by An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e165.jpg were region I, and regions II and III combined, while the two clusters defined by the effect of An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e166.jpg were regions I and II combined, and region III.
Table 2
Table 2
List of disease risk models considered in the simulation study for evaluating the Bayesian model.
We assumed that the environmental exposure status E (0 or 1) and G were correlated in the general population. The distribution of E depended on G in the following way: for a subject with genotype in region I, the probabilities of being unexposed (An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e192.jpg) or exposed (An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e193.jpg) were 0.7 and 0.3; for a subject with genotype in one of the other two regions, those probabilities were 0.4 and 0.6 for An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e194.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e195.jpg. Thus the distribution of E was quite different for subjects with different genotypes.
Under each model, we simulated 50 datasets representing a case-control study with 1500 cases and 1500 controls. We ran the MCMC algorithm with 2,000,000 iterations with the first 1,000,000 iterations being discarded. We used an algorithm similar to that described in [32] to simulate the case-control study. Note that under the case-control sampling scheme, we do not need to specify a value for An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e196.jpg. Instead, we just need to know the values of An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e197.jpg, An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e198.jpg, in order to simulate datasets from a case-control study.
For each simulated dataset, we applied our method with An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e199.jpg auxiliary samples, with the number of clusters K ranging from 1 to 8. We used the +1 rule defined by (8) to identify An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e200.jpg, the optimal number of clusters. Table 3 provides a summary of the number of clusters identified over 50 simulated datasets under each risk model. We can see from the table that the +1 rule performs quite well in identifying the right number of clusters, even in situations where there is no clustering structure (i.e., the true number of clusters, An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e201.jpg, is 1).
Table 3
Table 3
Performance of the +1 rule for identifying the number of clusters in the simulation study.
We evaluated the performance of the algorithm for cluster assignment by comparing the cluster assignment estimated at An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e212.jpg with the true underlying cluster assignment chosen by the simulation design. For model An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e213.jpg, the clustering patterns for An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e214.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e215.jpg were not matched. In this case we treated the finer partitioning (given by Figure 1) that accommodates the clustering patterns of both An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e216.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e217.jpg as the true one. The accuracy of the estimated cluster assignment was measured as the proportion of subjects being assigned to the same cluster by both assignments (the estimated one and the true one). The accuracy summary over 50 replications under various considered models (except An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e218.jpg, the model with no clustering structure) is given in Table 4. It indicates that the cluster assignment algorithm appears to be able to partition the subjects (and genotypes) into the proper subgroups, provided that the correct number of clusters can be identified.
Table 4
Table 4
Performance of the algorithm for the cluster assignment.
We also evaluated the accuracy of the estimated coefficients (An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e222.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e223.jpg). Based on the true risk model (1), subject i with genotype An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e224.jpg was assigned to one of the risk models. We considered coefficients An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e225.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e226.jpg in that risk model to be the true coefficient values for this subject. Thus, subjects with their genotypes belonging to the same cluster would share the same true coefficient values. We used An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e227.jpg, the posterior median of An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e228.jpg assigned to subject i based on MCMC samples generated under An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e229.jpg, as the estimates for the underlying coefficients. The number of clusters An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e230.jpg was estimated by the +1 rule, as described previously. Since the odds for the genetic effect is not identifiable under the case-control design, we were interested only in the difference in An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e231.jpg between two groups. To present the result for each experiment, we shifted the value of An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e232.jpg, the posterior median of An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e233.jpg for subject i, by a constant value, which was chosen as An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e234.jpg, the median of An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e235.jpg among subjects in true cluster 1. As a result, the median level of the shifted posterior median (we still represent it as An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e236.jpg) among subjects in cluster 1 is 0. In Figure 2, Figure 3, Figure 4, and Figure 5, we present summaries of An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e237.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e238.jpg for each of the 50 experiments under models An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e239.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e240.jpg. Summarized results for model An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e241.jpg are given in Figures S1 and S2. Each boxplot is a summary of An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e242.jpg or An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e243.jpg among subjects in a true underlying cluster. From those figures, we can see that the estimates align with their true values quite well. Notice that these estimates were obtained under the model with the number of clusters estimated by the +1 rule.
Figure 2
Figure 2
Boxplots of the posterior medians of the intercept (An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e244.jpg) for subjects within each true cluster from each of 50 datasets simulated under the model An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e245.jpg.
Figure 3
Figure 3
Boxplots of the posterior medians of the log odds ratio (An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e249.jpg) for subjects within each true cluster from each of 50 datasets simulated under the model An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e250.jpg.
Figure 4
Figure 4
Boxplots of the posterior medians of the intercept (An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e253.jpg) for subjects within each true cluster from each of 50 datasets simulated under the model An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e254.jpg.
Figure 5
Figure 5
Boxplots of the posterior medians of the log odds ratio (An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e259.jpg) for subjects within each true cluster from each of 50 datasets simulated under the model An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e260.jpg.
We inspected the algorithm's convergence using the Gelman and Rubin's diagnostic plot [33], as implemented in the CODA R package [34]. For each model, we checked the convergence on the first 5 simulated datasets used in the above simulation studies, with 5 independent runs on each dataset. We found that the proposed algorithm can converge within 100,000 iterations, with the estimated shrinkage factor falling below the recommended threshold of 1.1. We also show in Figures S3 and S4 the posterior distributions for An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e264.jpg, resulting from each of 5 independent runs on the first simulated dataset under models An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e265.jpg, and An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e266.jpg. It is evident from these plots that we can obtain very consistent posterior distributions for parameters of interest among different runs on the same data.
Simulation Studies: Performance of the Resampling-Based Test
We conducted a simulation study to evaluate whether the proposed resampling-based test can maintain the proper type I error rate. We considered a disease risk model that had the main effects from G (with OR = 4 for genotypes falling into regions II and III vs. those in region I) and E (with a common OR of 4 for An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e267.jpg vs. An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e268.jpg), with no interaction between G and E. Regions are defined in Figure 1. We assumed a study sample size of 600 cases and 600 controls, and simulated 1000 datasets under the considered risk model as did before. For each dataset, we ran the resampling-based test with 1000 bootstrap steps for the estimation of the P-value, allowing the number of clusters to vary from 2 to 5. To reduce the computing time further, we ran the MCMC algorithm for 300,000 iterations with the burn-in period consisting of the first 200,000 iterations for each bootstrapped sample (as 200,000 iterations appear to be enough to ensure the convergence of the MCMC algorithm). We found that the proposed resampling-based test can maintain the proper type I error in the considered scenario, with estimated false positive rates of 0.055 and 0.097 under nominal levels of 0.05 and 0.10, respectively.
We compared the power of the proposed resampling-based test with two other standard interaction tests, the minP-SNP and minP-PC tests. Both test statistics are based on the minimum P-value observed on a set of univariate G-E interaction tests, with their significant levels being evaluated through a resampling-based procedure. The minP-SNP test is based on the set of single SNP-environment interaction tests, with each SNP-environment interaction test statistic being derived from the standard likelihood ratio test comparing two logistic regression models with and without the SNP-environment interaction term. The SNP effect is modeled with a categorical variable with three levels so each SNP-environment interaction test considered in the minP-SNP test is a 2 df test. The minP-PC is based on a set of tests that evaluate the interaction between a single principal component (PC) and the environment variable. PCs are derived from the principal component analysis of genotypes on all considered SNPs. Similar to the minP-SNP test, each PC-environment interaction test is derived from the likelihood ratio test comparing two logistic regression models with and without the interaction term. The PC effect is model as a continuous variable. Both minP-SNP and minP-PC were based on 15 univariate tests in the simulation study as there were a total of 15 SNPs in the considered chromosome region.
We evaluated the power under six different disease risk models, including An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e269.jpg, An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e270.jpg, and An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e271.jpg defined in Table 2, and the three additional models An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e272.jpg, An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e273.jpg, and An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e274.jpg. Model An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e275.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e276.jpg had just one functional SNP (the 10th SNP in Table 1). Model An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e277.jpg had 2 clusters, with coefficients in the formula (1) being An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e278.jpg for genotypes satisfying the condition An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e279.jpg (cluster 1), and An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e280.jpg for An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e281.jpg, or 2 (cluster 2). Model An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e282.jpg had 3 clusters, with coefficients An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e283.jpg for An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e284.jpg (cluster 1), An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e285.jpg for An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e286.jpg (cluster 2), and An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e287.jpg for An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e288.jpg (cluster 3). Model An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e289.jpg adopted a 2-cluster pattern observed in the analysis of the EAGLE study described later, with clusters 1 and 2 consisting of genotypes in red and blue colors, respectively (Figure 6), and with An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e290.jpg for cluster 1, and An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e291.jpg for cluster 2. The correlation between E and G was defined similarly as before. For a subject with genotype in cluster 1, the probabilities of being unexposed (An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e292.jpg) or exposed (An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e293.jpg) were 0.7 and 0.3; for a subject with genotype not in cluster 1, those probabilities were 0.4 and 0.6.
Figure 6
Figure 6
Cluster assignment for the EAGLE study.
Under each disease model, we simulated 500 datasets, with each consisting of 600 cases and 600 controls. The summary for the power comparison results is given in Table 5. It can be seen from the table that the proposed test has a clear advantage over two other standard interaction tests, especially when the underlying clustering pattern in the disease risk cannot be properly approximated by a single SNP or PC (e.g., under the model An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e294.jpg). Even under the model An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e295.jpg where the single SNP-environment interaction test based on the 10th SNP is most optimal, due to the multiple comparison adjustment, the minP-SNP test is only slightly more powerful than the proposed test. Under the model An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e296.jpg where the functional SNP (the 10th SNP) has a dominant effect in its interaction with E, the minP-SNP test compares less favorably with the proposed test since each of single SNP-environment interaction test considered in the minP-SNP global test spends one more df than necessary (as there are only two cluster in the model An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e297.jpg). The minP-PC test has the worst overall performance as it is very sensitive to its underlying assumption that the genetic effect is linearly correlated with one of the PC direction.
Table 5
Table 5
Power comparison under the type I error rate of 0.05.
Application in the EAGLE Study
We applied the proposed method to study the joint effect of cigarette smoking intensity (number of packs per day) and genetic variants in chromosome region 15q25.1 on lung cancer risk, using data generated by the EAGLE study. We focused on former and current smokers who had been genotyped on the 15 tagging SNPs. We also removed, as outliers, 8 subjects who had smoked more than 3 packs of cigarette per day. The final dataset for our analysis consisted of 1326 controls and 1720 cases. In the analysis we treated smoking intensity as a continuous variable and adjusted for the effects of gender and of age at diagnosis (categorized as: < = 60, 61–70, >70). We used a Bayesian model that allowed for G-E interaction, unless specified otherwise.
To determine the number of clusters, we ran the MCMC algorithm 20 times with different random seeds for each K, An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e304.jpg, in order to estimate the Monte Carlo standard error for DIC. Figure 7 shows the DIC values for each K over 20 replications. Based on the 1 SE rule given by (7), the optimal number of clusters was found to be 2, with its averaged DIC value being 3810.5. The partitioning of subjects into 2 clusters based on our proposed clustering algorithm is very consistent among 20 replications. The discrepancy rate between assignments from any two runs, defined as the proportion of subjects being assigned to two different clusters, is less than 1.4% under An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e305.jpg.
Figure 7
Figure 7
DIC plots for the Bayesian risk model allowing for gene–environment interaction.
Below we present the posterior summary based on a single run of our algorithm. To present the summary result, we first conducted a PC analysis on the case-control samples using genotypes at the 15 tagging SNPs as coordinates. In Figure 6, we plotted subjects by their first 2 PCs, with different colors representing their cluster assignments under An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e306.jpg. The cluster assignment was performed with the ensemble averaging method described above. Since subjects with the same genotype were represented by one point in the first 2-PC space, we can think of each point as either a unique genotype or a group of subjects sharing that genotype. There are 2240 subjects with 410 unique genotypes grouped into one cluster (shown in red in Figure 6) and 806 subjects with 252 unique genotypes grouped into another cluster (shown in blue in Figure 6). Notice that the two clusters are defined in term of estimated risk coefficient values (An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e307.jpg), but not in term of genotypes distribution in the PC space. That is why these two clusters do not appear to be well separated in the PC space.
To summarize the effect of smoking on a subject with genotype h, An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e308.jpg, we focused on An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e309.jpg, the posterior median of An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e310.jpg, with An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e311.jpg being the coefficient for smoking in the risk model assigned for a subject with genotype h. We can interpret An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e312.jpg as the posterior median of the OR associated with one more pack of cigarettes per day for a subject with genotype h. To summarize the genetic effect of genotype h, we used An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e313.jpg, the ratio of the posterior median of An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e314.jpg versus the posterior median of An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e315.jpg, with An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e316.jpg being the intercept for the risk model assigned for a subject with genotype h and An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e317.jpg being the chosen reference genotype. We chose the reference genotype An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e318.jpg as the one having the lowest posterior median of An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e319.jpg. We can interpret An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e320.jpg as the posterior median OR between genotype h and the reference genotype An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e321.jpg.
In Figure 8, we show a smoothed surface plot for the smoking effect measured by An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e322.jpg, and the genetic effect measured by An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e323.jpg for each genotype in the first 2-PC space, based on models run under An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e324.jpg. The smooth surface was estimated by the kriging method with each genotype's top 5 PCs (which account for over 85% of the total variation) as predictors. The plots were generated using the functions provided in the R package called “fields” [35]. It is evident from Figure 8 that neither the smoking effect nor the genetic effect is uniformly distributed over the genotype space. The smoking effect on a subject depends on his or her genotype. It is considerably lower on subjects who have their genotypes projected on the lower part of the PC space than on subjects with their genotypes projected elsewhere.
Figure 8
Figure 8
Smoothed surface plots of the posterior medians of the odds ratios for the genetic and smoking effects on the space of the first two principal components.
Some understanding of the 2nd PC is helpful for interpreting the patterns observed in Figure 8. From Table 1, we can see that the 2nd PC is driven mainly by the 8 SNPs with absolute loading values larger than 0.18, with the remaining having loading values less than 0.07. These 8 SNPs also turn out to be the ones that are most significantly associated with lung cancer risk (Table 1). We point out the fact that the loading value for each of the 8 SNPs is negative if the SNP's major allele is the high-risk allele, positive if its minor allele is the high-risk allele. As a result, a genotype's 2nd PC coordinate is positively correlated with its total number of risk alleles among the 8 SNPs (see Figure S5). Genotypes with smaller 2nd PC coordinates tend to have fewer high-risk alleles and are expected to have smaller ORs than genotypes having larger 2nd PC coordinates.
As a comparison, we also fit model (3), the Bayesian model without G-E interaction. The optimal model based on the 1 SE rule was again achieved at An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e325.jpg, with its averaged DIC value being 3817.5 over 20 runs (Figure S6). The DIC is noticeably higher than that obtained under the Bayesian model allowing for G-E interaction (DIC = 3810.5). This suggests that the model allowing for G-E interaction fits the data better than the model without the G-E interaction.
Finally, to demonstrate the existence of G-E interaction further, we applied the resampling-based test described in the Methods section. The observed test statistic was An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e326.jpg. We applied the resampling-based test by allowing the number of clusters to vary from 2 to 5. The estimated P-value based on 2000 bootstrap steps was 0.016, suggesting that there is a significant G-E interaction. On the other hand, for each of the 32 relatively common SNPs (MAF>0.04) in this considered 15q25.1 region, we conducted the standard SNP-smoking interaction test (2 df) based on the logistic regression model by treating the genotype as a three-level categorical variable. The smallest nominal P-value we observed was 0.021. The global minP-SNP test had a P-value of 0.29, which was well above the 0.05 level. We also conducted the PC-smoking interaction test by modeling each PC as a continuous variable. The smallest nominal P-value was 0.058. The P-value from the global minP-PC test was 0.62.
Our new method can evaluate gene–environment interaction at the gene/region level by integrating information observed on multiple SNPs in the considered gene/region with measures of environmental exposure. This method reduces the impact of loss of efficiency and bias from the misclassification error inherent in the single-marker approach that studies the environmental risk factor and one SNP at a time. The method provides a coherent inference framework that allows us to evaluate the environmental effect on different strata defined by the multi-locus genotype. A heterogeneous environmental effect across different strata would signal the presence of gene–environment interaction. We also propose a resampling-based test to formally test for the existence of gene–environment interaction.
Genetic variations within the 15q25.1 region have been shown to be associated with both lung cancer risk and smoking behaviors, such as the smoking intensity. Our analysis based on the EAGLE case-control study indicates that the smoking effect varies according to the subject's genetic makeup in the 15q25.1 region. The proposed resampling-based test also supports the existence of gene–environment interaction (P-value = 0.016). On the other hand, two conventional tests of gene–environment interaction based on the single-marker and single-PC approaches are far from significant. This highlights the advantage of our proposed method over standard approaches.
Accurate assessment of the environment risk exposure in the evaluation of gene–environment interaction is as important as identification of functional genetic variants or their proper surrogates [36]. In the EAGLE population-based case-control study, the information on smoking collected near the time of diagnosis is likely to provide a more accurate measure of risk exposure than information collected in other prospective cohort studies, such as the Prostate, Lung, Colorectal, and Ovarian (PLCO) Screening Trial [37], which does not reflect subsequent changes in smoking behavior like quitting. For example, we observed a much larger OR for smoking one more pack of cigarette per day (3.7, z statistic = 15.58) in the EAGLE study than in a lung cancer case-control study nested within the PLCO cohort (1.84, z statistic = 8.87), which includes 1390 lung cancer cases and 1924 controls. We also could not find evidence for smoking-15q25.1 interaction in this PLCO nested case-control study by using our proposed method. The difference in the smoking OR estimates and the absence of gene–environment interaction evidence using our method in the PLCO study may be a consequence of greater misclassification error in the smoking information assessment in the cohort study (PLCO) than in the case-control study (the EAGLE study).
In our method, we adopted the Potts model for the latent allocation vector for cluster assignment, as did Green and Richardson [6]. We used the MCMH algorithm [12] for simulating the regulating parameter of the Potts model. The MCMH algorithm overcomes the intractable normalizing constant problem that cannot be handled by the standard MH algorithm, while ensuring the consistency of the Monte Carlo estimates. Furthermore, this MCMH algorithm can readily be used for Potts models with certain restrictions on the sampling space by modifying the MH step to generate allocation vectors accordingly.
We proposed to use the +1 SE rule (or the +1 rule) based on DIC to identify the optimal number of clusters. We found through simulation studies that this approach works quite well. An alternative approach would be to treat the number of clusters as a random variable and integrate it into a Bayesian model [6]. A reversible jump MCMC algorithm [38] could be used to facilitate the move between sampling spaces with different dimensions. It would be interesting to compare the performance of these approaches, especially in term of their abilities to identify the proper number of clusters.
The proposed procedure relies upon a user-specified similarity metric to define the neighborhood among different genotypes in the Potts model. This neighborhood structure is used to induce the spatial dependency in the cluster assignment. In this paper, for a given genotype, we chose its 4 nearest genotypes as its neighbors. We found that the analysis result was not very sensitive to how the neighborhood is defined as long as the chosen Markov structure can generate an appropriate spatial dependence. For example, we reanalyzed the EAGLE study with two other types of Markov structures: one using the 3 nearest genotypes as neighbors, and the other one using the 5 nearest genotypes as neighbors. We show in Figure S7 the comparison of the posterior medians of the genetic effect (An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e327.jpg) and the smoking effect (An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e328.jpg) estimated for each subject between each of the new runs and the original runs under An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e329.jpg. It is clear that results from these three analyses are quite similar.
We used the prospective likelihood model in the Bayesian framework for case-control studies, even though the data were collected retrospectively according to a subject's disease status. According to [23], [39], given certain priors for parameters in the retrospective model, we can derive corresponding priors for the prospective model parameters that yield the same marginal posterior distributions as their retrospective counterparts. In this paper we consider both normal and uniform distributions as priors for the prospective model parameters. Although we cannot derive explicitly their corresponding priors for the retrospective model, our simulation studies demonstrated that the proposed prospective approach indeed had the desired performance when applying to data generated from case-control studies. The normal prior has also been used with the prospective likelihood model on case-control studies in other contexts (e.g., [40], [41]).
We have created an R package called BaDGE (Bayesian model for Detecting Gene Environment interaction) implementing the proposed Bayesian model and the associated post-processing procedures. The package is freely available from the website http://dceg.cancer.gov/bb/tools/badge. Currently, we consider only binary or continuous environmental exposure variable. It is straightforward to expand the algorithm to deal with a categorical (with more than 2 levels) environmental variable. To use the program, the user needs to specify priors (normal or uniform distribution) for parameters in the risk model and a prior (a uniform distribution) for the regulating parameter in the Potts model. The program will be expanded further to incorporate other prior functions. The running time for 200,000 iterations using 50 auxiliary samples in the MCMH algorithm on a dataset of 1000 cases and 1000 controls, with approximate 450 unique genotypes, is about 14 minutes on a Linux machine with the 2.8 GHz AMD Opteron processor. For a dataset with a large number of genotypes (e.g., over 1000), we can reduce the computing time by first dividing the whole genotype space into a few hundreds of subgroups through the PAM clustering algorithm [30] and then treating subgroups as genotypes in the proposed MCMC procedure. For example, the running time on the same testing example mentioned above decreases to 8 minutes if we regroup the genotypes into 250 unique subgroups. Another way to reduce the total number of genotypes is to limit tagging SNPs to those with a relatively large minor allele frequency. The resampling-based test could be computationally intensive for a dataset like the EAGLE study. We are still investigating whether it is possible to replace the computationally intensive resampling-based procedure with a one-step multiple comparison adjustment approach, similar to one used in [42], for the assessment of the statistical significance level.
Comparing to the standard single-marker or principal component based approaches, our proposed method is more computationally intensive, but it has several important advantages. First, it offers a more flexible way to model gene–environment interaction, especially complicated ones that cannot be depicted properly by the single-marker or PC based approaches, such as in situations where genetic variants (might or might not be directly genotyped) in multiple loci within the considered region interplay with the environment risk factor. Second, it provides an estimate of the environmental effect on subjects with a given joint genotype profile. This could be potentially useful to generate new hypotheses on how the gene and environment risk factor interacts. Third, as shown in the simulation studies and real application, the proposed resampling-based test derived from the Bayesian model has a more robust performance than the standard single-marker, or PC based testing procedures. For example, in situation where the single marker test is most appropriate, i.e., there is only one functional locus in the considered region, the proposed test is only slightly less powerful than the single-marker test. But it has a considerable power advantage over the standard tests when the underlying disease risk pattern cannot be properly approximated by a single SNP or PC.
Although our method is described in the context of gene–environment interaction detection, it is in fact quite general. It provides a general strategy for studying the interaction between an observed risk factor and a latent categorical variable not directly observed or clearly defined, but one that can be derived from a set of observed relevant covariates. For example, our method can be used with some minor modifications to evaluate the interaction between smoking behavior (e.g., smoking intensity) and a latent dietary pattern that can be derived from food frequency questionnaires. Also, it is possible to extend our method to study gene-gene interaction by introducing two latent factors to capture the effect of both genes, as was done in [43].
Figure S1
Boxplots of the posterior medians of the intercept (An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e330.jpg) for subjects within each true cluster from each of 50 datasets simulated under the model An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e331.jpg. (a). Boxplots of posterior medians of An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e332.jpg for subjects in cluster 1, with the true value given by the horizontal line in green; (b). Boxplots of posterior medians of An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e333.jpg for subjects in cluster 2, with the true value given by the horizontal line in red; (c). Boxplots of posterior medians of An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e334.jpg for subjects in cluster 3, with the true value given by the horizontal line in red. The posterior median of An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e335.jpg for each subject under a given simulated dataset was shifted by a constant value selected so that the median value of the shifted estimates for subjects in cluster 1 was zero.
(TIF)
Figure S2
Boxplots of the posterior medians of the log odds ratio (An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e336.jpg) for subjects within each true cluster from each of 50 datasets simulated under the model An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e337.jpg. (a). Boxplots of posterior medians of An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e338.jpg for subjects in cluster 1, with the true value given by the horizontal line in green; (b). Boxplots of posterior medians of An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e339.jpg for subjects in cluster 2, with the true value given by the horizontal line in green; (c). Boxplots of posterior medians of An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e340.jpg for subjects in cluster 3, with the true value given by the horizontal line in red.
(TIF)
Figure S3
Posterior distribution comparison among 5 independent runs under the model An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e341.jpg. Plot ij is the posterior distribution summary for the coefficient An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e342.jpg, based on the ith, An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e343.jpg, independent run on a dataset simulated under the model An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e344.jpg.
(TIF)
Figure S4
Posterior distribution comparison among 5 independent runs under the model An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e345.jpg. Plot ij is the posterior distribution summary for the coefficient An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e346.jpg, based on the ith, An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e347.jpg, independent run on a dataset simulated under the model An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e348.jpg.
(TIF)
Figure S5
The correlation between the total number of risk alleles and the 2nd principal components. Each point represents a unique multilocus genotype with its x-coordinate being the total number of risk alleles among those SNPs with high loading values (highlighted in Table 1 at the PC2 column) on the 2nd principal components, its y-coordinate being the 2nd principal component.
(TIF)
Figure S6
DIC plots for the Bayesian risk model without gene–environment interaction. For a given number of clusters, 20 DIC values were obtained by applying the model to the EAGLE study 20 times with different random seeds.
(TIF)
Figure S7
Pairwise correlations of estimates by the algorithm with different neighborhood structures. The MCMC procedure was applied to the EAGLE study using three different Markov structures, M1: using the 3 nearest genotypes as neighbors; M2: using the 4 nearest genotypes as neighbors; and M3: using the 5 nearest genotypes as neighbors. (a) Comparison of the estimated genetic effect (in term of the posterior median of An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e349.jpg) on each subject between the method using M1 and the one using M2; (b) Comparison of the estimated genetic effect between the method using M3 and the one using M2; (c) Comparison of estimated smoking effect (in term of the posterior median of An external file that holds a picture, illustration, etc.
Object name is pgen.1002482.e350.jpg) on each subject between the procedure using M2 and the one using M1; and (d) Comparison of the estimated smoking effect between the method using M3 and the one using M2.
(TIF)
Text S1
MCMC algorithm details.
(DOC)
Footnotes
The authors have declared that no competing interests exist.
This work was supported by the Intramural Research Program of the National Institutes of Health, National Cancer Institute, Division of Cancer Epidemiology and Genetics. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
1. Hindorff LA, Junkins HA, Hall PN, Mehta JP, Manolio TA. A catalog of published genome-wide association studies. 2011. Available at: www.genome.gov/gwastudies. Accessed August, 2011.
2. Lindstrom S, Schumacher F, Siddiq A, Travis RC, Campa D, et al. Characterizing associations and SNP-environment interactions for GWAS-identified prostate cancer risk markers-Results from BPC3. PLoS ONE. 2011;6:e17142. doi: 10.1371/journal.pone.0017142. [PMC free article] [PubMed]
3. Rothman N, Garcia-Closas M, Chatterjee N, Malats N, Wu X, et al. A multi-stage genome-wide association study of bladder cancer identifies multiple susceptibility loci. Nat Genet. 2010;42:978–984. [PMC free article] [PubMed]
4. Spitz MR, Amos CI, Dong Q, Lin J, Wu X. The CHRNA5-A3 region on chromosome 15q24–25.1 is a risk factor both for nicotine dependence and for lung cancer. J Natl Cancer Inst. 2008;100:1552–1556. [PMC free article] [PubMed]
5. Moore JH, Asselbergs FW, Williams SM. Bioinformatics challenges for genome-wide association studies. Bioinformatics. 2010;26:445–455. [PMC free article] [PubMed]
6. Green P, Richardson S. Hidden Markov models and disease mapping. J Am Stat Assoc. 2002;97:1055–1070.
7. Potts RB. Some generalized order-disorder transformations. Cambridge Philos Soc Math Proc. 1952;48:106–109.
8. Thomas DC, Stram DO, Conti D, Molitor J, Marjoram P. Bayesian spatial modeling of haplotype associations. Hum Hered. 2003;56:32–40. [PubMed]
9. Moltchanova EV, Pitkaniemi J, Haapala L. Potts model for haplotype associations. BMC Genet. 2005;6(Suppl 1):S64. [PMC free article] [PubMed]
10. Liu JS. Monte Carlo Strategies in Scientific Computing. New York: Springer; 2002.
11. Robert CP, Casella G. Monte Carlo Statistical Methods. New York: Springer; 1999.
12. Liang F, Liu C, Carroll RJ. Advanced Markov Chain Monte Carlo Methods: Learning from Past Samples. Wiley; 2010.
13. Liang F. Clustering gene expression profiles using mixture model ensemble averaging approach. JP J Biostat. 2008;2:57–80.
14. Molitor J, Parathomas M, Jerrett M, Richardson S. Bayesian profile regrression with an application to the national survey of children's health. Biostatistics. 2010;11:484–498. [PubMed]
15. Landi MT, Chatterjee N, Yu K, Goldin LR, Goldstein AM, et al. A genome-wide association study of lung cancer identifies a region of chromosome 5p15 associated with risk for adenocarcinoma. Am J Hum Genet. 2009;85:679–691. [PubMed]
16. Amos CI, Wu X, Broderick P, Gorlov IP, Gu J, et al. Genome-wide association scan of tag SNPs identifies a susceptibility locus for lung cancer at 15q25.1. Nat Genet. 2008;40:616–622. [PMC free article] [PubMed]
17. Thorgeirsson TE, Geller F, Sulem P, Rafnar T, Wiste A, et al. A variant associated with nicotine dependence, lung cancer and peripheral arterial disease. Nature. 2008;452:638–642. [PubMed]
18. Thorgeirsson TE, Gudbjartsson DF, Surakka I, Vink JM, Amin N, et al. Sequence variants at CHRNB3-CHRNA6 and CYP2A6 affect smoking behavior. Nat Genet. 2010;42:448–453. [PMC free article] [PubMed]
19. Saccone NL, Culverhouse RC, Schwantes-An TH, Cannon DS, Chen X, et al. Multiple independent loci at chromosome 15q25.1 affect smoking quantity: a meta-analysis and comparison with lung cancer and COPD. PLoS Genet. 2010;6:e1001053. doi: 10.1371/journal.pgen.1001053. [PMC free article] [PubMed]
20. Liu JZ, Tozzi F, Waterworth DM, Pillai SG, Muglia P, et al. Meta-analysis and imputation refines the association of 15q25 with smoking quantity. Nat Genet. 2010;42:436–440. [PMC free article] [PubMed]
21. Consortium TaG. Genome-wide meta-analyses identify multiple loci associated with smoking behavior. Nat Genet. 2010;42:441–447. [PMC free article] [PubMed]
22. Caporaso N, Gu F, Chatterjee N, Sheng-Chih J, Yu K, et al. Genome-wide and candidate gene association study of cigarette smoking behaviors. PLoS ONE. 2009;4:e4653. doi: 10.1371/journal.pone.0004653. [PMC free article] [PubMed]
23. Staicu A. On the equivalence of prospective and retrospective likelihood methods in case-control studies. Biometrika. 2010;97:990–996.
24. Seaman SR, Richardson S. Bayesian analysis of case-control studies with categorical covariates. Biometrika. 2001;88:1073–1088.
25. Borgs C, Chayes JT, Frieze A, Kim JH, Tetali P, et al. Torpid mixing of some Monte Carlo Markov chain algorithms in statistical physics; 1999; Washington, DC.
26. Miller P. Alternative to the Gibbs sampling scheme. 1993. Tech. Report, Institute of Statistics and Decision Science.
27. Ogata Y, Tanemura M. Likelihood analysis of spatial point patterns. J Royal Stat Soc, Ser B. 1984;46:496–518.
28. Spiegelhalter DJ, Best NG, Carlin BP, van der Linde A. Bayesian measures of model complexity and fit (with discussion). J R Stat Soc Ser B. 2002;64:583–639.
29. Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and Regression Trees. Monterey: Wadsworth and Brooks/Cole; 1984.
30. Kaufman L, Rousseeuw PJ. Finding Groups in Data: An Introduction to Cluster Analysis. Hoboken, NJ: Wiley-Interscience; 2005.
31. Efron B, Tibshirani RJ. An Introduction to the Bootstrap. New York: Chapman & Hall; 1993.
32. Yu K, Li Q, Bergen AW, Pfeiffer RM, Rosenberg PS, et al. Pathway analysis by adaptive combination of P-values. Genet Epidemiol. 2009;33:700–709. [PMC free article] [PubMed]
33. Gelman A, Rubin DB. Inference from iterative simulation using multiple sequences. Stat Sci. 1992;7:457–511.
34. Plummer M, Best N, Cowles K, Vines K. CODA: Convergence diagnosis and output analysis for MCMC. R News. 2006;6:7–11.
35. Fields Development Team. Fields: Tools for Spatial Data. 2006. National Center for Atmospheric Research. Boulder, CO.
36. Garcia-Closas M, Rothman N, Lubin J. Misclassification in case-control studies of gene–environment interactions: assessment of bias and sample size. Cancer Epidemiol Biomarkers Prev. 1999;8:1043–1050. [PubMed]
37. Hayes RB, Sigurdson A, Moore L, Peters U, Huang WY, et al. Methods for etiologic and early marker investigations in the PLCO trial. Mutat Res. 2005;592:147–154. [PubMed]
38. Green P. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika. 1995;82:711–732.
39. Seaman SR, Richardson S. Equivalence of prospective and restrospective models in the Bayesian analysis of case-control studies. Biometrika. 2004;91:15–25.
40. Costain DA. Bayesian partitioning for modeling and mapping spatial case-control data. Biometrics. 2009;65:1123–1132. [PubMed]
41. Raftery AE, Richardson S. Model selection for generalized linear models via GLIB, with application to epidemiology. In: Berry DA, Stangl DK, editors. Bayesian Biostatistics. New York: Marcel Dekker; 1996. pp. 321–354.
42. Tang W, Wu X, Jiang R, Li Y. Epistatic module detection for case-control studies: a Bayesian model with a Gibbs sampling strategy. PLoS Genet. 2009;5:e1000464. doi: 10.1371/journal.pgen.1000464. [PMC free article] [PubMed]
43. Chatterjee N, Kalaylioglu Z, Moslehi R, Peters U, Wacholder S. Powerful multilocus tests of genetic association in the presence of gene-gene and gene–environment interactions. Am J Hum Genet. 2006;79:1002–1016. [PubMed]
Articles from PLoS Genetics are provided here courtesy of
Public Library of Science