For most associations of common single nucleotide polymorphisms (SNPs) with common diseases, the genetic model of inheritance is unknown. The authors extended and applied a Bayesian meta-analysis approach to data from 19 studies on 17 replicated associations with type 2 diabetes. For 13 SNPs, the data fitted very well to an additive model of inheritance for the diabetes risk allele; for 4 SNPs, the data were consistent with either an additive model or a dominant model; and for 2 SNPs, the data were consistent with an additive or recessive model. Results were robust to the use of different priors and after exclusion of data for which index SNPs had been examined indirectly through proxy markers. The Bayesian meta-analysis model yielded point estimates for the genetic effects that were very similar to those previously reported based on fixed- or random-effects models, but uncertainty about several of the effects was substantially larger. The authors also examined the extent of between-study heterogeneity in the genetic model and found generally small between-study deviation values for the genetic model parameter. Heterosis could not be excluded for 4 SNPs. Information on the genetic model of robustly replicated association signals derived from genome-wide association studies may be useful for predictive modeling and for designing biologic and functional experiments.
When the association between a genetic marker and a trait is evaluated in a population-based study, there is rarely a priori biologic evidence supporting a particular genetic model of inheritance for the risk allele. Investigators may present and analyze the results of genetic association studies in various ways. If the risk is the same for heterozygotes, carrying 1 copy of the high-risk allele a, as for homozygotes, then the underlying genetic model is dominant, and therefore the data are dichotomized into “carriers” versus “noncarriers.” If 2 copies of a are required for the risk to be different from the baseline risk, then the genetic model is recessive. The additive model assumes that on a log scale, the risk in carriers of 2 copies of a is double the risk in heterozygotes. Usually a strong preference for 1 genetic model is unjustified (1). Exceptions exist, as in the case of null genotypes of enzyme-coding genes, where extensive data on enzymatic activity may be available, but typically the model of inheritance used is suggested by convenience or even tradition in the research field. For example, it is common in the analysis of associations with rare susceptibility alleles to analyze data assuming a dominant model, perhaps because it is recognized that a recessive-model analysis would have negligible statistical power. Usually the rationale used in choosing 1 particular model is not discussed at all.
In theory, one could try to examine the fit of different models to the data. However, the ability to draw inferences from a single study's data is limited. A meta-analysis of many studies improves the power to demonstrate associations consistently and may also allow for a stronger exploration of how the available data fit different models of inheritance, along with obtaining a summary effect size. In this regard, a Bayesian model has been suggested; the genetic model in a meta-analysis is represented by an unknown parameter which, when estimated across studies, allows us to learn about the underlying model (2–4).
Until now, this method has been applied primarily to meta-analyses of genetic associations from the candidate gene era (3, 5). This has posed difficulties in exploiting its full potential, given the poor replication record of such associations (6). If an association is not confirmed, modeling of the best-fitting genetic model may produce a fit to the noise and errors in the data rather than the true underlying biology. However, the advent of genome-wide association studies (7) and large-scale consortia (8) has transformed the evidence on genetic associations. For several diseases, we now have robust evidence on a number of common genetic variants. Using data from associations with robust statistical support and large-scale evidence from many data sets, one can revisit the question of genetic model fit more efficiently. This can yield some useful insights into the underlying biology of the identified associations and may also be informative with regard to the best analysis plan for genome-wide association studies. In the setting of an agnostic approach, investigators often pick 1 genetic model for analyzing the data. Most often this is the additive (per-allele or codominant) model, because of statistical power considerations, and thus most associations derived from genome-wide association studies are usually presented as per-allele risks (9). It would be useful to examine whether other models might fit these data equally well or better.
Our aim in this paper was to explore the potential of Bayesian meta-analysis to inform us about the underlying genetic model for 17 single nucleotide polymorphisms (SNPs) that have robust statistical support for an association with type 2 diabetes. Type 2 diabetes is a prime paradigm in which a large number of common variants have already been identified with a successful application of large-scale collaborative research. We analyzed data from the DIAGRAM [Diabetes Genetics Replication and Meta-Analysis] consortium that incorporate, through meta-analysis, data from 3 genome-wide association studies and from replication data sets (10–14).
The field of type 2 diabetes genetics has witnessed rapid progress in the identification of robustly associated susceptibility loci over the last few years. The list of established disease-associated variants continues to grow. We examined genotype data from 19 case-control studies for 17 of these established type 2 diabetes loci (see Web Table 1, posted on the Journal’s Web site (http://aje.oxfordjournals.org/)). We obtained data generated principally through the efforts of the DIAGRAM consortium. Details on the contributing data sets can be found elsewhere (10–14). For each study, we used the raw genotype data in cases and controls. The data sets were derived from 3 genome-wide association studies (10–12) and additional replication teams. Results regarding the strength of association for each of these polymorphisms in each of the discovery genome-wide association studies are provided elsewhere (14). For the 3 genome-wide association studies, SNPs were excluded if the controls violated Hardy-Weinberg equilibrium at P< 0.000001 (P<0.0001 in the Wellcome Trust Case Control Consortium study), given the large multiplicity of analyses; the threshold for Hardy-Weinberg equilibrium testing in the replication data sets was P<0.001 (P<0.05 in the Nurses’ Health Study).
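The Hardy-Weinberg screening of control genotypes described above can be illustrated with a short sketch. It assumes a 1-degree-of-freedom chi-square goodness-of-fit test; the text does not state which test (chi-square or exact) each study used, and the function name and example counts are hypothetical.

```python
import math

def hwe_chi2_pvalue(n_AA, n_Aa, n_aa):
    """P value of a 1-df chi-square test of Hardy-Weinberg equilibrium.

    Assumes both alleles are observed (otherwise an expected count is 0).
    """
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)          # frequency of allele A
    q = 1.0 - p                              # frequency of allele a
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_AA, n_Aa, n_aa)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom
    return math.erfc(math.sqrt(chi2 / 2.0))

# Hypothetical control counts, flagged at an illustrative threshold
p_value = hwe_chi2_pvalue(360, 480, 160)
print(p_value, p_value < 0.001)
```

A study-specific threshold, such as those listed above, would then be applied to the returned P value.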
We extended a model initially suggested by Minelli et al. (2). Consider a biallelic locus, with A being the “low-risk” allele and a the allele associated with higher risk of type 2 diabetes. The association of the locus with the disease is then reflected in 2 odds ratios (ORs); choosing the homozygotes AA as the reference group, $\mathrm{OR}_{Aa}$ is the odds ratio for the heterozygotes and $\mathrm{OR}_{aa}$ is the odds ratio for the homozygotes aa in comparison with the reference group. The underlying genetic model refers to the relation between these 2 odds ratios. In the general case, $\log(\mathrm{OR}_{Aa}) = \lambda \log(\mathrm{OR}_{aa})$, with $\lambda = 1$ for a dominant model, $\lambda = 0.5$ for an additive model, and $\lambda = 0$ for a recessive model for the diabetes risk allele. One may argue, however, that λ can be left unspecified so that the data inform the model. Each study, though, provides only 1 estimate of each genotype-specific effect and therefore only 1 estimate of λ.
If there is little reason to assume that the genetic model varies across studies, a fixed-effects summary estimate of λ can be obtained from a meta-analysis. However, if there are reasons to believe that the model of inheritance might vary across studies (if, for example, the studies refer to different ethnic groups), a hierarchical random-effects model for λ can be fitted.
The meta-analysis model is outlined below. We fit it within a Bayesian framework to take advantage of its flexibility. An important asset is its ability to incorporate full uncertainty in all model parameters (including the heterogeneity parameter $\tau^2$ and the genetic model parameter λ). Ultimately, it is possible to estimate the probability of each genetic model's being the true one.
The evaluated type 2 diabetes studies had a case-control design, and therefore they are appropriately analyzed using a retrospective likelihood approach. Consider the 2 vectors of observed genotype counts for the 3 genotypes (AA, Aa, aa) in the cases and controls of study i, $ca_i = (ca_{AA,i}, ca_{Aa,i}, ca_{aa,i})$ and $co_i = (co_{AA,i}, co_{Aa,i}, co_{aa,i})$. The basic parameters to model are 2 probability vectors for the 3 genotypes, given the case or control status (15). The likelihood is multinomial for both cases and controls. In study i, the cases (denoted ca) and the controls (denoted co) relate to the parameters $\pi_i^{ca} = (\pi_{AA,i}^{ca}, \pi_{Aa,i}^{ca}, \pi_{aa,i}^{ca})$ and $\pi_i^{co} = (\pi_{AA,i}^{co}, \pi_{Aa,i}^{co}, \pi_{aa,i}^{co})$, each summing to 1, through the multinomial likelihoods

$$ca_i \sim \mathrm{Multinomial}(\pi_i^{ca}, N_i^{ca}), \qquad co_i \sim \mathrm{Multinomial}(\pi_i^{co}, N_i^{co}),$$

where $N_i^{ca}$ and $N_i^{co}$ are the total numbers of cases and controls in study i.
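As an illustration of the retrospective likelihood, the multinomial log-likelihood of one study's genotype counts can be sketched as follows. The function names and the example counts are hypothetical, and the sketch is not the software used for the analyses.

```python
import math

def multinomial_log_lik(counts, probs):
    """Log-likelihood of genotype counts (AA, Aa, aa) given genotype probabilities."""
    n = sum(counts)
    # Log of the multinomial coefficient n! / (k_AA! k_Aa! k_aa!)
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(k + 1) for k in counts)
    # Zero counts contribute nothing, which also avoids log(0)
    return log_coef + sum(k * math.log(p) for k, p in zip(counts, probs) if k > 0)

def study_log_lik(cases, controls, pi_ca, pi_co):
    """Retrospective log-likelihood of one study: cases plus controls."""
    return multinomial_log_lik(cases, pi_ca) + multinomial_log_lik(controls, pi_co)

# Hypothetical counts and probabilities for a single study
print(study_log_lik((309, 494, 197), (360, 480, 160),
                    (0.31, 0.49, 0.20), (0.36, 0.48, 0.16)))
```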
As discussed above, it is not necessary to assume a particular inheritance model. We can parameterize the 2 log odds ratios in study i as $\log(\mathrm{OR}_{Aa,i}) = \lambda_i \mu_i$ and $\log(\mathrm{OR}_{aa,i}) = \mu_i$.
Then the 3 genetic models can be identified as follows.
Dominant: λ=1, so that mutant homozygotes and heterozygotes have the same disease odds.
Codominant: λ=0.5, so that the log odds ratio for homozygotes is double that for heterozygotes (the additive model on the log scale).
Recessive: λ=0, suggesting that heterozygotes have no higher disease odds than wild-type homozygotes.
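The correspondence between λ and the 3 genetic models can be checked numerically. The sketch below computes the heterozygote odds ratio implied by a given homozygote odds ratio; the function name and the value 1.44 are illustrative assumptions, not estimates from the data.

```python
import math

def heterozygote_or(or_aa, lam):
    """Odds ratio for Aa versus AA implied by log(OR_Aa) = lam * log(OR_aa)."""
    return math.exp(lam * math.log(or_aa))

models = {"recessive": 0.0, "codominant": 0.5, "dominant": 1.0}
for name, lam in models.items():
    # With OR_aa = 1.44: no excess risk, the square root of OR_aa, or OR_aa itself
    print(name, round(heterozygote_or(1.44, lam), 3))
```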
Generally it should be reasonable to assume that risks are inherited in a similar way in different populations, so we assumed a common λ in our analyses; that is, $\lambda_i = \lambda$. We refer to this as model 1, which represents our main analysis. To allow for heterogeneity in the underlying inheritance model across studies, we also examined random effects as sensitivity analyses, in which a random parameter $\lambda_i$ underlies each study and these parameters are drawn from a common distribution. Here $\lambda_i$ may be restricted to lie between 0 and 1 (i.e., $\lambda_i \sim N(\lambda_R, \tau_\lambda^2)\,I(0,1)$ (model 2)) or may be unrestricted and thus also take negative values and values above 1 (i.e., $\lambda_i \sim N(\lambda_U, \tau_\lambda^2)$ (model 3)). When $\lambda_i$ is restricted to lie between 0 and 1, the risk conferred by heterozygosity is forced to range between no risk and the risk conferred by homozygosity. With unrestricted values of $\lambda_i$, additional possibilities are allowed; for example, heterozygosity may be protective while homozygosity increases risk (negative values of λ), or heterozygotes may have a larger increase in risk than homozygotes (λ>1).
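The difference between the restricted and unrestricted random-effects distributions for the study-specific λi can be illustrated by simulation. This is a minimal sketch using rejection sampling for the truncated normal; the mean of 0.9 and standard deviation of 0.14 are illustrative values chosen here, and the function name is hypothetical.

```python
import random

def sample_lambda_i(mean, sd, restricted, rng):
    """Draw one study-specific lambda_i from N(mean, sd^2).

    If restricted is True, draws outside [0, 1] are rejected (as in model 2);
    otherwise any value is accepted (as in model 3).
    """
    while True:
        x = rng.gauss(mean, sd)
        if not restricted or 0.0 <= x <= 1.0:
            return x

rng = random.Random(0)
model2 = [sample_lambda_i(0.9, 0.14, True, rng) for _ in range(1000)]
model3 = [sample_lambda_i(0.9, 0.14, False, rng) for _ in range(1000)]

# Model 2 never leaves [0, 1]; model 3 can exceed 1 (heterozygote risk above homozygote risk)
print(max(model2) <= 1.0, max(model3) > 1.0)
```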
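The case probabilities are parameterized in terms of the log odds ratios and the probabilities in the controls. With AA as the reference genotype, the definitions of the 2 odds ratios give

$$\pi_{AA,i}^{ca} = \frac{\pi_{AA,i}^{co}}{D_i}, \qquad \pi_{Aa,i}^{ca} = \frac{\pi_{Aa,i}^{co}\, e^{\lambda_i \mu_i}}{D_i}, \qquad \pi_{aa,i}^{ca} = \frac{\pi_{aa,i}^{co}\, e^{\mu_i}}{D_i},$$

where $D_i = \pi_{AA,i}^{co} + \pi_{Aa,i}^{co}\, e^{\lambda_i \mu_i} + \pi_{aa,i}^{co}\, e^{\mu_i}$ normalizes the case probabilities to sum to 1.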
A fixed-effects meta-analysis may be undertaken assuming $\mu_i = \mu$ for each i, or a random-effects meta-analysis assuming $\mu_i \sim N(\mu, \tau^2)$, with τ measuring the extent of heterogeneity.
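The parameterization of the case probabilities in terms of the control probabilities and the 2 log odds ratios can be sketched in a few lines. The function names and the numeric inputs are hypothetical; the second function simply recovers the odds ratios as a consistency check.

```python
import math

def case_probs(pi_co, lam, mu):
    """Case genotype probabilities (AA, Aa, aa) from control probabilities.

    Uses log(OR_Aa) = lam * mu and log(OR_aa) = mu, with AA as the reference.
    """
    w_AA = pi_co[0]
    w_Aa = pi_co[1] * math.exp(lam * mu)
    w_aa = pi_co[2] * math.exp(mu)
    total = w_AA + w_Aa + w_aa               # normalizing constant
    return (w_AA / total, w_Aa / total, w_aa / total)

def odds_ratios(pi_ca, pi_co):
    """Odds ratios of Aa and aa versus AA, comparing cases with controls."""
    or_Aa = (pi_ca[1] / pi_ca[0]) / (pi_co[1] / pi_co[0])
    or_aa = (pi_ca[2] / pi_ca[0]) / (pi_co[2] / pi_co[0])
    return or_Aa, or_aa

pi_co = (0.36, 0.48, 0.16)                   # hypothetical control probabilities
pi_ca = case_probs(pi_co, lam=0.5, mu=math.log(1.44))
print(pi_ca, odds_ratios(pi_ca, pi_co))
```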
We used minimally informative normal priors centered at 0 for the location parameters $\mu_i$ and μ. For the probabilities $\pi_i^{ca}$, we used priors that approximated the Dirichlet distribution and gave equal prior probabilities of the diabetes condition to all 3 genotypes. For the heterogeneity standard deviation, we placed a half-normal prior, $\tau \sim N(0, 1)$ with $\tau > 0$.
For the genetic model parameter λ, which is the focus of our research, we implemented 4 prior distributions. The first 3 use the flexible Beta distribution (previously used by Minelli et al. (2)), which returns λ values between 0 and 1:
a) λ~Beta(1, 1),
b) λ~Beta(0.5, 0.5), and
c) λ~Beta(0.7, 0.7).
The first prior has a flat uniform shape between 0 and 1. However, when the model is recessive or dominant (i.e., λ is at the edges of the distribution), the first prior (prior a) tends to shift the parameter values toward the mean of the distribution (0.5). Therefore, the second prior, Beta(0.5, 0.5), gives higher probabilities at the upper and lower ends of the interval. This prior has the drawback that when the true model is additive, it tends to shift the estimate towards the edges of the distribution. The third prior represents a compromise between the above 2 situations; we used the third prior in the main analysis and the other 2 in sensitivity analyses.
We also introduce a discrete prior. This fourth prior reflects situations in which λ is allowed to take only the discrete values corresponding to the 3 genetic models. Therefore,
d) λ~cat(0, 0.5, 1), with corresponding probabilities pR,pC,pD, where we set all models to be equally probable and thus pR = pC = pD = 1/3.
Figure 1 presents the 4 priors.
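The shapes of the 4 priors can be compared numerically with a short sketch. It takes prior a to be Beta(1, 1), which is the uniform distribution on the interval 0 to 1, together with the Beta(0.5, 0.5) and Beta(0.7, 0.7) priors named in the text; the variable and function names are hypothetical.

```python
import math

def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x, for 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

BETA_PRIORS = {"a": (1.0, 1.0), "b": (0.5, 0.5), "c": (0.7, 0.7)}
# Prior d: equal mass on the recessive, codominant, and dominant values of lambda
DISCRETE_PRIOR = {0.0: 1 / 3, 0.5: 1 / 3, 1.0: 1 / 3}

for label, (a, b) in BETA_PRIORS.items():
    # Density near the lower edge, at the middle, and near the upper edge
    print(label, [round(beta_pdf(x, a, b), 2) for x in (0.05, 0.5, 0.95)])
```

Priors b and c put more density near 0 and 1 than the flat prior does, with prior c lying between the other 2 at λ = 0.5.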
For the 2 models in which random small differences in $\lambda_i$ are assumed, the prior on the mean of the random-effects distribution for the restricted case is a Beta prior, $\lambda_R \sim \mathrm{Beta}(0.7, 0.7)$. For the unrestricted case, the prior reflects the ability to incorporate heterosis and negative λ values, but it is truncated to the interval −1 to 2 ($\lambda_U \sim N(0, 1000)\,I(-1, 2)$), since values outside of this very wide range are not very plausible. For the genetic model heterogeneity standard deviation $\tau_\lambda$, we placed a half-normal prior: $\tau_\lambda \sim N(0, 1)$ with $\tau_\lambda \ge 0$.
We then estimated the posterior distribution of λ. For the first 3 priors, we obtained the median value and its 95% credibility interval and also evaluated the impact on the estimated odds ratios and heterogeneity parameter (τ). For prior d, the posterior distribution directly shows the probability of each model, as the probability that λ takes each of the 3 alternative values.
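How prior d yields a posterior probability for each genetic model can be illustrated with a deliberately simplified sketch. It assumes a common μ across studies (fixed effects), plugs in the observed control genotype frequencies instead of placing a prior on them, and integrates μ over a grid under a normal prior with standard deviation 10; none of these simplifications is the full model described above, and all names and counts are hypothetical.

```python
import math

def case_probs(pi_co, lam, mu):
    """Case genotype probabilities implied by log(OR_Aa) = lam*mu, log(OR_aa) = mu."""
    w = (pi_co[0], pi_co[1] * math.exp(lam * mu), pi_co[2] * math.exp(mu))
    total = sum(w)
    return tuple(x / total for x in w)

def log_lik_cases(studies, lam, mu):
    """Case log-likelihood over studies (terms that do not depend on lam are dropped)."""
    ll = 0.0
    for cases, controls in studies:
        n_co = sum(controls)
        pi_co = tuple(c / n_co for c in controls)   # plug-in simplification
        pi_ca = case_probs(pi_co, lam, mu)
        ll += sum(k * math.log(p) for k, p in zip(cases, pi_ca))
    return ll

def log_sum_exp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def model_posterior(studies, mu_sd=10.0):
    """Posterior probability of each model under equal prior probabilities of 1/3."""
    grid = [i / 200 for i in range(-400, 401)]      # mu from -2 to 2
    models = {"recessive": 0.0, "codominant": 0.5, "dominant": 1.0}
    log_marg = {}
    for name, lam in models.items():
        terms = [log_lik_cases(studies, lam, mu) - 0.5 * (mu / mu_sd) ** 2
                 for mu in grid]
        log_marg[name] = log_sum_exp(terms)         # shared constants cancel below
    norm = log_sum_exp(list(log_marg.values()))
    return {name: math.exp(v - norm) for name, v in log_marg.items()}

# Hypothetical (cases, controls) genotype counts in the order AA, Aa, aa
studies = [((617, 988, 395), (720, 960, 320)),
           ((926, 1481, 593), (1080, 1440, 480)),
           ((309, 494, 197), (360, 480, 160))]
posterior = model_posterior(studies)
print(posterior)
```

Because the hypothetical counts were constructed to follow a per-allele pattern, the codominant model receives the largest posterior probability in this example.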
Table 1 presents schematically all of the alternative models and their combinations with the priors.
Eight SNPs have been approximated in some studies by genotyping a nearby SNP in high linkage disequilibrium (Web Table 1). Since the suggested model might be affected by the use of proxies, we performed sensitivity analysis by including only studies in which the investigators had genotyped the main SNP of interest.
In Table 2, we present the odds ratios for heterozygotes and homozygotes and the median values and 95% credibility intervals for λ using model 1 and prior c (i.e., λ ~ Beta(0.7, 0.7)); we also show the discrete probability of each of the 3 models of inheritance based on prior d. All odds ratios refer to the high-risk allele for each SNP, and each meta-analysis comprises 19 studies, unless stated otherwise. Of the 312 data sets of controls, 30 had P values less than 0.05 in Hardy-Weinberg equilibrium testing; only 4 had P < 0.001 in such testing.
In the majority of cases, the suggested most likely model was the additive model. For 4 SNPs, the underlying model seemed to lie between the additive model and the dominant model for the risk allele (at the NOTCH2, CDC123/CAMK1D, TSPAN8/LGR5, and TCF2 loci), with corresponding probabilities supporting the dominant model being 25%, 15%, 11%, and 31%. These are the 4 SNPs for which the estimated ORaa’s for homozygotes were the weakest, ranging from 1.10 to 1.16. For 2 SNPs (at the THADA and WFS1 loci), the model seemed to lie between the additive and recessive models for the risk allele, with probabilities supporting the recessive model being 39% and 17%, respectively. In both cases, the ORAa for heterozygotes was weak (1.07 and 1.05, respectively).
For all 17 SNPs, at least 1 of the dominant and recessive models could be excluded. Figure 2 shows representative posterior distributions according to prior c.
The 3 different priors had some impact on the median λ estimate for model 1 but not on the overall conclusions. In cases where there was high confidence about the underlying model (more than 95% probability for a specific model according to prior d)—for example, the additive models suggested for PPARG, ADAMTS9, IGF2BP2, CDKAL1, JAZF1, SLC30A8, CDKN2A/B, HHEX/IDE, TCF7L2, KCNJ11, and FTO—the absolute differences for the 3 Beta distribution priors were no more than 0.01 in the median posterior λ, no more than 0.09 in the 2.5% credibility bound, and no more than 2% in the 97.5% credibility bound. For the other 6 SNPs, the absolute differences in the median λ went up to 0.08, and the respective figures for the upper and lower 95% credibility bounds were 0.05 and 0.05 (see Web Table 2 (http://aje.oxfordjournals.org/)).
There was also no material variation in the odds ratio point estimates or the heterogeneity standard deviation τ with the different priors (Web Table 2). Note that Bayesian estimation of the effects incorporates full uncertainty in the estimates, including the uncertainty in the heterogeneity variance $\tau^2$, and therefore gives wider intervals than the random-effects model fitted with frequentist approaches. In the main analysis (using prior c), for 9 genetic variants (at the IGF2BP2, CDKAL1, JAZF1, SLC30A8, CDKN2A/B, CDC123/CAMK1D, TCF7L2, KCNJ11, and FTO loci), the lower bound of the 95% credibility interval for the effect of heterozygosity was higher than 1.05, and this also applied to the effect for homozygotes. The 95% credibility intervals for the effect of NOTCH2 included the null value for both heterozygotes and homozygotes. For the remaining SNPs (in/near THADA, PPARG, ADAMTS9, HHEX/IDE, TSPAN8/LGR5, and TCF2), the lower bound of the 95% credibility interval for the effect of heterozygosity was rather low (lower than 1.05), while the lower bounds for the effect of homozygosity were as high as 1.15.
The variation in these results based on the alternative priors was not substantial (Web Table 2).
Table 3 shows sensitivity analysis results for the 8 index SNPs for which proxies have been used in some studies. In 6 cases (the CDKAL1, JAZF1, HHEX/IDE, TCF7L2, TCF2, and FTO loci), exclusion of proxies did not seem to result in material changes regarding the genetic model parameter or the probability for each genetic model, although the uncertainty in all parameters increased in some cases because of the reduced total sample size. For NOTCH2, the probability supporting the dominant model dropped from 25% to 11%. Confidence for the additive model for IGF2BP2 was slightly challenged, giving 11% probability for a dominant model.
We fitted the model allowing the study-specific genetic model parameters λi to vary randomly with values restricted between 0 and 1. Table 4 gives results for the median λR of the distribution and the parameter τλ, which shows the magnitude of the genetic-model heterogeneity fitted using prior c. The median heterogeneity standard deviation was no higher than 0.14. For NOTCH2, the credibility interval was shifted downwards (0.32, 0.99 became 0.23, 0.78), and the recessive model could be excluded for ADAMTS9. The CDC123/CAMK1D and TSPAN8/LGR5 upper limits for median λ were lower (0.98 and 0.97 became 0.86 and 0.84, respectively). For all other SNPs, no material changes in the median λ value were observed, and the credibility intervals were comparable to those from the fixed-effects model. Consequently, no major changes regarding the underlying model or the estimated odds ratios were observed.
Table 5 shows results from the random-effects model in which $\lambda_i$ was allowed to vary randomly without restriction (model 3). Important changes were observed for the NOTCH2 locus, for which the credibility interval for the median λ now covered the recessive, additive, and dominant models and included the possibility of heterosis (see Web Figure 1 (http://aje.oxfordjournals.org/)). THADA gave a 95% credibility interval that covered the majority of the allowed negative values, with a considerable shift of the median from 0.36 to 0.10. For ADAMTS9, the interval included the recessive model, whereas with the restricted model the lower 95% credibility bound was at 0.26 (Web Figure 1). For 3 further loci (CDC123/CAMK1D, TSPAN8/LGR5, and TCF2), the scenario of heterosis could not be excluded. However, the unrestricted analysis with model 3 gave results consistent with those from the restricted model (model 2) and the fixed-effects model (model 1) for the IGF2BP2, CDKAL1, JAZF1, SLC30A8, CDKN2A/B, KCNJ11, and FTO loci.
We applied and extended a genetic model-free Bayesian approach to investigate the fit of type 2 diabetes associations to various genetic models of inheritance. Regardless of the prior distribution used, our analyses found that most of the common genetic variants that show robust associations with type 2 diabetes risk fitted best to an additive model. However, several exceptions existed, where either recessive or dominant models for the risk allele also had substantial support as alternative options, besides the additive model. At least 1, if not 2, of the 3 main genetic models could be excluded with considerable certainty for all 17 associations.
The 17 SNPs that we analyzed all had considerable statistical support, with P values less than $2 \times 10^{-7}$ in joint analyses (by fixed-effects models and, for most, also by random-effects models) (13, 14). They also passed several quality checks, including Hardy-Weinberg equilibrium testing, although modest deviations from such equilibrium were still possible. Overall, the credibility of these associations was rather high.
The choice of genetic model in genome-wide association studies remains open and arbitrary, but most investigators seem to adopt an additive (per-allele) analysis. Exclusively recessive-fit and exclusively dominant-fit associations may be discovered if a more comprehensive analytical approach is followed. Studies examining variants in linkage disequilibrium with the causal variant (and not the causal variant itself) may have higher power to detect an association under the additive model. Therefore, the established type 2 diabetes variants we are investigating are likely to have a higher relative representation of such loci.
We also evaluated the possibility of between-study heterogeneity in the genetic model. For most SNPs, the median between-study standard deviation was small or modest, and its consideration did not much change the overall inferences about the most likely genetic model. An unrestricted analysis showed that heterosis is not common but still remains a plausible scenario, since it could not be excluded for 4 of the 17 SNPs. Given that genome-wide association studies use target SNPs that are unlikely to be the true culprits, different linkage disequilibrium may introduce such heterogeneity in the genetic model across different populations. This is not very likely in the examined data, since all study populations were of Caucasian descent. However, this might become a more serious issue if populations of different ancestry were to be examined.
A limitation of a genetic model-free Bayesian model is that it is driven by the data at hand in identifying the most likely genetic model. The ability to extrapolate to other data and populations is an open challenge. Moreover, the proposed modeling should be used primarily for associations that are already supported by a substantial body of evidence, based on several studies and conventional meta-analysis thereof. Application of these methods to data from associations that are likely to represent false-positives may result in overfitting to noise signals. In the absence of robust support for the presence of an association, these analyses should be recognized as exploratory.
Knowledge of the best-fitting genetic model may be important in optimizing the use of these markers for predictive purposes. At the current stage, genetic markers in type 2 diabetes explain only about 2.5% of the risk variance and would result in a predictive area under the receiver operating characteristic curve (AUC) of only 0.60, while traditional predictors (body mass index, sex, and age) already result in an AUC of 0.78 (17). As more markers accrue, proper modeling may be useful for increasing predictive ability. However, the Bayesian model that we used further highlights the challenges and difficulties of using this information for predictive purposes. When we considered the full scale of uncertainty in parameters, the 95% credibility intervals of the odds ratios were considerably wide. For NOTCH2, these intervals even crossed the null. This means that the effects of these genetic markers in some populations may be very small or even nonexistent. This adds an extra note of caution to the possibility of predictive testing in the general population based on this information (18).
Although some of the established type 2 diabetes susceptibility loci (like PPARG and KCNJ11) have been known for several years, the field has not progressed to unequivocal identification of the truly causal variants. Consequently, statistical inference regarding the true genetic model under which these loci act has been difficult. In addition, there is a paucity of biologic data that might help address the genetic model question. Identification of the best-fitting model through Bayesian meta-analysis may be helpful in suggesting how biologic and functional experiments should be set up and what model should be used in them.
Author affiliations: Clinical and Molecular Epidemiology Unit and Clinical Trials and Evidence-Based Medicine Unit, Department of Hygiene and Epidemiology, School of Medicine, University of Ioannina, Ioannina, Greece (Georgia Salanti, John P. A. Ioannidis); Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, United Kingdom (Lorraine Southam, Eleftheria Zeggini, Mark I. McCarthy, Andrew Morris); Institute of Musculoskeletal Sciences, Botnar Research Centre, Nuffield Orthopaedic Centre, University of Oxford, Oxford, United Kingdom (Lorraine Southam); Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, Massachusetts (David Altshuler, Kristin Ardlie, Benjamin F. Voight); Center for Human Genetic Research, Massachusetts General Hospital, Boston, Massachusetts (David Altshuler, Benjamin F. Voight); Diabetes Unit, Massachusetts General Hospital, Boston, Massachusetts (David Altshuler); Department of Molecular Biology, Massachusetts General Hospital, Boston, Massachusetts (David Altshuler); Department of Medicine, Harvard Medical School, Boston, Massachusetts (David Altshuler, Benjamin F. Voight); Department of Genetics, Harvard Medical School, Boston, Massachusetts (David Altshuler); Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge, United Kingdom (Inês Barroso, Felicity Payne, Eleftheria Zeggini); Department of Biostatistics and Center for Statistical Genetics, School of Public Health, University of Michigan, Ann Arbor, Michigan (Michael Boehnke, Laura J. Scott); Department of Nutrition, Harvard School of Public Health, Boston, Massachusetts (Marilyn C. Cornelis, Frank B. Hu); Genetics of Complex Traits, Peninsula Medical School, Exeter, United Kingdom (Timothy M. Frayling); Institute of Epidemiology, German Research Center for Environmental Health, Neuherberg, Germany (Harald Grallert, Thomas Illig); Steno Diabetes Center, Gentofte, Denmark (Niels Grarup, Torben Hansen); Diabetes and Endocrinology Research Unit, Department of Clinical Sciences, Lund University, Malmö, Sweden (Leif Groop, Valeriya Lyssenko); Faculty of Health Sciences, University of Southern Denmark, Odense, Denmark (Torben Hansen); Molecular Genetics Research Group, Peninsula Medical School, Exeter, United Kingdom (Andrew T. Hattersley); HUNT Research Center, Faculty of Medicine, Norwegian University of Science and Technology, Trondheim, Norway (Kristian Hveem, Carl G. P. Platou); Department of Medicine, Levanger Hospital, The Nord-Trøndelag Health Trust, Levanger, Norway (Kristian Hveem, Carl G. P. Platou); Department of Medicine, University of Kuopio and Kuopio University Hospital, Kuopio, Finland (Johanna Kuusisto); MRC Epidemiology Unit, Institute of Metabolic Sciences, Addenbrooke's Hospital, Cambridge, United Kingdom (Markku Laakso, Nicholas J. Wareham, Claudia Langenberg); Oxford Centre for Diabetes, Endocrinology and Metabolism, Churchill Hospital, Oxford, United Kingdom (Mark I. McCarthy); Diabetes Research Centre, Biomedical Research Institute, University of Dundee, Dundee, United Kingdom (Andrew D. Morris); Pharmacogenetics Research Centre, Biomedical Research Institute, University of Dundee, Dundee, United Kingdom (Colin N. A. Palmer); and Center for Genetic Epidemiology and Modeling, Institute for Clinical Research and Health Policy Studies, Department of Medicine, School of Medicine, Tufts University, Boston, Massachusetts (John P. A. Ioannidis).
Support for this project was provided through the Tufts Clinical and Translational Science Institute under funding from the National Institutes of Health/National Center for Research Resources (grant UL1 RR025752). Dr. Eleftheria Zeggini was supported by the Wellcome Trust (grant WT088885/Z/09/Z).
Drs. Eleftheria Zeggini and John P. A. Ioannidis contributed equally to this article.
Points of view or opinions presented in this paper are those of the authors and do not necessarily represent the official position or policies of the Tufts Clinical and Translational Science Institute.
Conflict of interest: none declared.