We have described a fully Bayesian mixture model for incorporating previous knowledge into genetic association studies, and have outlined its use with previous linkage information in particular. Because genetic effect sizes for complex diseases are relatively small, using all available information is crucial to untangling the genetic architecture of complex disease.
As with any analysis, care is needed in deciding whether to include information from previous studies, especially association studies: the two study populations need to be from similar ethnic backgrounds and have similar phenotype definitions. If the studies are not comparable in these ways, the applicability of the previous knowledge will be compromised. Likewise, if the study providing the previous knowledge is underpowered, this knowledge may be weak, non-informative, or incorrect, and may dilute or diminish a “true” signal. The strength of the previous information can be specified through the T parameter in the Dirichlet distribution. If the previous study is well powered, a larger value of T can be set, whereas for a small previous study, the value of T would be correspondingly smaller.
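The role of T can be illustrated with a minimal sketch. It assumes, for illustration only, that the Dirichlet prior is parameterized as Dirichlet(T × p), where p is a vector of prior mixture probabilities; the exact parameterization used in the model may differ, and the function name and numeric values below are hypothetical. Under this assumption the prior mean stays at p regardless of T, while the prior variance shrinks as T grows, so a larger T expresses more confidence in the previous study.

```python
def dirichlet_prior_summary(prior_probs, T):
    """Mean and variance of each component under Dirichlet(T * prior_probs).

    ASSUMPTION: the concentration vector is T * prior_probs. This is an
    illustrative parameterization, not necessarily the one used in the paper.
    """
    if T <= 0:
        raise ValueError("T must be positive")
    if abs(sum(prior_probs) - 1.0) > 1e-9:
        raise ValueError("prior_probs must sum to 1")

    alphas = [T * p for p in prior_probs]
    alpha0 = sum(alphas)  # total concentration, equal to T here

    # Standard Dirichlet moments.
    means = [a / alpha0 for a in alphas]
    variances = [a * (alpha0 - a) / (alpha0 ** 2 * (alpha0 + 1)) for a in alphas]
    return means, variances


# Hypothetical prior probabilities for (null, associated) mixture components.
prior_probs = [0.8, 0.2]

for T in (2, 50):  # small vs. well-powered previous study (illustrative values)
    means, variances = dirichlet_prior_summary(prior_probs, T)
    print(f"T={T}: means={means}, variances={variances}")
```

In this sketch the two values of T give the same prior means but very different prior variances, which is one way to read T as a measure of how much weight the previous study receives.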
This Bayesian model can be extended to incorporate previous knowledge from genetic association studies (from collaborations or dbGaP) or other biologic information (e.g., known functional polymorphism) to inform current association studies. When results from previous genetic association studies are available, a prior distribution based on the effect size, as opposed to the p-value, is recommended. The method is also flexible, in that a quantitative trait or phenotype could be modeled with a Gaussian distribution (as opposed to the Bernoulli distribution for a binary trait). In addition to flexibility regarding phenotype, the method can be varied by using different mixture distributions; for example, using gamma rather than normal distributions [Lewin et al. 2007] or modification of hyper-priors.
The association study that we used to demonstrate the mixture model is limited in its power, since only 40 subjects were included. Well-powered, case-control genome-wide association studies of colorectal cancer within the Colon CFR and other study populations are underway. Analysis of these GWAS data, incorporating information from an appropriately powered linkage study, is highly relevant and will serve as a critical next step.
To incorporate previous knowledge into an analysis, a Bayesian approach is the natural choice. In this approach, the model for the data (i.e., likelihood) and the model for the previous information (i.e., prior distribution) together determine the posterior distribution, from which statistical inferences are drawn. An informative prior will clearly influence the inferences made from the resulting posterior distribution. However, a Bayesian framework also makes it possible to comprehensively assess the impact of the prior specification on the posterior inferences (i.e., sensitivity analysis). In essence, a Bayesian analysis incorporating previous knowledge can be thought of as a “pooled” analysis in which the prior distribution acts like the “data” from a previous study.
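The “pooled” interpretation and the idea of a sensitivity analysis can be made concrete with a deliberately simplified sketch. It uses a conjugate Beta-Binomial model rather than the mixture model described in this paper, and all counts, weights, and function names are hypothetical. When the Beta prior is built directly from the previous study's counts, the posterior mean equals the pooled proportion; down-weighting the prior shows how sensitive the posterior is to the prior specification.

```python
def beta_binomial_posterior(prior_a, prior_b, successes, trials):
    """Posterior Beta(a, b) parameters after observing binomial data.

    A Beta(prior_a, prior_b) prior behaves like prior_a previously observed
    successes and prior_b previously observed failures.
    """
    return prior_a + successes, prior_b + (trials - successes)


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Hypothetical counts: previous study and current study.
prev_successes, prev_trials = 30, 100
cur_successes, cur_trials = 20, 50

# Prior built directly from the previous study's counts.
post_a, post_b = beta_binomial_posterior(
    prev_successes, prev_trials - prev_successes, cur_successes, cur_trials
)
posterior_mean = beta_mean(post_a, post_b)
pooled_proportion = (prev_successes + cur_successes) / (prev_trials + cur_trials)

# Simple sensitivity analysis: scale the prior "sample size" by a weight w.
sensitivity = {}
for w in (0.1, 0.5, 1.0):
    a, b = beta_binomial_posterior(
        w * prev_successes, w * (prev_trials - prev_successes),
        cur_successes, cur_trials,
    )
    sensitivity[w] = beta_mean(a, b)

print(posterior_mean, pooled_proportion, sensitivity)
```

As the weight on the prior increases, the posterior mean in this sketch moves from the current study's proportion toward the previous study's proportion, which is the behavior a sensitivity analysis is meant to expose.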
In conclusion, understanding the complex relationship between genetic variation and complex disease is at the heart of “personalized medicine”. To increase our knowledge about the etiology of complex diseases, scientific investigation needs to be sequential, with knowledge gained from each step of the discovery process carried forward into the subsequent phases of study. By including the wealth of knowledge that already exists for many complex diseases, we may increase our chances of unraveling the complex relationships among the human genome, environment, and complex disease. The proposed Bayesian model is one tool available to researchers to aid in reaching the goal of “personalized medicine”.