For genomewide association studies with family-based designs, we propose a Bayesian approach. We show that standard TDT/FBAT statistics can naturally be implemented in a Bayesian framework. We construct a Bayes factor conditional on the offspring phenotype and parental genotype data and then use the data we conditioned on to inform the prior odds for each marker. In the construction of the prior odds, the evidence for association for each single marker is obtained at the population-level by estimating the genetic effect size in the conditional mean model. Since such genetic effect size estimates are statistically independent of the effect size estimation within the families, the actual data set can inform the construction of the prior odds without any statistical penalty. In contrast to Bayesian approaches that have recently been proposed for genomewide association studies, our approach does not require assumptions about the genetic effect size; this makes the proposed method entirely data-driven. The power of the approach was assessed through simulation. We then applied the approach to a genomewide association scan to search for associations between single nucleotide polymorphisms and body mass index in the Childhood Asthma Management Program data.
Genetic association studies can be dichotomized into those that have population-based designs and those that have family-based designs. Population-based studies test for association between a phenotype and a genotype in a sample of unrelated individuals. These studies are susceptible to population stratification, i.e., systematic differences in allele frequencies between subpopulations in a population not due to a causal association with the phenotype of interest. Family-based studies test for association between a phenotype and a genotype in a sample of related individuals by assessing whether individuals with a given phenotype have a higher transmission ratio than would be expected by chance given their parents' (or other family members') genotypes and Mendel's laws. It is this ability to compare a person's genotype to his expected genotype based on Mendel's laws that makes family-based tests robust to population stratification. Family-based studies can also assess the evidence of association by examining allele frequencies across families in a somewhat similar fashion to what is done in population-based studies (although this evidence is not inherently robust to population stratification).
The use of the two independent sources of information available in family data is what makes the family-based association test (FBAT) screening method proposed by Steen et al. (2005) so powerful. If we let x denote the offspring genotypes, y denote the offspring phenotypes, and P denote the parents' genotypes, the following relationship holds:

$$P(x, y, P) = P(x \mid y, P)\, P(y, P).$$
P(x|y, P) represents the information used in the FBAT statistic; that is, the FBAT statistic conditions on the offspring phenotypes, y, and parents' genotypes, P, and only the offspring genotypes, x, are considered random variables. P(y, P) represents the information used in the screening approach of Steen et al. (2005), which first determines which SNPs will have the highest power to be detected if a true association exists in the given study and then applies FBAT only to those SNPs. In determining which SNPs have the highest power, only the offspring phenotypes, y, and parental genotypes, P, are used. By applying the FBAT only to the SNPs for which a true association would most likely be detectable, the number of comparisons that must be adjusted for is reduced. The statistical validity of the screening method follows from the factorization above: the screening step and the test draw on separate factors of the joint distribution.
We propose a novel approach for family-based genetic association studies that also capitalizes on this relationship. However, instead of using one portion of the family data (i.e., x|y, P) to create a statistical test and the other independent portion (i.e., y, P) to screen SNPs, we take a Bayesian approach: we use one portion to calculate a Bayes factor and the other to calculate the prior odds of the null hypothesis. Our method capitalizes on both the between and within family information in a more intuitive and flexible manner than frequentist approaches. We construct a Bayes factor conditional on parental genotypes and offspring phenotypes and then use the information conditioned on to inform the prior odds of association for each marker. In this way we are able to combine evidence from the over-transmission of alleles within families with evidence based on the allele frequencies across the families into a posterior odds of no association.
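An intuitive measure of the evidence for genetic association is the posterior odds of no association, P(H0|data) / (1 − P(H0|data)), where H0 denotes the hypothesis that a given marker is not associated with a phenotype. By Bayes' theorem, the posterior odds of H0 can be written as

$$\frac{P(H_0 \mid \text{data})}{1 - P(H_0 \mid \text{data})} = \frac{P(\text{data} \mid H_0)}{P(\text{data} \mid \text{not } H_0)} \times \frac{P(H_0)}{1 - P(H_0)}.$$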
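When there are only two hypotheses under consideration, we can rewrite the above equation as

$$\frac{P(H_0 \mid \text{data})}{P(H_1 \mid \text{data})} = \frac{P(\text{data} \mid H_0)}{P(\text{data} \mid H_1)} \times \frac{P(H_0)}{P(H_1)},$$

that is, posterior odds = Bayes factor × prior odds.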
We assume throughout that the alternative hypothesis of interest, H1, is an additive genetic effect; however, the Bayes factor and posterior odds described below can easily be modified to test other hypotheses.
If we now consider family-based association studies with complete genotype data for trios (two parents and an offspring) and a continuous phenotype for the offspring, then the data consist of parental genotypes, P, offspring genotypes, x, and offspring phenotypes, y. In the derivation of our Bayes factor we borrow from the FBAT statistic the concept of conditioning on the parental genotypes, P, and offspring phenotypes, y, so that the formula for the posterior odds of H0 becomes

$$\frac{P(H_0 \mid x, y, P)}{P(H_1 \mid x, y, P)} = \frac{P(x \mid y, P, H_0)}{P(x \mid y, P, H_1)} \times \frac{P(H_0 \mid y, P)}{P(H_1 \mid y, P)}.$$
By conditioning on y and P when we assess the weight of evidence for over-transmission of alleles in the Bayes factor, we are able to use the between-family information available in y and P to gain further evidence for or against association in the conditional prior odds. Two main goals of genomewide association scans can be addressed by a measure of the posterior odds of H0, no association: ranking markers to decide which to follow up and determining which findings to report as noteworthy, or significant. In the following three subsections we discuss the conditional Bayes factor, the conditional prior odds, and how to use these measures to infer the noteworthiness of individual markers.
We assume our data consist of independent trios (two parents and an offspring in each family) and that H0 and H1 correspond to the hypotheses of no genetic effect and an additive genetic effect, respectively. We assume the ith individual's phenotype, yi, is independently distributed as yi ~ N(μ + axi, σ²), where μ is the overall mean of the quantitative trait and a is the genetic effect. Since we assume an additive genetic effect, we code the individual's genotype, xi, as the number of copies of an allele, i.e., 0, 1, or 2. The Bayes factor using the offspring genotypes, x, conditional on the offspring phenotypes, y, and parental genotypes, P, is

$$BF = \frac{\prod_i P(x_i \mid P_i)}{\iiint \prod_i \dfrac{P(y_i \mid x_i, a, \mu, \sigma)\, P(x_i \mid P_i)}{\sum_{x^*} P(y_i \mid x^*, a, \mu, \sigma)\, P(x^* \mid P_i)}\; P(a, \mu, \sigma \mid y, P)\; da\, d\mu\, d\sigma},$$
where the sum is over all possible offspring genotypes that could have occurred in family i given the parental genotypes, Pi. (See Appendix 6.1 for derivation.)
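A fully Bayesian approach requires specification of the conditional prior distribution, P(a, μ, σ | y, P). However, the multiple integration needed to calculate this Bayes factor may be computationally prohibitive. The integrals can be evaluated at the maximum likelihood estimates derived by regressing y on x, so that the Bayes factor is approximated as

$$BF \approx \frac{\prod_i P(x_i \mid P_i)}{\prod_i \dfrac{P(y_i \mid x_i, \hat{a}, \hat{\mu}, \hat{\sigma})\, P(x_i \mid P_i)}{\sum_{x^*} P(y_i \mid x^*, \hat{a}, \hat{\mu}, \hat{\sigma})\, P(x^* \mid P_i)}},$$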
where â, μ̂, and σ̂ are the maximum likelihood estimates.
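As an illustration only, the plug-in approximation could be computed along the following lines. This is a minimal sketch, not the original implementation (the analyses in this paper were run in R). It assumes a biallelic marker with genotypes coded 0, 1, or 2 and complete trios. It also assumes the per-family form that results once the factors P(xi|Pi) cancel, so that each family contributes Σ P(yi|x*, â, μ̂, σ̂) P(x*|Pi) / P(yi|xi, â, μ̂, σ̂), with the sum taken over the possible offspring genotypes x*. All function names are hypothetical.

```python
import math

def transmission_probs(father, mother):
    """P(x | P) under Mendel's laws; genotypes count copies of the coded allele."""
    pf = father / 2.0  # probability the father transmits the coded allele
    pm = mother / 2.0  # probability the mother transmits the coded allele
    return {0: (1 - pf) * (1 - pm),
            1: pf * (1 - pm) + (1 - pf) * pm,
            2: pf * pm}

def normal_pdf(y, mean, sd):
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

def ols_mle(x, y):
    """Maximum likelihood estimates (a, mu, sigma) from regressing y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    a = sxy / sxx
    mu = y_bar - a * x_bar
    # The ML estimate of sigma divides the residual sum of squares by n.
    sigma = math.sqrt(sum((yi - mu - a * xi) ** 2 for xi, yi in zip(x, y)) / n)
    return a, mu, sigma

def log_conditional_bf(trios, a, mu, sigma):
    """Log Bayes factor (H0 versus H1); trios holds (father, mother, x, y) tuples."""
    log_bf = 0.0
    for father, mother, x, y in trios:
        probs = transmission_probs(father, mother)
        # Sum over the offspring genotypes that the parents could have produced.
        mixture = sum(normal_pdf(y, mu + a * g, sigma) * p for g, p in probs.items())
        log_bf += math.log(mixture) - math.log(normal_pdf(y, mu + a * x, sigma))
    return log_bf
```

Working on the log scale avoids numerical underflow when the product runs over many trios. Families in which both parents are homozygous contribute zero to the log Bayes factor, as expected for a statistic that relies on within-family transmission.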
We also assessed the power of fully specifying the priors based on only the parents' genotypes and offspring phenotypes, but found this approach to be less powerful and more computationally intensive. In this case, we derived the conditional prior for a, μ and σ analytically by fitting the conditional mean model (Lange et al., 2003),

$$E(y_i) = \mu + a\, E(x_i \mid P_i),$$
and specifying an unconditional prior on a, μ and σ, e.g.,
In the conditional mean model, E(xi|Pi) is determined using Mendel's Laws. This allows us to gain additional evidence for or against association from y and P without using the true offspring genotypes, x. Since substituting in the maximum likelihood estimates for a, μ and σ was more powerful than this approach, we only discuss the former method in the remainder of the manuscript. In the next section we describe how we used the conditional mean model to estimate the conditional prior odds.
The conditional prior odds are used to weigh the evidence for association using only the offspring phenotypes, y, and the parental genotypes, P (i.e., the offspring genotypes, x, are not used).
To get an estimate of the genetic effect size, a, without using the observed offspring genotypes we use the conditional mean model (Lange et al., 2003):

$$E(y_i) = \mu + a\, E(x_i \mid P_i).$$
The conditional mean model uses Mendel's Laws to determine the expected offspring genotypes given only the parents' genotypes. This enables estimation of the genetic effect size, a, using only the parents' genotypes and the offspring phenotypes. We can then use the estimated effect size, ã, and its standard error, SE(ã), to summarize the evidence of association contained in y and P. One way to do this is to consider approximating the distribution of the genetic effect size, a, by a normal distribution, N(ã, SE(ã)²). Then we can think of P(a < 0) as a proxy for P(H0|y, P) and calculate the prior odds as

$$\frac{P(H_0 \mid y, P)}{P(H_1 \mid y, P)} \approx \frac{P(a < 0)}{1 - P(a < 0)}.$$
We cannot use P(a = 0) since this quantity is always equal to zero. Instead, we use the probability that a is less than or equal to zero if ã > 0, or the probability that a is greater than or equal to zero if ã < 0; this is equivalent to taking P(a < 0) when we assume that a is normally distributed with mean equal to the absolute value of ã. By taking the absolute value of ã, we are not penalized if the sign of the estimate is wrong due to population stratification. This is an admittedly ad hoc way to derive the prior odds, but it does utilize the information in y and P. Furthermore, multiplying the conditional Bayes factor by this prior odds yields greater power in simulation (for an individual test with threshold based on null simulations) than the Bayes factor alone (Supplementary Table VI).
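The calculation of the conditional prior odds can be sketched as follows. This is an illustrative example under stated assumptions rather than the code used for the analyses: the effect size and its standard error are taken from an ordinary least-squares fit of y on E(x|P) with n − 2 residual degrees of freedom, and the function name is hypothetical.

```python
import math
from statistics import NormalDist

def conditional_prior_odds(parents, y):
    """Prior odds of H0 from the conditional mean model.

    parents: list of (father, mother) genotypes coded 0, 1, or 2.
    y: offspring phenotypes, in the same family order.
    """
    # E(x | P): each parent transmits, on average, half of its allele count.
    ex = [(f + m) / 2.0 for f, m in parents]
    n = len(y)
    ex_bar = sum(ex) / n
    y_bar = sum(y) / n
    sxx = sum((e - ex_bar) ** 2 for e in ex)
    sxy = sum((e - ex_bar) * (v - y_bar) for e, v in zip(ex, y))
    a_tilde = sxy / sxx
    mu_tilde = y_bar - a_tilde * ex_bar
    rss = sum((v - mu_tilde - a_tilde * e) ** 2 for e, v in zip(ex, y))
    # Assumed here: the usual least-squares standard error of the slope.
    se = math.sqrt(rss / (n - 2) / sxx)
    # P(a < 0) when a ~ N(|a_tilde|, se^2); this never exceeds 0.5.
    p_null = NormalDist(mu=abs(a_tilde), sigma=se).cdf(0.0)
    return p_null / (1.0 - p_null)
```

Because the mean of the normal approximation is the absolute value of the estimate, the result is unchanged if every phenotype is negated, and the prior odds of H0 can never exceed one. Multiplying this quantity by the conditional Bayes factor gives the posterior odds of H0 for the marker.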
SNPs can easily be ranked according to the posterior odds. Unlike some recently proposed Bayesian methods (Wakefield, 2008a; Consortium, 2007), ranking SNPs by our Bayes factor alone will not yield the same rankings as the posterior odds. This is because the conditional prior odds are not constant for all markers: we use the parental genotypes and offspring phenotypes to inform the prior odds for each marker individually. With our method, it is more powerful to rank the SNPs by posterior odds since doing so uses more of the information available in the data.
One caveat to including the prior odds is the increased susceptibility to detecting association due to population stratification. The conditional Bayes factor may be affected by population stratification if the maximum likelihood estimates are substituted for a, μ, and σ, but it primarily weighs the evidence for association in the within-family information. (The transmission of alleles within a family is completely robust to population stratification.) On the other hand, the prior odds rely entirely on population differences in allele frequencies and are therefore as susceptible as any unadjusted population-based method. Because of this, we recommend that SNPs be followed up only if the posterior odds of the null hypothesis is less than the prior odds of the null hypothesis, i.e., only if the conditional Bayes factor is less than one.
Inferring which SNPs to report as noteworthy or significant is a more challenging matter than just ranking them. One issue that must be taken into account is that P(H0|y, P) ≤ 0.5, which is not likely to be a realistic reflection of one's true beliefs in a genomewide association scan; i.e., it is not likely that at least half of the markers will be associated with a given phenotype. Also, substituting the maximum likelihood estimates for the priors in the Bayes factor is likely to favor the alternative hypothesis, since the Bayes factor is evaluated at only the best estimate under the alternative. It is relatively straightforward to control the type one error for a single test: use the sample size, allele frequencies, and variance of the phenotype to simulate the data under the null hypothesis of no association and choose as the threshold the 95th percentile of the simulated posterior odds. Alternatively, one could permute the data to obtain an empirical p value. Bayesian decision theory has been suggested as a tool for controlling the false discovery rate when using Bayesian methods to assess many markers individually in genomewide association scans (Wakefield, 2008b). Wakefield showed that the expected posterior cost is minimized by calling a marker noteworthy when the posterior odds of H0 are less than the cost of a false nondiscovery, C_FND, divided by the cost of a false discovery, C_FD: i.e., when

$$\frac{P(H_0 \mid x, y, P)}{P(H_1 \mid x, y, P)} < \frac{C_{FND}}{C_{FD}}.$$
This approach can be easily implemented as long as one is careful to weight the conditional prior odds to truly reflect one's beliefs.
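A minimal sketch of this decision rule is given below. It combines the cost-ratio threshold with the recommendation above that a SNP be followed up only if its conditional Bayes factor is less than one; combining the two conditions in a single function is our illustrative choice, and the function and argument names are hypothetical.

```python
def is_noteworthy(bayes_factor, prior_odds, cost_fnd, cost_fd):
    """Report a marker when the posterior odds of H0 fall below cost_fnd / cost_fd.

    bayes_factor: conditional Bayes factor, P(x | y, P, H0) / P(x | y, P, H1).
    prior_odds: conditional prior odds of H0 given y and P.
    cost_fnd, cost_fd: costs of a false nondiscovery and of a false discovery.
    """
    posterior_odds = bayes_factor * prior_odds
    # Require within-family evidence (Bayes factor < 1) to guard against
    # signals that come only from population stratification in the prior odds.
    return bayes_factor < 1.0 and posterior_odds < cost_fnd / cost_fd
```

For example, if a false discovery is judged ten times as costly as a false nondiscovery, a marker is reported only when its posterior odds of H0 fall below 0.1.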
Simulations were performed to assess power. We simulated samples of 100, 500, or 1000 trios, each consisting of two parents and an offspring. All probands and parents were assumed to be accurately genotyped. Parental genotypes were drawn from a binomial distribution with the probability of success equal to the allele frequency. Offspring genotypes were generated based on Mendelian transmission. A continuous trait was generated with a standard normal distribution. Bayes factors were calculated for varying combinations of locus-specific heritability of the trait and allele frequency (0.05, 0.1 or 0.2). For each set of conditions, 10,000 iterations were run. All simulations were performed in R version 2.6.1.
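The data-generating steps described above can be sketched as follows. This is an illustrative Python version, not the R code used for the reported simulations. Under the null hypothesis (effect = 0) the trait is standard normal, as described. How the trait depends on genotype under the alternative is an assumption made here: an additive shift of `effect` per allele copy with unit residual variance. The function names are hypothetical.

```python
import random

def draw_parent_genotype(rng, allele_freq):
    """Binomial(2, allele_freq) draw: number of copies of the coded allele."""
    return sum(1 for _ in range(2) if rng.random() < allele_freq)

def transmit(rng, genotype):
    """Mendelian transmission: the coded allele is passed with probability genotype / 2."""
    return 1 if rng.random() < genotype / 2.0 else 0

def simulate_trios(n_trios, allele_freq, effect=0.0, seed=None):
    """Return a list of (father, mother, offspring genotype, offspring phenotype)."""
    rng = random.Random(seed)
    trios = []
    for _ in range(n_trios):
        father = draw_parent_genotype(rng, allele_freq)
        mother = draw_parent_genotype(rng, allele_freq)
        child = transmit(rng, father) + transmit(rng, mother)
        # Assumed alternative model: additive mean shift, unit residual variance.
        y = rng.gauss(effect * child, 1.0)
        trios.append((father, mother, child, y))
    return trios
```

Repeating the simulation with effect = 0 gives a null distribution of the statistic of interest, from which a threshold for a given allele frequency and sample size can be read off.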
Summary statistics for conditional Bayes factors and posterior odds (i.e., conditional Bayes factor times prior odds) simulated assuming no genetic effect are shown in Supplementary Tables II and III. The null simulations were used to determine a threshold for each allele frequency that preserves a type one error rate of 0.05. Thresholds were calculated as the 95th percentile of the null distribution. We examined the distributions of Bayes factors and posterior odds generated under varying allele frequencies and heritabilities and used the threshold values to determine power (Supplementary Tables IV and V). Both the conditional Bayes factor alone and the resulting posterior odds were more powerful than the FBAT statistic (Figure 1 and Supplementary Table VI). Figure 1 also shows the power when maximum likelihood estimates substituted into the Bayes factor are obtained by regressing y on x (as opposed to the conditional mean model in which y is regressed on E(x|P)). This did not yield as great a power as using the conditional mean model maximum likelihood estimates. As one would expect, power increased as the amount of heritability explained by the SNP of interest increased. The increase in power was observed across varying heritabilities and allele frequencies. While in some scenarios the improvement was substantial, in others it was slight.
We applied these methods to the Childhood Asthma Management Program dataset (Childhood Asthma Management Program Research Group, 1999). We assessed the evidence for association between roughly 500,000 SNPs and baseline body mass index z-scores (BMIZ) (Kuczmarski et al., 2000). The analysis included 389 genotyped trios (two parents, one offspring). All SNPs were individually tested for association. We assumed an additive genetic model. No further adjustments were used since neither age nor sex was significantly associated with BMIZ. Empirical p values for the posterior odds were obtained by permuting the phenotypes repeatedly. Table I shows the twenty SNPs with Bayes factor less than one that had the highest posterior odds of the alternative hypothesis. Since the Bayes factor is

$$BF = \frac{P(x \mid y, P, H_0)}{P(x \mid y, P, H_1)},$$
a Bayes factor of less than one indicates P(data|H0) < P(data|H1). By only considering the posterior odds for SNPs that have a Bayes factor of less than one, we avoid calling SNPs significant when there is no evidence of association within families. A low posterior odds of no association with a Bayes factor greater than one is likely to result from population stratification.
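The empirical p values obtained by permuting the phenotypes could be computed along the following lines. This is a generic sketch under stated assumptions: `statistic` is a hypothetical user-supplied function that recomputes the posterior odds of H0 from a permuted phenotype vector (with genotypes held fixed), smaller values are taken as stronger evidence against H0, and the add-one correction in the final line is our choice rather than a detail taken from the analysis above.

```python
import random

def permutation_p_value(y, statistic, n_perm=1000, seed=None):
    """Empirical p value for a statistic in which smaller means more evidence against H0."""
    rng = random.Random(seed)
    observed = statistic(y)
    shuffled = list(y)  # work on a copy so the caller's data are untouched
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if statistic(shuffled) <= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)
```

With roughly 500,000 SNPs, the number of permutations needed to resolve very small p values is large, so in practice one might restrict permutation to the top-ranked markers.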
The use of Bayes factors has been advocated as a powerful way to gauge the evidence for genetic association in population-based genomewide association scans (Wakefield, 2008a; Consortium, 2007). Bayes factors are suitable for both ranking SNPs and for calibrating inferences to decide which SNPs to consider noteworthy (Wakefield, 2008b). Wakefield (2008a) proposed an asymptotic Bayes factor approach for population-based studies that does not require integration and has suggested possible priors that can be used in conjunction with it. As Wakefield mentions, there are two aspects of Bayes factors that have prevented their more widespread use: the multidimensional integrals required to calculate them can be computationally unattractive for large genomewide association scans and specification of priors leads to much controversy. Our approach makes no assumptions about the distribution of alleles or the genetic effect size. Furthermore, our method circumvents both the need to compute multidimensional integrals and the need to subjectively specify a prior.
The Bayes factor approach to family-based association testing presented here is a flexible and intuitive way to utilize both the within- and between-family information. Because the Bayes factor conditions on the parental genotypes and offspring phenotypes, we can use this portion of the data to inform the prior odds. The Bayes factor uses within family information to weigh the evidence for association at a given SNP. A main advantage of using a Bayesian framework is the flexibility it provides. The likelihood can be chosen to model any number of study designs. Additional information regarding individual SNPs can easily be incorporated into the prior as well. Furthermore, this approach does not require distributional assumptions about allele frequencies.
There are many ways this method can be extended and further characterized. The Bayes factor approach could be adapted for use with extended pedigrees, missing data, or gene-environment interactions. Biological information pertaining to individual markers could easily be used to weight the prior odds. To potentially increase the power of our novel method, information from multiple SNPs could be incorporated, or one might consider doing a joint model of all SNPs to eliminate multiple comparisons.
We especially thank Chris Paciorek for his assistance with this project. This work has been supported by the National Institutes of Health through grants T32 MH17119 and T32 HL07427-26. We thank all subjects for their ongoing participation in the CAMP study. We acknowledge the CAMP investigators and research team, supported by NHLBI, for collecting CAMP Genetic Ancillary Study data. All work on data collected from the CAMP Genetic Ancillary Study was conducted at the Channing Laboratory of the Brigham and Women's Hospital under appropriate CAMP policies and human subject's protections. CAMP is supported by U01 HL075419, U01 HL65899, P01 HL083069, and T32 HL07427 from the National Heart, Lung and Blood Institute, National Institutes of Health.
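The Bayes factor for the offspring genotypes, x, conditional on the offspring phenotypes, y, and parental genotypes, P, is defined as

$$BF = \frac{P(x \mid y, P, H_0)}{P(x \mid y, P, H_1)},$$

where H0 and H1 correspond to no genetic effect and an additive genetic effect, respectively. We assume the phenotypes, y, are independently distributed as yi ~ N(μ + axi, σ²). Integrating out the genetic effect, a, yields

$$BF = \frac{P(x \mid y, P, H_0)}{\int P(x \mid y, P, a, H_1)\, P(a \mid y, P, H_1)\, da}.$$

Henceforth it is assumed that we are conditioning on H1 and H0 and we drop the notation accordingly. Like the genetic effect size, a, the mean and variance of y are unknown, so we integrate over these parameters as well:

$$BF = \frac{P(x \mid P)}{\iiint P(x \mid y, P, a, \mu, \sigma)\, P(a, \mu, \sigma \mid y, P)\; da\, d\mu\, d\sigma},$$

where the numerator reduces to P(x | P) because offspring genotype does not depend on the phenotype when the effect size is zero. Applying the definition of conditional probability,

$$P(x \mid y, P, a, \mu, \sigma) = \frac{P(x, y \mid P, a, \mu, \sigma)}{P(y \mid P, a, \mu, \sigma)}.$$

Since the trios are assumed to be independent, it follows that

$$P(x \mid y, P, a, \mu, \sigma) = \prod_i \frac{P(x_i, y_i \mid P_i, a, \mu, \sigma)}{\sum_{x^*} P(x^*, y_i \mid P_i, a, \mu, \sigma)},$$

where the sum is over all the possible offspring genotypes that could have occurred in family i given the parental genotypes, Pi. Then,

$$P(x_i, y_i \mid P_i, a, \mu, \sigma) = P(y_i \mid x_i, P_i, a, \mu, \sigma)\, P(x_i \mid P_i, a, \mu, \sigma).$$

In the absence of phenotype information, y, offspring genotypes depend only on parental genotypes, P. Thus,

$$P(x_i \mid P_i, a, \mu, \sigma) = P(x_i \mid P_i).$$

Since we assume the offspring phenotypes, y, depend on the parental genotypes, P, only through the offspring genotypes, x, it follows that

$$P(y_i \mid x_i, P_i, a, \mu, \sigma) = P(y_i \mid x_i, a, \mu, \sigma),$$

so that, with P(x | P) = ∏ P(xi | Pi) for independent trios, the Bayes factor becomes

$$BF = \frac{\prod_i P(x_i \mid P_i)}{\iiint \prod_i \dfrac{P(y_i \mid x_i, a, \mu, \sigma)\, P(x_i \mid P_i)}{\sum_{x^*} P(y_i \mid x^*, a, \mu, \sigma)\, P(x^* \mid P_i)}\; P(a, \mu, \sigma \mid y, P)\; da\, d\mu\, d\sigma}.$$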