|Home | About | Journals | Submit | Contact Us | Français|
The possible evidence for association comprises three types of information: differences between cases and controls in allele frequencies, in parameters for Hardy Weinberg disequilibrium (HWD), and in parameters for linkage disequilibrium (LD). LD between marker and disease alleles results in a difference in at least one of the three types of parameters [Won and Elston, 2008]. However, the parameters for LD require knowledge about phase, which is usually unknown, making the LD contrast test without modification infeasible in practice. Methods for handling phase uncertainty are: (1) the most probable haplotype pair for each individual can be considered as the true phase; (2) a weighted average of haplotypes can be used; (3) we can consider the composite LD, which does not require any information about phase. We compare these methods to handle phase uncertainty in terms of validity and efficiency, and the effect on them of HWD in the population, at the same time confirming results for the three types of information. When the LD between markers is high, the LD contrast test that uses a weighted average of haplotypes or the most probable haplotypes to calculate the LD is recommended, but otherwise the LD contrast test that uses the composite LD is recommended. We conclude that, even though the difference in allele frequencies is usually the most informative test except in the case of a recessive disease, the LD contrast test can be more powerful if the markers are dense enough.
The goal of population association studies is to identify genetic polymorphisms that distinguish among individuals with different disease status, and it has been shown that an association between marker and disease alleles can be detected as a result of gametic phase disequilibrium for linked loci, which is what we shall mean here by linkage disequilibrium (LD). Thus, today LD can play a key role in genetic epidemiology and the advent of SNPs has enabled association studies at the genome-wide level. However, notwithstanding their efficiency, the huge number of SNPs has generated the problem of multiple testing and statistical methods to combat this issue have been investigated. SNP reduction, sometimes called self-replication [Van Steen et al., 2005; Zheng et al., 2007] is a possible solution to this problem when we have two independent statistics that can be used to test the same biological hypothesis. For example, suppose we have 500,000 equally-spaced SNPs in a genome-wide association (GWA) study, i.e. SNPs about 6Kb apart, then correction for multiple tests would require a Bonferroni-corrected p-value of 0.05/500,000 for genome-wide significance at the 0.05 level, which is difficult to meet in practice. However, suppose we have two independent statistics (t1 and t2) and their p-values are u1 and u2. Then, if 1% of the SNPs are selected by screening based on either of the two independent statistics, and only the selected SNPs are used for a testing phase with the other statistic, we now only need a Bonferroni-corrected p-value of 0.05/5,000 at this testing phase to have genome-wide significance at the 0.05 level. This self-replication can be reformulated in terms of combining p-values. The rejection region for t1 (or t2), in this example of self-replication is asymptotically u1<0.05/5,000 with u2<0.01 (or u2<0.05/5,000 with u1<0.01). We can also use Fisher's method or Liptak's method [Liptak, 1958] for combining p-values (see Figure 1 for their rejection regions). As a result, better approaches than self-replication can be attained as long as we use the optimal method for combining u1 and u2; and we have shown that the most powerful method can be obtained under certain circumstances even though no uniformly most powerful test is possible [Won et al., 2008].
It has been shown that LD between marker and disease alleles can result in differences of the following parameters between cases and controls: allele frequencies, the parameters for Hardy-Weinberg disequilibrium (HWD) and the parameters for LD between markers [Won and Elston, 2008]. The fact that the phase-known genotype frequencies can be parameterized in terms of these three parameters shows they can each be used for association studies. In principle, there is evidence for association if any one of the parameters is different between cases and controls provided certain conditions - such as no genotyping error – hold, and the three types of information can have different levels of power according to the disease mode of inheritance and LD structure. However, even though these results provide intuition about an optimal strategy for analysis, their application to a practical situation is restricted because the usual LD contrast test requires information about phase, which is often unknown. Three ways of overcoming the phase uncertainty have been considered. First, the most probable haplotype pair for each individual can be considered as the true phase. Second, a weighted average of haplotypes can be used. Third, we can use the composite LD [Wang et al., 2007; Zaykin, 2004; Zaykin et al., 2006], which does not require any information about phase. The validity and efficiency of these three approaches have not yet been investigated. We therefore compare these approaches to handle phase uncertainty in terms of their type I error and power according to disease mode of inheritance and SNP density. At the same time, we investigate the effect of HWD in the population on the analysis. Our results show that, whereas the composite LD always preserves the type I error with reduced power, the methods that use the most probable haplotype or weighted haplotypes for each individual can have better power with type I error preserved provided dense markers are used. We further show that, when the markers are dense, the LD contrast test can be more powerful than a test based on allele frequency differences.
Let Xij and Yij be random variables for marker loci A and B defined by
where j = 1 (2) indicates a maternal (paternal) haplotype. For given marker loci A and B, let pAk, pAkAk′, pAkBk, pAkBkAk′Bk′, and pAk,Bk′ be respectively the frequencies of allele Ak , genotype AkAk′, haplotype AkBk, phase-unknown genotype AkBkAk'Bk′, phase-known genotype AkBk/Ak'Bk′, and the joint frequencies of alleles Ak and Bk′ of two different gametes in the population, where, for the SNPs that are used, there are usually only two alleles, A1 (B1) and A2 (B2). Also, for locus A, let pAk|case and pAk|control be the frequencies of allele Ak in the case group and control group, respectively, and let pAkAk′|case(pAkAk′|control) and pAkBk′|case(pAkBk′|control) be the analogous frequencies of the genotypes AkAk′ and joint frequencies of alleles Ak and Bk′ in two different gametes. As measures of HWD, let dA pA1A1 −(pA1)2, dA|case pA1A1|case −(pA1|case)2 and dA|control pA1A1|control −(pA1|control)2for the population, cases and controls, respectively. For haplotype level HWD, we define the following:
Then and .
To parameterize LD, let ΔA1B1 pA1B1 – pA1B1, ΔA1B1|case pA1B1|case – pA1|casepB1|case and ΔA1B1|control pA1B1|control – pA1|controlpB1|control be the corresponding parameters between the two loci. The composite LD parameters, which do not require any information about phase, are respectively defined as , and in the population, cases and controls, respectively. Finally, it should be kept in mind that this notation is also applied to a disease locus (D) by correspondingly replacing the notation for the locus and alleles.
When there are markers around a disease locus D, where D1 (D2) denotes a disease (normal) allele, association can be detected from the LD between the marker and the disease alleles, and we have shown that the information for association in case-control studies consists of three different parts [Won and Elston, 2008]:
However, haplotypes are usually unknown and only the genotype at each locus is observed. Three possible ways that handle phase uncertainty have been used. First, the most probable haplotype pair for each individual can be estimated and considered as the true phase. Second, a weighted average of haplotypes can be used. Third, we can use the composite LD [Wang et al., 2007; Zaykin, 2004; Zaykin et al., 2006], which does not require any information about phase. We denote these methods as follows:
Also, as has been known for some time, if the variance for ILD is used in place of the true variances of the three corresponding statistics, the type I error can be either inflated or deflated; this will also be seen in our results. Thus, the variances of the statistics corresponding to , and need to be derived.
First, we consider the most probable haplotypes and weighted averages of haplotypes obtained via haplotype frequency estimation. If the haplotype frequencies are known, then, letting w = pA1B1pA1B1/(pA1B1pA1B1 + pA1B2pA2B1), the most probable haplotypes and the weighted haplotypes for loci A and B result in the following for the phase-unknown double heterozygote genotype A1B2A2B1:
Thus, if we let N and NAkBkAk′Bk′ be the sample size and the number of individuals with genotype AkBkAk′Bk′ in either cases or controls, the estimated parameters for LD and can be denoted
By defining if W = I(w > 1/2) for the indicator function I, and if W = w, we can subsume both estimates in a general way, using W, as follows:
To derive the variance and covariances with the estimators for Ia and IHWD, we parameterize in terms of the random variables defined above, as follows:
because and we obtain the results under Hardy-Weinberg equilibrium (HWE) using the delta method in a way analogous to that in Won and Elston [Won and Elston, 2008]:
Next, the composite LD can be estimated as follows [Weir, 1996]:
To derive its variance and relevant covariances, we parameterize it in terms of the random variables defined above as follows:
With these results, we find that the expectations, variances [Weir, 1996] and covariances under HWE in either cases or controls are as follows:
(see the Appendix for HWD). These show that the expectation and covariances of , but not its variance, are approximately equal to those of under HWE. Also, the equivalence of to ΔA1B1 under HWE guarantees that
With these results, we can consider the following three LD contrast test statistics for association analysis in case-control studies:
To elucidate the relative efficiency of the parameters for LD, the information in , and needs to be quantified and compared. If we assume HWE in the population and respectively let ϕ and ϕl be the disease prevalence and the penetrance of genotype l, the parameters for Ia, IHWD and ILD in cases and controls are as follows [Nielsen et al., 1998; Won and Elston, 2008]:
The quantification of and can be generalized using :
and can also be quantified analogously.
The quantification of requires the joint frequencies of alleles A1 and B1 in two different gametes because is defined as . Under HWE in the population we have:
where the derivation is based on the fact that . Thus, under HWE the composite LD between markers in cases is
Analogously, for controls, we have:
The cystic fibrosis transmembrane conductance regulator (CFTR) is located across 200 Kb in region q31.2 on the long arm of chromosome 7. From the HapMap database, we downloaded the marker data of the CFTR haplotype in the CEU (CEPH-Utah resident) sample. We randomly selected, to generate separate genotypes for each of 4,000 persons (2,000 representing cases and 2,000 representing controls), 11 SNPs whose minor allele frequencies are larger than 0.1 and used the SNP located in the middle (with allele frequency ranging between 0.1 and 0.5) as a causal SNP for the 2,000 cases, the others being used as markers. To study empirical type I error, we assume homozygous and heterozygote disease genotype relative risks λ2 and λ1 equal to 1, and randomly generated a binary trait. To study empirical power, we assumed λ2=1.4, λ1 being determined by the disease mode of inheritance: λ1 = 1 for a recessive disease, λ1 = λ2 for a dominant disease, for an additive disease and for a multiplicative disease. We considered two marker densities: (i) sparse markers, the consecutive markers being separated by about 30 Kb, and (ii) dense markers, separated by about 5Kb. For each situation, the empirical power and size was calculated from 1,000 replicates of 2,000 cases and 2,000 controls. Thus, in all, 4,000× 1,000 = 4,000,000 separate genotypes were generated.
In these simulations, the statistics, Sa and SHW were calculated at the marker closest to the causal SNP for the empirical power. SLD requires information about phase, and so the haplotypes for each individual were first estimated from the 10 markers with the default options of the program DECIPHER in S.A.G.E.1 and then the most probable pair of haplotypes assumed to be the true haplotypes. Table 1 shows that for sparse markers, whereas the type I errors for Sa and SHWD are little inflated or deflated, SLD can be too conservative. Also, it shows that Sa is the most powerful statistic and the others are not very informative. Contrary to the results for sparse markers, Table 2 shows that, for dense markers, not only do all three statistics preserve type I error relatively well (though is a little liberal), but also SLD yields the most informative test. Thus, when markers are not dense enough, ILD can be almost uninformative, and assigning the most probable haplotypes as true haplotypes results in inflated (or sometimes deflated) type I error. If the markers are dense, on the other hand, we can assume the most probable haplotypes are the true haplotypes and ILD can be more informative than Ia. It has in fact been shown that the power of SLD depends on high multi-locus LD between marker and disease alleles [Won and Elston, 2008], which explains why ILD can be more informative than Ia when the markers are dense.
From the previous simulations, we confirmed that for sparse markers the type I error of SLD can be inflated if the most probable haplotypes are assumed to be true haplotypes. We now examine with simulations both the validity and efficiency of the three LD contrast tests modified for phase uncertainty: and . In particular the two-marker haplotype frequencies for and are estimated by the EM algorithm [Excoffier and Slatkin, 1995]. The disease prevalence and the disease allele frequency are assumed to be 0.15 and 0.1, respectively. We used λ2 =1.4, λ1 being determined by the disease mode of inheritance. For the empirical type I error, we assume LE between marker and disease alleles and, for the empirical power, we considered three different LD structures: (1) ΔA1D1 = 0.076, ΔD1B1 = 0.023, ΔA1B1 = 0.06 and ΔA1D1B1 = 0.013; (2) ΔA1D1 = 0.076, ΔD1B1 = 0.047, ΔA1B1 = 0.06 and ΔA1D1B1 = 0.03; and (3) ΔA1D1 = 0.076, ΔD1B1 = 0.068, ΔA1B1 = 0.06 and ΔA1D1B1 = 0.049. These values represent three increasing levels of three-locus LD.
Figures Figures22 and and33 show the empirical type I error from 20,000 replicates of 10,000 cases and 10,000 controls as a function of LD between markers when pA1 and pB1 are 0.2 and 0.3, respectively. For w, we considered two different situations: (i) the haplotype frequencies in cases and controls are the same, so that the weights, w, are equal for cases and controls, and (ii) the haplotype frequencies in cases and controls are different. The results show that estimation of the weight, w, can result in preserved type I errors for and under case (i) but results in inflated type I error under case (ii), while always preserves the type I error well. However, when the absolute values of Lewontin's D′ [Lewontin, 1964] between markers are high, in both situations the inflation of type I error is negligible. Thus we conclude that type I error is preserved well as long as the same weights are used for cases and controls.
Figures Figures4,4, ,55 and and66 show the average empirical power from 20,000 replicates for three different LD structures as a function of sample size; for w in and , we assume that the haplotype frequencies are the same in cases and controls. The sample sizes for cases and controls are equal. From the results, we can confirm that the LD contrast test ( and ) can be more powerful than Sa with LD structure (3), because there is high three-locus LD between marker and disease alleles. Also, as was shown before, is better than and for a recessive disease. Otherwise, loses power because its variance is approximately twice as large as the variance of or .
Figure 7 shows, for pA1 and pB1 = 0.2, the analytical correlations of A1 and A with , and when the haplotype frequencies are known (panels a and b), and the empirical correlations of Sa and SHWD with and (panels c and d) obtained from 20,000 replicates of 10,000 cases and 10,000 controls under LE between marker and disease alleles. Overall, panels a and b are similar to panels c and d, respectively, even though haplotype frequencies are assumed to be known for the former and not for the latter. Figure 7 panels c and d also show that, for D′ > 0, Sa is least correlated with, and most with , while SHWD is least correlated with and most with . In particular, the correlations of SHWD with the parameters for LD are relatively small because HWE is assumed in the population. As a result, either or may then be appropriate for self-replication as long as they have comparable informativity.
The above simulations confirmed that the methods for handling phase uncertainty preserve the type I error in certain situations. We now examine quantification of the standardized expected differences between cases and controls in order to analytically compare their efficiencies. If we let SED be the standardized expected difference for a single case and a single control, defined as
and if we let Φ(·) and Z1-α/2 be respectively the cumulative standard normal distribution and the 1-α/2 quantile of the standard normal distribution, the power for N cases and N controls is equal to . The required sample size for given power and significance level can be plotted as a function of the SED [Won and Elston, 2008]. In Figures Figures88 and and9,9, we plot SEDs as a function of LD for various modes of inheritance. Again, the disease allele frequency was assumed to be 0.1 and we used λ2 =1.4, λ1 being determined by the disease mode of inheritance. We assume the disease prevalence is 0.15 and the minor allele frequencies for the markers A and B are 0.2. Figure 8 shows the SED as a function of ΔA1D1, when ΔA1D1B1 = 0.25ΔA1D1 and ΔB1D1 = 0.2ΔA1D1. These results show that is less powerful than SLD except in the case of a recessive disease. The expectation of the statistic for is slightly larger than that of ILD , but the variance is much larger; if ΔA1B1 is small, the variance of the composite LD estimator is approximately twice as large as the variance of the LD estimator based on phase-known data. When the genetic effects are large, the variance of is much larger than that of ILD, while the expected differences for and ILD are similar, which explains the higher SED of ILD. In addition, it can be seen in Figure 8 how the disease mode of inheritance affects the SEDs of the three statistics. For a recessive model, the estimators for Ia and IHWD are relatively better, whereas the estimators for Ia and ILD ( and ) are better for the other disease models (see Won and Elston15 for further details). Figure 9 shows the SED as a function of ΔA1B1, when ΔA1D1 = 0.076, ΔD1B1 = 0.068 and ΔA1D1B1 = 0.049. As expected, the results show that the SED of is largest for a recessive disease and the SED of ILD is largest for the other disease models. The SEDs of ILD and are inversely proportional to the magnitude of ΔA1B1 because the variances increase with ΔA1B1, though the expected difference does not depend on ΔA1B1. In addition, the SEDs for and are similar to that for ILD when ΔA1B1 is high, while the SEDs of and are almost negligible when ΔA1B1 is negative. As a result, we can conclude that is generally valid and efficient but, when ΔA1B1 is positive and the disease mode of inheritance is not recessive, either or is better when we have dense markers.
Table 3 shows the empirical type I error according to whether there is HWE or HWD in the population. pA1, pB1 and pD1 are assumed to be 0.2, 0.2 and 0.1, respectively, λ2 is 1.4 and λ1 is calculated according to the disease mode of inheritance. The disease prevalence is 0.15 and LE is assumed between marker and disease alleles. In all cases we assumed the parameter for haplotype-level HWD for the three loci (A, B, and D) is 6.4×10−5, which is the maximal amount possible in this situation. Two types of variances for Sa, SHWD and are considered: (i) calculated allowing for HWD, and (ii) calculated assuming HWE. Table 3, based on 20,000 replicates each with 10,000 cases and 10,000 controls, shows that while Sa, SHWD and preserve the type I errors at the significance levels 0.05 and 0.01 for all disease modes of inheritance in situation (i), the empirical type I errors are inflated in situation (ii). Also, it appears that the disease mode of inheritance and the type of statistic affect the level of inflation. Sa is inflated only for a dominant disease, while SHWD and are inflated for all disease modes of inheritance. In particular, is the most sensitive to HWD, which indicates that HWE in the population should be confirmed a priori for the LD contrast test.
Haplotype-level HWD under LE between marker and disease alleles can result in differences between cases and controls for three types of parameters. For example, the allele frequencies for cases and controls are
These results indicate that the allele frequencies can be different between cases and controls under the null hypothesis ΔA1D1 =0, if there is haplotype-level HWD in the population. However, Table 3 shows that type I error is preserved well and its effect is negligible for most situations as long as we use the variances that allow for HWD.
In addition to confirming previous results [Won and Elston, 2008], we have shown that, if markers are dense enough, the LD contrast test can be more informative than Sa. Also, we extended SLD to , and , the last of which does not require any information about phase. With this extension, we made all the types of information feasible for testing association in a case-control study. and need estimates of the haplotype frequencies and our simulation shows that haplotype frequency estimation does not substantially inflate the type I error as long as the same weights are used for cases and controls. Otherwise, it greatly inflates type I error when the absolute LD between marker alleles is small, while type I error is preserved for high LD between markers. Our results also show that the variance of is approximately twice as large as the variance of ILD.
The decomposition of the parameters available for a case-control study enable self-replication and the increase in power resulting from self-replication can be understood in terms of methods for combining p-values. The most powerful method has been derived for when the SEDs are known and this method can be extended to statistics that are correlated16. Also, it should be remembered that among the possible choices for phase uncertainty is the least appropriate when the markers are not dense because it is then the most correlated with the statistics that use the other types of information (see Figure 7). If the disease mode of inheritance is known, in the case of a recessive disease we suggest combining the p-values from Sa, SHWD and . For a dominant disease, we should combine the p-values from Sa, SHWD and (or ) when markers are dense, and the p-values from Sa, SHWD and when the markers are not dense. For multiplicative and additive disease modes of inheritance, we should consider Sa and (or ) when the markers are dense, and Sa and when the markers are not dense. If the disease mode of inheritance is unknown, we should first decide on the mode of inheritance using SHWD [Wittke-Thompson et al., 2005; Zheng and Ng, 2008], and then combine Sa and (or ), or Sa and according to the disease mode of inheritance and marker density (Figure 10 shows the SED as a function of λ1).
However, in spite of the efficiency and validity of the proposed method, there are still problems that need attention. First, the suggested strategy that determines the disease mode of inheritance with SHWD does not use SHWD to indicate association per se. As has been shown, the most powerful method is attainable only when the SEDs of the statistics are known; the SED for IHWD needs information about the disease mode of inheritance. The new method of determining the disease mode of inheritance with type I error preserved requires further investigation. Second, we need to have estimates of the covariances to combine the three types of information if HWE in the population is violated. If HWE in the population may be assumed, the covariances can be estimated from the equivalence of the composite LD and LD. However, under HWD the equivalence is violated and even more parameters need to be estimated that require information about phase.
In summary, even though there have now been many investigations using GWA, except for genetic mapping based on linkage disequilibrium units [Collins and Lau, 2008], they have so far considered only the statistic based on Ia for an initial scan, which can be powerless in some cases. Also, application to real marker data indicates that the LD contrast test can be very informative in some situations, so that combining p-values from Sa and can increase the statistical power of GWA studies. The proposed method is optimal and efficient from the viewpoint of using all the information available.
This work was supported in part by a U.S. Public Health Service Resource grant (RR03655) from the National Center for Research Resources, Research grant (GM28356) from the National Institute of General Medical Sciences and Cancer Center Support Grant (P30CAD43703) from the National Cancer Institute.
Before we derive the expectations, variances and covariances, we define the following notation that is used in Weir [Weir, 1996]:
where it should be noted that . Under HWE, because , .