PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of nihpaAbout Author manuscriptsSubmit a manuscriptHHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
 
Genet Epidemiol. Author manuscript; available in PMC 2010 September 1.
Published in final edited form as:
Genet Epidemiol. 2009 September; 33(6): 463–478.
doi:  10.1002/gepi.20399
PMCID: PMC2838926
NIHMSID: NIHMS87870

Phase uncertainty in case-control association studies

Abstract

The possible evidence for association comprises three types of information: differences between cases and controls in allele frequencies, in parameters for Hardy Weinberg disequilibrium (HWD), and in parameters for linkage disequilibrium (LD). LD between marker and disease alleles results in a difference in at least one of the three types of parameters [Won and Elston, 2008]. However, the parameters for LD require knowledge about phase, which is usually unknown, making the LD contrast test without modification infeasible in practice. Methods for handling phase uncertainty are: (1) the most probable haplotype pair for each individual can be considered as the true phase; (2) a weighted average of haplotypes can be used; (3) we can consider the composite LD, which does not require any information about phase. We compare these methods to handle phase uncertainty in terms of validity and efficiency, and the effect on them of HWD in the population, at the same time confirming results for the three types of information. When the LD between markers is high, the LD contrast test that uses a weighted average of haplotypes or the most probable haplotypes to calculate the LD is recommended, but otherwise the LD contrast test that uses the composite LD is recommended. We conclude that, even though the difference in allele frequencies is usually the most informative test except in the case of a recessive disease, the LD contrast test can be more powerful if the markers are dense enough.

Keywords: linkage disequilibrium, haplotype phase, self replication

Introduction

The goal of population association studies is to identify genetic polymorphisms that distinguish among individuals with different disease status, and it has been shown that an association between marker and disease alleles can be detected as a result of gametic phase disequilibrium for linked loci, which is what we shall mean here by linkage disequilibrium (LD). Thus, today LD can play a key role in genetic epidemiology and the advent of SNPs has enabled association studies at the genome-wide level. However, notwithstanding their efficiency, the huge number of SNPs has generated the problem of multiple testing and statistical methods to combat this issue have been investigated. SNP reduction, sometimes called self-replication [Van Steen et al., 2005; Zheng et al., 2007] is a possible solution to this problem when we have two independent statistics that can be used to test the same biological hypothesis. For example, suppose we have 500,000 equally-spaced SNPs in a genome-wide association (GWA) study, i.e. SNPs about 6Kb apart, then correction for multiple tests would require a Bonferroni-corrected p-value of 0.05/500,000 for genome-wide significance at the 0.05 level, which is difficult to meet in practice. However, suppose we have two independent statistics (t1 and t2) and their p-values are u1 and u2. Then, if 1% of the SNPs are selected by screening based on either of the two independent statistics, and only the selected SNPs are used for a testing phase with the other statistic, we now only need a Bonferroni-corrected p-value of 0.05/5,000 at this testing phase to have genome-wide significance at the 0.05 level. This self-replication can be reformulated in terms of combining p-values. The rejection region for t1 (or t2), in this example of self-replication is asymptotically u1<0.05/5,000 with u2<0.01 (or u2<0.05/5,000 with u1<0.01). We can also use Fisher's method or Liptak's method [Liptak, 1958] for combining p-values (see Figure 1 for their rejection regions). As a result, better approaches than self-replication can be attained as long as we use the optimal method for combining u1 and u2; and we have shown that the most powerful method can be obtained under certain circumstances even though no uniformly most powerful test is possible [Won et al., 2008].

Figure 1
Rejection regions

It has been shown that LD between marker and disease alleles can result in differences of the following parameters between cases and controls: allele frequencies, the parameters for Hardy-Weinberg disequilibrium (HWD) and the parameters for LD between markers [Won and Elston, 2008]. The fact that the phase-known genotype frequencies can be parameterized in terms of these three parameters shows they can each be used for association studies. In principle, there is evidence for association if any one of the parameters is different between cases and controls provided certain conditions - such as no genotyping error – hold, and the three types of information can have different levels of power according to the disease mode of inheritance and LD structure. However, even though these results provide intuition about an optimal strategy for analysis, their application to a practical situation is restricted because the usual LD contrast test requires information about phase, which is often unknown. Three ways of overcoming the phase uncertainty have been considered. First, the most probable haplotype pair for each individual can be considered as the true phase. Second, a weighted average of haplotypes can be used. Third, we can use the composite LD [Wang et al., 2007; Zaykin, 2004; Zaykin et al., 2006], which does not require any information about phase. The validity and efficiency of these three approaches have not yet been investigated. We therefore compare these approaches to handle phase uncertainty in terms of their type I error and power according to disease mode of inheritance and SNP density. At the same time, we investigate the effect of HWD in the population on the analysis. Our results show that, whereas the composite LD always preserves the type I error with reduced power, the methods that use the most probable haplotype or weighted haplotypes for each individual can have better power with type I error preserved provided dense markers are used. We further show that, when the markers are dense, the LD contrast test can be more powerful than a test based on allele frequency differences.

Methods

Notation

Let Xij and Yij be random variables for marker loci A and B defined by

Xij={1if an allele at locusAin haplotypejof individualiisA10if an allele at locusAin haplotypejof individualiisA2}Yij={1if an allele at locusBin haplotypejof individualiisB10if an allele at locusBin haplotypejof individualiisB2}

where j = 1 (2) indicates a maternal (paternal) haplotype. For given marker loci A and B, let pAk, pAkAk, pAkBk, pAkBkAkBk, pAkBkAkBk and pAk,Bk be respectively the frequencies of allele Ak , genotype AkAk, haplotype AkBk, phase-unknown genotype AkBkAk'Bk, phase-known genotype AkBk/Ak'Bk, and the joint frequencies of alleles Ak and Bk of two different gametes in the population, where, for the SNPs that are used, there are usually only two alleles, A1 (B1) and A2 (B2). Also, for locus A, let pAk|case and pAk|control be the frequencies of allele Ak in the case group and control group, respectively, and let pAkAk′|case(pAkAk′|control) and pAkBk′|case(pAkBk′|control) be the analogous frequencies of the genotypes AkAk and joint frequencies of alleles Ak and Bk in two different gametes. As measures of HWD, let dA [equivalent] pA1A1 −(pA1)2, dA|case [equivalent] pA1A1|case −(pA1|case)2 and dA|control [equivalent] pA1A1|control −(pA1|control)2for the population, cases and controls, respectively. For haplotype level HWD, we define the following:

d12ABpA1B1pA1B212pA1B2A1B1,d13ABpA1B1pA2B112pA2B1A1B1d14ABpA1B1pA2B212pA2B2A1B1,d23ABpA1B2pA2B112pA2B1A1B2d24ABpA1B2pA2B212pA2B2A1B2,d34ABpA2B1pA2B212pA2B2A2B1.

Then dA=(d13AB+d14AB+d23AB+d24AB) and pA1B1pA1B1pA1B1A1B1=d12ABd13ABd14AB.

To parameterize LD, let ΔA1B1 [equivalent] pA1B1pA1B1, ΔA1B1|case [equivalent] pA1B1|casepA1|casepB1|case and ΔA1B1|control [equivalent] pA1B1|controlpA1|controlpB1|control be the corresponding parameters between the two loci. The composite LD parameters, which do not require any information about phase, are respectively defined as ΔA1B1C=pA1B1+pA1,B12pA1pB1, ΔA1B1caseC=pA1B1case+pA1,B1case2pA1casepB1case and ΔA1B1controlCpA1B1control+pA1,B1control2pA1controlpB1control in the population, cases and controls, respectively. Finally, it should be kept in mind that this notation is also applied to a disease locus (D) by correspondingly replacing the notation for the locus and alleles.

Phase uncertainty

When there are markers around a disease locus D, where D1 (D2) denotes a disease (normal) allele, association can be detected from the LD between the marker and the disease alleles, and we have shown that the information for association in case-control studies consists of three different parts [Won and Elston, 2008]:

  • Ia: trend (or difference) in allele frequencies
  • IHWD: trend (or difference) in allele frequencies
  • ILD: trend (or difference) in LD.

For each type of information, we define the three principal statistics as follows [Nielsen et al., 1998; Nielsen et al., 2004; Sasieni, 1997; Song and Elston, 2006]:

Sa=p^A1casep^A1controlvar(p^A1casep^A1control)SHWD=d^Acased^Acontrolvar(d^Acased^Acontrol)SLD=Δ^A1B1caseΔ^A1B1controlvar(Δ^A1B1caseΔ^A1B1control).

However, haplotypes are usually unknown and only the genotype at each locus is observed. Three possible ways that handle phase uncertainty have been used. First, the most probable haplotype pair for each individual can be estimated and considered as the true phase. Second, a weighted average of haplotypes can be used. Third, we can use the composite LD [Wang et al., 2007; Zaykin, 2004; Zaykin et al., 2006], which does not require any information about phase. We denote these methods as follows:

  • ILDm: trend (or difference) in LD using the most probable haplotypes
  • ILDw: trend (or difference) in LD using weighted haplotypes
  • ILDc: trend (or difference) in composite LD.

Also, as has been known for some time, if the variance for ILD is used in place of the true variances of the three corresponding statistics, the type I error can be either inflated or deflated; this will also be seen in our results. Thus, the variances of the statistics corresponding to ILDm, ILDw and ILDc need to be derived.

First, we consider the most probable haplotypes and weighted averages of haplotypes obtained via haplotype frequency estimation. If the haplotype frequencies are known, then, letting w = pA1B1pA1B1/(pA1B1pA1B1 + pA1B2pA2B1), the most probable haplotypes and the weighted haplotypes for loci A and B result in the following for the phase-unknown double heterozygote genotype A1B2A2B1:

most probable haplotypesA1B2A2B1{A1B1A2B2ifwis larger than0.5A1B2A2B1otherwise}weighted haplotypesA1B2A2B1{A1B1A2B2with probabilitywA1B2A2B1with probability1w.}

Thus, if we let N and NAkBkAkBk be the sample size and the number of individuals with genotype AkBkAkBk in either cases or controls, the estimated parameters for LD Δ^LDm and Δ^LDw can be denoted

Δ^A1B1m1N[NA1A1B1B1+12NA1A1B1B2+12NA1A2B1B1+12NA1A2B1B2I(w>12)]p^A1p^B1andΔ^A1B1w1N[NA1A1B1B1+12NA1A1B1B2+12NA1A2B1B1+w2NA1A2B1B2]p^A1p^B1.

By defining Δ^A1B1W=Δ^A1B1m if W = I(w > 1/2) for the indicator function I, and Δ^A1B1W=Δ^A1B1w if W = w, we can subsume both estimates in a general way, using W, as follows:

Δ^A1B1W1N[NA1A1B1B1+12NA1A1B1B2+12NA1A2B1B1+W2NA1A2B1B2]p^A1p^B1.

To derive the variance and covariances with the estimators for Ia and IHWD, we parameterize Δ^A1B1W in terms of the random variables defined above, as follows:

Δ^A1B1W=12NΣi=1N[2Xi1Xi2Yi1Yi2+(1Xi1)Xi2Yi1Yi2+Xi1(1Xi2)Yi1Yi2]+Xi1Xi2(1Yi1)Yi2+Xi1Xi2Yi1(1Yi2)+W(1Xi1)Xi2(1Yi1)Yi2+[WXi1(1Xi2)(1Yi1)Yi2+W(1Xi1)Xi2Yi1(1Yi2)+WXi1(1Xi2)Yi1(1Yi2)]p^A1p^B1=Δ^A1B1+12NΣi=1N[{W(Xi1(1Xi2)(1Yi1)Yi2+(1Xi1)Xi2Yi1(1Yi2))}][(1W){(1Xi1)Xi2(1Yi1)Yi2+Xi1(1Xi2)Yi1(1Yi2)}],

because Δ^A1B1=Σi=1N[Xi1Yi1+Xi2Yi2]2Np^A1p^B1 and we obtain the results under Hardy-Weinberg equilibrium (HWE) using the delta method in a way analogous to that in Won and Elston [Won and Elston, 2008]:

E(Δ^A1B1W)=ΔA1B1(pA1B1+ΔA1B1){(1pA1)(1pB1)+ΔA1B1}+W[2pA1pB1×][(1pA1)(1pB1)+ΔA1B1{(12pA1)(12pB1)+2ΔA1B1}]ΔA1B12N
var(Δ^A1B1W)=1N[(12W){14pA1qA1pB1qB1(4pA1qA1pB1qB1+2bA1bB11)+ΔA1B1(2pA12pB12(2qA1qB1+qA1+qB1))}]+(2pA1pB1(pA1qA1+pB1qB12qA1qB1)+18(2(pA1pB1)2+6(pA1qA1+pB1qB1)1))+ΔA1B12(pA1qA1+pB1qB16pA1qA1pB1qB1)+ΔA1B13(1bA1bB1)ΔA1B14+W(12pA1qA1pB1qB1(4pA1qA1pB1qB11)+14ΔA1B1bA1bB1(8pA1qA1pB1qB1)[{(1)+2ΔA1B12(6pA1qA1pB1qB1pA1qA1pB1qB1)+2ΔA1B13bA1bB1+2ΔA1B14)}+18(2pA1qA1pB1qB1+ΔA1B1bA1bB1)]
cov(p^A1,Δ^A1B1W)=12N(12pA1)[ΔA1B1+2WpA1B2pA2B12(1W)pA1B1pA2B2]+O(1N2)cov(d^A,Δ^A1B1W)=12NpA1[WpA1B2pA2B1(1W)pA1B1pA2B2]+O(1N2).

Next, the composite LD can be estimated as follows [Weir, 1996]:

Δ^A1B1C1N[2NA1A1B1B1+NA1A1B1B2+NA1A2B1B1+12NA1A2B1B2]2p^A1p^B1.

To derive its variance and relevant covariances, we parameterize it in terms of the random variables defined above as follows:

Δ^A1B1C=12NΣi=1N[4Xi1Xi2Yi1Yi2+2(1Xi1)Xi2Yi1Yi2+2Xi1(1Xi2)Yi1Yi2]+2Xi1Xi2(1Yi1)Yi2+2Xi1Xi2Yi1(1Yi2)+(1Xi1)Xi2(1Yi1)Yi2[+Xi1(1Xi2)(1Yi1)Yi2+(1Xi1)Xi2Yi1(1Yi2)+Xi1(1Xi2)Yi1(1Yi2)]2p^A1p^B1=Δ^A1B1+12NΣi=1N[Xi1Yi2+Xi2Yi1]p^A1p^B1.

With these results, we find that the expectations, variances [Weir, 1996] and covariances under HWE in either cases or controls are as follows:

E(Δ^A1B1C)=ΔA1B1ΔA1B1N,var(Δ^A1B1C)=1N[pA1(1pA1)+pB1(1pB1)+12(12pA1)(12pB1)ΔA1B1].

cov(p^A1,Δ^A1B1C)=ΔA1B1(12pA1)2N+O(1N2),andcov(d^A,Δ^A1B1C)=1NΔA1B1pA1(1pA1)+O(1N2).

(see the Appendix for HWD). These show that the expectation and covariances of Δ^A1B1C, but not its variance, are approximately equal to those of Δ^A1B1 under HWE. Also, the equivalence of ΔA1B1C to ΔA1B1 under HWE guarantees that

var(Δ^A1B1C)=1N[pA1(1pA1)pB1(1pB1)+12(12pA1)(12pB1)ΔA1B1C]andcov(p^A1,Δ^A1B1C)=ΔA1B1C(12pA1)2N+O(1N2).

With these results, we can consider the following three LD contrast test statistics for association analysis in case-control studies:

SLDm=Δ^A1B1casemΔ^A1B1controlmvar(Δ^A1B1casemΔ^A1B1controlm)SLDw=Δ^A1B1casewΔ^A1B1controlwvar(Δ^A1B1casewΔ^A1B1controlw)SLDc=Δ^A1B1casecΔ^A1B1controlcvar(Δ^A1B1casecΔ^A1B1controlc).

Quantification of the parameters for LD

To elucidate the relative efficiency of the parameters for LD, the information in ILDm, ILDw and ILDC needs to be quantified and compared. If we assume HWE in the population and respectively let ϕ and ϕl be the disease prevalence and the penetrance of genotype l, the parameters for Ia, IHWD and ILD in cases and controls are as follows [Nielsen et al., 1998; Won and Elston, 2008]:

pA1case=pA1+ΔA1D1ϕ[pD1(ϕD1D1ϕD1D2)+pD2(ϕD1D2ϕD2D2)]pA1control=pA1+ΔA1D11ϕ[pD1(ϕD1D1ϕD1D2)+pD2(ϕD1D2ϕD2D2)]dAcase=pA1A1case(pA1case)2=ΔA1D12ϕ(ϕD1D1ϕD2D2ϕD1D22)dAcontrol=pA1A1control(pA1control)2=ΔA1D12(1ϕ)2((1ϕD1D1)(1ϕD2D2)(1ϕD1D2)2)

ΔA1B1case=ΔA1B1+1ϕ[pD(ϕDDϕDd)+pd(ϕDdϕdd)]×[ΔA1DB11ϕ(pD(ϕDDϕDd)+pd(ϕDdϕdd))ΔA1DΔDB1]ΔA1B1control=ΔA1B1+11ϕ[pD(ϕDDϕDd)+pd(ϕDdϕdd)]×[ΔA1DB111ϕ(pD(ϕDDϕDd)+pd(ϕDdϕdd))ΔA1DΔDB1].

The quantification of ΔA1B1m and ΔA1B1w can be generalized using ΔA1B1W:

ΔA1B1caseW=(2W1)pA1casepA2casepB1casepB2case+ΔA1B1case{1+(W1)(pA1casepB1case}{+pA2casepB2caseW(pA1casepB2case+pA2casepB1case)}+(2W1)ΔA1B1case2,

and ΔA1B1controlW can also be quantified analogously.

The quantification of ΔA1B1C requires the joint frequencies of alleles A1 and B1 in two different gametes because ΔA1B1C is defined as ΔA1B1C=pA1B1+pA1B12pA1pB1. Under HWE in the population we have:

pA1B1case=1ϕΣj{DD,Dd,dd}[P(A1B1A1B1,j,affected)+12{P(A1B2A1B1,j,affected)}][{+P(A1B1A2B1,j,affected)+P(A1B2A2B1,j,affected)}]=1ϕ[P(A1D1)P(D1B1)ϕD1D1+{P(A1D1)P(D2B1)+P(A1D2)P(D1B1)}ϕD1D2]+P(A1D2)[P(D2B1)ϕD2D2]=pA1pB1+1ϕ[(pA1ΔD1B1+pB1ΔA1D1)]{pD(ϕDDϕDd)+pd(ϕDdϕdd)}[+ΔA1D1ΔD1B1(ϕD1D12ϕD1D2+ϕD2D2)],

where the derivation is based on the fact that pA1B1=pA1B1A1B1+12(pA1B1A1B2+pA2B1A1B1+pA2B1A1B2). Thus, under HWE the composite LD between markers in cases is

ΔA1B1caseC=ΔA1B1case+pA1B1casepA1casepB1case=ΔA1B1+ΔA1D1ΔD1B1ϕ[ϕDD2ϕDd+ϕdd]+1ϕ[pD(ϕDDϕDd)+pd(ϕDdϕdd)]×[ΔA1DB12ϕ(pD(ϕDDϕDd)+pd(ϕDdϕdd))ΔA1DΔDB1].

Analogously, for controls, we have:

ΔA1B1controlC=ΔA1B1control+pA1B1controlpA1controlpB1control=ΔA1B1+ΔA1D1ΔD1B11ϕ[ϕDD2ϕDd+ϕdd]+11ϕ[pD(ϕDDϕDd)+pd(ϕDdϕdd)]×[ΔA1DB121ϕ(pD(ϕDDϕDd)+pd(ϕDdϕdd))ΔA1DΔDB1].

Results

Simulations using real marker data

The cystic fibrosis transmembrane conductance regulator (CFTR) is located across 200 Kb in region q31.2 on the long arm of chromosome 7. From the HapMap database, we downloaded the marker data of the CFTR haplotype in the CEU (CEPH-Utah resident) sample. We randomly selected, to generate separate genotypes for each of 4,000 persons (2,000 representing cases and 2,000 representing controls), 11 SNPs whose minor allele frequencies are larger than 0.1 and used the SNP located in the middle (with allele frequency ranging between 0.1 and 0.5) as a causal SNP for the 2,000 cases, the others being used as markers. To study empirical type I error, we assume homozygous and heterozygote disease genotype relative risks λ2 and λ1 equal to 1, and randomly generated a binary trait. To study empirical power, we assumed λ2=1.4, λ1 being determined by the disease mode of inheritance: λ1 = 1 for a recessive disease, λ1 = λ2 for a dominant disease, λ1=λ2+12 for an additive disease and λ1=λ2 for a multiplicative disease. We considered two marker densities: (i) sparse markers, the consecutive markers being separated by about 30 Kb, and (ii) dense markers, separated by about 5Kb. For each situation, the empirical power and size was calculated from 1,000 replicates of 2,000 cases and 2,000 controls. Thus, in all, 4,000× 1,000 = 4,000,000 separate genotypes were generated.

In these simulations, the statistics, Sa and SHW were calculated at the marker closest to the causal SNP for the empirical power. SLD requires information about phase, and so the haplotypes for each individual were first estimated from the 10 markers with the default options of the program DECIPHER in S.A.G.E.1 and then the most probable pair of haplotypes assumed to be the true haplotypes. Table 1 shows that for sparse markers, whereas the type I errors for Sa and SHWD are little inflated or deflated, SLD can be too conservative. Also, it shows that Sa is the most powerful statistic and the others are not very informative. Contrary to the results for sparse markers, Table 2 shows that, for dense markers, not only do all three statistics preserve type I error relatively well (though SLDC is a little liberal), but also SLD yields the most informative test. Thus, when markers are not dense enough, ILD can be almost uninformative, and assigning the most probable haplotypes as true haplotypes results in inflated (or sometimes deflated) type I error. If the markers are dense, on the other hand, we can assume the most probable haplotypes are the true haplotypes and ILD can be more informative than Ia. It has in fact been shown that the power of SLD depends on high multi-locus LD between marker and disease alleles [Won and Elston, 2008], which explains why ILD can be more informative than Ia when the markers are dense.

Table 1
Simulations with sparse marker data
Table 2
Simulations with dense marker data

Simulations using simulated marker data

From the previous simulations, we confirmed that for sparse markers the type I error of SLD can be inflated if the most probable haplotypes are assumed to be true haplotypes. We now examine with simulations both the validity and efficiency of the three LD contrast tests modified for phase uncertainty: SLDm,SLDw and SLDc. In particular the two-marker haplotype frequencies for SLDm and SLDw are estimated by the EM algorithm [Excoffier and Slatkin, 1995]. The disease prevalence and the disease allele frequency are assumed to be 0.15 and 0.1, respectively. We used λ2 =1.4, λ1 being determined by the disease mode of inheritance. For the empirical type I error, we assume LE between marker and disease alleles and, for the empirical power, we considered three different LD structures: (1) ΔA1D1 = 0.076, ΔD1B1 = 0.023, ΔA1B1 = 0.06 and ΔA1D1B1 = 0.013; (2) ΔA1D1 = 0.076, ΔD1B1 = 0.047, ΔA1B1 = 0.06 and ΔA1D1B1 = 0.03; and (3) ΔA1D1 = 0.076, ΔD1B1 = 0.068, ΔA1B1 = 0.06 and ΔA1D1B1 = 0.049. These values represent three increasing levels of three-locus LD.

Figures Figures22 and and33 show the empirical type I error from 20,000 replicates of 10,000 cases and 10,000 controls as a function of LD between markers when pA1 and pB1 are 0.2 and 0.3, respectively. For w, we considered two different situations: (i) the haplotype frequencies in cases and controls are the same, so that the weights, w, are equal for cases and controls, and (ii) the haplotype frequencies in cases and controls are different. The results show that estimation of the weight, w, can result in preserved type I errors for SLDm and SLDw under case (i) but results in inflated type I error under case (ii), while SLDc always preserves the type I error well. However, when the absolute values of Lewontin's D′ [Lewontin, 1964] between markers are high, in both situations the inflation of type I error is negligible. Thus we conclude that type I error is preserved well as long as the same weights are used for cases and controls.

Figure 2
Empirical type I error with simulated marker data
Figure 3
Empirical type I error with simulated marker data

Figures Figures4,4, ,55 and and66 show the average empirical power from 20,000 replicates for three different LD structures as a function of sample size; for w in SLDm and SLDw, we assume that the haplotype frequencies are the same in cases and controls. The sample sizes for cases and controls are equal. From the results, we can confirm that the LD contrast test (SLDm,SLDw and SLDc) can be more powerful than Sa with LD structure (3), because there is high three-locus LD between marker and disease alleles. Also, as was shown before, SLDc is better than SLDm and SLDw for a recessive disease. Otherwise, SLDc loses power because its variance is approximately twice as large as the variance of SLDm or SLDw.

Figure 4
Empirical power when ΔA1D1 = 0.076, ΔD1B1 = 0.023, ΔA1B1 = 0.06 and ΔA1D1B1 = 0.013
Figure 5
Empirical power when ΔA1D1 = 0.076, ΔD1B1 = 0.047, ΔA1B1 = 0.06 and ΔA1D1B1 = 0.03
Figure 6
Empirical power when ΔA1D1 = 0.076, ΔD1B1 = 0.068, ΔA1B1 = 0.06 and ΔA1D1B1 = 0.049

Figure 7 shows, for pA1 and pB1 = 0.2, the analytical correlations of [p with inverted breve]A1 and dA with Δ^A1B1m, Δ^A1B1w and Δ^A1B1c when the haplotype frequencies are known (panels a and b), and the empirical correlations of Sa and SHWD with SLDm,SLDw and SLDc(panels c and d) obtained from 20,000 replicates of 10,000 cases and 10,000 controls under LE between marker and disease alleles. Overall, panels a and b are similar to panels c and d, respectively, even though haplotype frequencies are assumed to be known for the former and not for the latter. Figure 7 panels c and d also show that, for D′ > 0, Sa is least correlated with, SLDc and most with SLDm, while SHWD is least correlated with SLDm and most with SLDc. In particular, the correlations of SHWD with the parameters for LD are relatively small because HWE is assumed in the population. As a result, either SLDc or SLDw may then be appropriate for self-replication as long as they have comparable informativity.

Figure 7
Correlations

Quantification

The above simulations confirmed that the methods for handling phase uncertainty preserve the type I error in certain situations. We now examine quantification of the standardized expected differences between cases and controls in order to analytically compare their efficiencies. If we let SED be the standardized expected difference for a single case and a single control, defined as

SED=expectation of the difference statistic for one case and one controls.d. of the difference statistic for one case and one control,

and if we let Φ(·) and Z1-α/2 be respectively the cumulative standard normal distribution and the 1-α/2 quantile of the standard normal distribution, the power for N cases and N controls is equal to Φ(SEDNZ1α2)+Φ(SEDNZ1α2). The required sample size for given power and significance level can be plotted as a function of the SED [Won and Elston, 2008]. In Figures Figures88 and and9,9, we plot SEDs as a function of LD for various modes of inheritance. Again, the disease allele frequency was assumed to be 0.1 and we used λ2 =1.4, λ1 being determined by the disease mode of inheritance. We assume the disease prevalence is 0.15 and the minor allele frequencies for the markers A and B are 0.2. Figure 8 shows the SED as a function of ΔA1D1, when ΔA1D1B1 = 0.25ΔA1D1 and ΔB1D1 = 0.2ΔA1D1. These results show that SLDc is less powerful than SLD except in the case of a recessive disease. The expectation of the statistic for ILDc is slightly larger than that of ILD , but the variance is much larger; if ΔA1B1 is small, the variance of the composite LD estimator is approximately twice as large as the variance of the LD estimator based on phase-known data. When the genetic effects are large, the variance of ILDc is much larger than that of ILD, while the expected differences for ILDc and ILD are similar, which explains the higher SED of ILD. In addition, it can be seen in Figure 8 how the disease mode of inheritance affects the SEDs of the three statistics. For a recessive model, the estimators for Ia and IHWD are relatively better, whereas the estimators for Ia and ILD (ILDm,ILDw and ILDc) are better for the other disease models (see Won and Elston15 for further details). Figure 9 shows the SED as a function of ΔA1B1, when ΔA1D1 = 0.076, ΔD1B1 = 0.068 and ΔA1D1B1 = 0.049. As expected, the results show that the SED of ILDc is largest for a recessive disease and the SED of ILD is largest for the other disease models. The SEDs of ILD and ILDc are inversely proportional to the magnitude of ΔA1B1 because the variances increase with ΔA1B1, though the expected difference does not depend on ΔA1B1. In addition, the SEDs for ILDm and ILDw are similar to that for ILD when ΔA1B1 is high, while the SEDs of ILDm and ILDw are almost negligible when ΔA1B1 is negative. As a result, we can conclude that ILDc is generally valid and efficient but, when ΔA1B1 is positive and the disease mode of inheritance is not recessive, either ILDm or ILDw is better when we have dense markers.

Figure 8
SED as a function of ΔA1D
Figure 9
SED as a function of ΔA1B1

HWD in the population and type I error

Table 3 shows the empirical type I error according to whether there is HWE or HWD in the population. pA1, pB1 and pD1 are assumed to be 0.2, 0.2 and 0.1, respectively, λ2 is 1.4 and λ1 is calculated according to the disease mode of inheritance. The disease prevalence is 0.15 and LE is assumed between marker and disease alleles. In all cases we assumed the parameter for haplotype-level HWD for the three loci (A, B, and D) is 6.4×10−5, which is the maximal amount possible in this situation. Two types of variances for Sa, SHWD and SLDC are considered: (i) calculated allowing for HWD, and (ii) calculated assuming HWE. Table 3, based on 20,000 replicates each with 10,000 cases and 10,000 controls, shows that while Sa, SHWD and SLDC preserve the type I errors at the significance levels 0.05 and 0.01 for all disease modes of inheritance in situation (i), the empirical type I errors are inflated in situation (ii). Also, it appears that the disease mode of inheritance and the type of statistic affect the level of inflation. Sa is inflated only for a dominant disease, while SHWD and SLDC are inflated for all disease modes of inheritance. In particular, SLDC is the most sensitive to HWD, which indicates that HWE in the population should be confirmed a priori for the LD contrast test.

Table 3
Effect of maximum possible HWD in the population under LE

Haplotype-level HWD under LE between marker and disease alleles can result in differences between cases and controls for three types of parameters. For example, the allele frequencies for cases and controls are

pA1case=1ϕ[(PA1D12+PA1D1PA2D1+d12AD+d14AD)ϕD1D1+(2PA1D1PA1D2+PA1D1PA2D2+PA1D2PA2D1][2d12ADd14ADd23AD)ϕD1D2+(PA1D22+PA1D2PA2D2+d12AD+d23AD)ϕD2D2]=pA1+1ϕ[(ΔA1D1pD1+d12AD+d14AD)(ϕD1D1ϕD1D2)+(ΔA1D1pD2d12ADd23AD)(ϕD1D2ϕD2D2)],pA1control=pA111ϕ[(ΔA1D1pD1+d12AD+d14AD)(ϕD1D1ϕD1D2)+(ΔA1D1pD2d12ADd23AD)(ϕD1D2ϕD2D2)].

These results indicate that the allele frequencies can be different between cases and controls under the null hypothesis ΔA1D1 =0, if there is haplotype-level HWD in the population. However, Table 3 shows that type I error is preserved well and its effect is negligible for most situations as long as we use the variances that allow for HWD.

Discussion

In addition to confirming previous results [Won and Elston, 2008], we have shown that, if markers are dense enough, the LD contrast test can be more informative than Sa. Also, we extended SLD to SLDm, SLDw and SLDC, the last of which does not require any information about phase. With this extension, we made all the types of information feasible for testing association in a case-control study. SLDm and SLDw need estimates of the haplotype frequencies and our simulation shows that haplotype frequency estimation does not substantially inflate the type I error as long as the same weights are used for cases and controls. Otherwise, it greatly inflates type I error when the absolute LD between marker alleles is small, while type I error is preserved for high LD between markers. Our results also show that the variance of ILDC is approximately twice as large as the variance of ILD.

The decomposition of the parameters available for a case-control study enable self-replication and the increase in power resulting from self-replication can be understood in terms of methods for combining p-values. The most powerful method has been derived for when the SEDs are known and this method can be extended to statistics that are correlated16. Also, it should be remembered that among the possible choices for phase uncertainty SLDm is the least appropriate when the markers are not dense because it is then the most correlated with the statistics that use the other types of information (see Figure 7). If the disease mode of inheritance is known, in the case of a recessive disease we suggest combining the p-values from Sa, SHWD and SLDc. For a dominant disease, we should combine the p-values from Sa, SHWD and SLDw (or SLDm) when markers are dense, and the p-values from Sa, SHWD and SLDc when the markers are not dense. For multiplicative and additive disease modes of inheritance, we should consider Sa and SLDm (or SLDw) when the markers are dense, and Sa and SLDc when the markers are not dense. If the disease mode of inheritance is unknown, we should first decide on the mode of inheritance using SHWD [Wittke-Thompson et al., 2005; Zheng and Ng, 2008], and then combine Sa and SLDm (or SLDw), or Sa and SLDc according to the disease mode of inheritance and marker density (Figure 10 shows the SED as a function of λ1).

Figure 10
SED as a function of λ1

However, in spite of the efficiency and validity of the proposed method, there are still problems that need attention. First, the suggested strategy that determines the disease mode of inheritance with SHWD does not use SHWD to indicate association per se. As has been shown, the most powerful method is attainable only when the SEDs of the statistics are known; the SED for IHWD needs information about the disease mode of inheritance. The new method of determining the disease mode of inheritance with type I error preserved requires further investigation. Second, we need to have estimates of the covariances to combine the three types of information if HWE in the population is violated. If HWE in the population may be assumed, the covariances can be estimated from the equivalence of the composite LD and LD. However, under HWD the equivalence is violated and even more parameters need to be estimated that require information about phase.

In summary, even though there have now been many investigations using GWA, except for genetic mapping based on linkage disequilibrium units [Collins and Lau, 2008], they have so far considered only the statistic based on Ia for an initial scan, which can be powerless in some cases. Also, application to real marker data indicates that the LD contrast test can be very informative in some situations, so that combining p-values from Sa and SLDc can increase the statistical power of GWA studies. The proposed method is optimal and efficient from the viewpoint of using all the information available.

Acknowledgements

This work was supported in part by a U.S. Public Health Service Resource grant (RR03655) from the National Center for Research Resources, Research grant (GM28356) from the National Institute of General Medical Sciences and Cancer Center Support Grant (P30CAD43703) from the National Cancer Institute.

Appendix

1. Notation

Before we derive the expectations, variances and covariances, we define the following notation that is used in Weir [Weir, 1996]:

ΔA1B1pA1B1pA1pB1=pA1pB1+d14ABd23ABpA1pB1=d14ABd23ABdA1A1B1pA1B1A1B1+12pA1B2A1B1pA1ΔA1B1CpB1dApA12pB1dA1B1B1pA1B1A1B1+12pA2B1A1B1pB1ΔA1B1CpA1dBpA1pB12=d12AB(1pA1)+d14AB(1pA1pB1)+d23AB(pB1pA1)d34ABpA1dA1B1A1B1pA1B1A1B12pA1dA1B1B12pB1dA1A1B12pA1pB1ΔA1B1CpA12dBpB12dAΔA1B12ΔA1B12dAdBpA12pB12=d12AB+d13AB+d14AB+2pA1dA1B1B12pB1dA1A1B12pA1pB1ΔA1B1CpA12dBpB12dAΔA1B12dAdB,

where it should be noted that ΔA1B1c=ΔA1B1+ΔA1B1. Under HWE, ΔA1B1=dA1A1B1=dA1B1B1=0 because dA1A1B1=d13AB(1pB1)+d14AB(1pA1pB1)+d23AB(pA1pB1)d24ABpB1, dA1B1B1=d12AB(1pA1)+d14AB(1pA1pB1)+d23AB(pB1pA1)d34ABpA1.

2. Results under HWD

The variances of Δ^A1B1 and Δ^A1B1C were derived as follows [Weir, 1996]:

var(Δ^A1B1)=12N[pA1(1pA1)pB1(1pB1)+(12pA1)(12pB1)ΔA1B1ΔA1B12][+dAdB+ΔA1B12+dA1B1A1B1]var(Δ^A1B1C)=1N[pA1B1A1B1+pA1pB1(1pA1pB1)+pA1(12pA1)dB+pB1(12pB1)dA+(14pA1)dA1B1B1][+(14pB1)dA1A1B1+12(12pA12pB1)ΔA1B1CΔA1B1C2]

With results from Won and Elston [Won and Elston, 2008], we have

E(Δ^A1B1)=ΔA1B1+pA1B112N2E(Σi=1NXi1)(Σi=1NYi1)12N2E(Σi=1NXi1)(Σi=1NYi2)=ΔA1B1ΔA1B1+d14ABd23AB2N,E(Δ^A1B1C)=E(Δ^A1B1+12NΣi=1N[Xi1Yi2+Xi2Yi1]p^A1p^B1)=ΔA1B1+d14ABd23ABΔA1B1+d14ABd23ABN,cov(p^A1,Δ^A1B1C)=12N[2pB1dA+(d14ABd23AB)+2d13AB+2d14AB4pA1(d14ABd23AB)][+ΔA1B1(12pA1)]+O(1N2),cov(d^A,Δ^A1B1C)=1N[2pB1dA(12pA1)+2(d13AB+d14AB)(12pA1)+(d14AB23AB)(pA1+3pA12dA)][+ΔA1B1(pA1pA12dA)]+O(1N2).

.

REFERENCES

  • Collins A, Lau W. CHROMOSCAN: genome-wide association using a linkage disequilibrium map. J Hum Genet. 2008;53:121–126. [PubMed]
  • Excoffier L, Slatkin M. Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol Biol Evol. 1995;12:921–927. [PubMed]
  • Fisher RA. Statistical Methods for Research Workers. 11th ed. Oliver & Boyd; Edinburgh: 1950.
  • Lewontin RC. The Interaction of Selection and Linkage. I. General Considerations; Heterotic Models. Genetics. 1964;49:49–67. [PubMed]
  • Liptak T. On the combination of independent tests. Magyar Tud.Akad.Mat.Kutato' Int.Ko"zl. 1958;3:171–197.
  • Nielsen DM, Ehm MG, Weir BS. Detecting marker-disease association by testing for Hardy-Weinberg disequilibrium at a marker locus. Am J Hum Genet. 1998;63:1531–1540. [PubMed]
  • Nielsen DM, Ehm MG, Zaykin DV, Weir BS. Effect of two- and three-locus linkage disequilibrium on the power to detect marker/phenotype associations. Genetics. 2004;168:1029–1040. [PubMed]
  • S.A.G.E. 5.2 Statistical Analysis for Genetic Epidemiology. 2007. http://darwin.cwru.edu/sage/
  • Sasieni PD. From genotypes to genes: doubling the sample size. Biometrics. 1997;53:1253–1261. [PubMed]
  • Song K, Elston RC. A powerful method of combining measures of association and Hardy-Weinberg disequilibrium for fine-mapping in case-control studies. Stat in Med. 2006;25:105–126. [PubMed]
  • Van Steen K, McQueen MB, Herbert A, Raby B, Lyon H, DeMeo DL, Murphy A, Su J, Datta S, Rosenow C, et al. Genomic screening and replication using the same data set in family-based association testing. Nat Genet. 2005;37:683–691. [PubMed]
  • Wang T, Zhu W, Elston RC. Improving power in contrasting linkage-disequilibrium patterns between cases and controls. Am J Hum Genet. 2007;80:911–920. [PubMed]
  • Weir BS. Genetic Data Analysis II. Sinauer Associates; Sunderland, MA: 1996.
  • Wittke-Thompson JK, Pluzhnikov A, Cox NJ. Rational inferences about departures from Hardy-Weinberg equilibrium. Am J Hum Genet. 2005;76:967–986. [PubMed]
  • Won S, Elston RC. The power of independent types of genetic information to detect association in a case-control study design. Genet Epidemiol. 2008 In press. [PubMed]
  • Won S, Morris N, Lu Q, Elston RC. Choosing optimal method for combining p-values. Stat in Med. 2008 Submitted.
  • Zaykin DV. Bounds and normalization of the composite linkage disequilibrium coefficient. Genet Epidemiol. 2004;27:252–257. [PubMed]
  • Zaykin DV, Meng Z, Ehm MG. Contrasting linkage-disequilibrium patterns between cases and controls as a novel association-mapping method. Am J Hum Genet. 2006;78:737–746. [PubMed]
  • Zheng G, Song K, Elston RC. Adaptive two-stage analysis of genetic association in case-control designs. Hum Hered. 2007;63:175–186. [PubMed]
  • Zheng G, Ng HK. Genetic model selection in two-phase analysis for case control association studies. Biostatistics. 2008;9:391–399. [PMC free article] [PubMed]