Home | About | Journals | Submit | Contact Us | Français |

**|**HHS Author Manuscripts**|**PMC2732754

Formats

Article sections

Authors

Related links

Genet Epidemiol. Author manuscript; available in PMC 2010 September 1.

Published in final edited form as:

Genet Epidemiol. 2009 September; 33(6): 497–507.

doi: 10.1002/gepi.20402PMCID: PMC2732754

NIHMSID: NIHMS93959

Wei Pan, Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455;

Correspondence author: Wei Pan, Telephone: (612) 626-2705, Fax: (612) 626-0660, Email: ude.nmu.tatsoib@piew, Address: Division of Biostatistics, MMC 303, School of Public Health, University of Minnesota, Minneapolis, Minnesota 55455 0392, U.S.A

See other articles in PMC that cite the published article.

We consider detecting associations between a trait and multiple SNPs in linkage disequilibrium (LD). To maximize the use of information contained in multiple SNPs while minimizing the cost of large degrees of freedom (DF) in testing multiple parameters, we first theoretically explore the sum test derived under a working assumption of a common association strength between the trait and each SNP, testing on the corresponding parameter with only one DF. Under the scenarios that the association strengths between the trait and the SNPs are close to each other (and in the same direction), as considered by Wang and Elston (2007), we show with simulated data that the sum test was powerful as compared to several existing tests; otherwise, the sum test might have much reduced power. To overcome the limitation of the sum test, based on our theoretical analysis of the sum test, we propose five new tests that are closely related to each other and are shown to consistently perform similarly well across a wide range of scenarios. We point out the close connection of the proposed tests to the Goeman test. Furthermore, we derive the asymptotic distributions of the proposed tests so that p-values can be easily calculated, in contrast to the use of computationally demanding permutations or simulations for the Goeman test. A distinguishing feature of the five new tests is their use of a diagonal working covariance matrix, rather than a full covariance matrix as used in the usual Wald or score test. We recommend the routine use of two of the new tests, along with several other tests, to detect disease associations with multiple linked SNPs.

Due to the rapid development of large-scale genotyping technologies, it has become feasible to conduct genome-wide association studies (GWASs); in fact, quite a few GWASs have been completed while many others are underway. However, a remaining challenge is how to efficiently analyze data collected from GWASs. This is critical given that most complex diseases or traits are only weakly associated with causal genetic variants. The power of GWASs depends on linkage disequilibrium (LD) between DNA markers (e.g., genotyped SNPs) and causal loci that often are not genotyped: if there exists a causal locus in a DNA region, due to LD, it is likely that one or more SNPs nearby will be associated with the trait too. Hence, it is intuitively reasonable to consider multiple loci in LD in a region; this is the problem we consider here. Roughly speaking, most existing analysis methods fall into one of the two categories: single-locus versus multilocus approaches. The basic idea of the former is to analyze single SNPs one at a time, then combine such individual analyses (Roeder et al 2005 and references therein). For these methods, in addition to possibly failing to utilize information contained in high-order relationships among multiloci, another shortcoming is their reduced power due to the high cost of adjustments for multiple testing that are required to control family-wise error rates or false discovery rates at some nominal levels. On the other hand, to overcome the above two problems with single-locus approaches, multilocus methods jointly model the relationship between a trait and multiloci (e.g., Chapman et al 2003; Qin et al 2002; Schaid et al 2002; Stephens et al 2001; Wei et al 2008; Zhao et al 2003a, b, and references therein). However, there is also an inherent cost of large degrees of freedom (DF) when testing over multiple parameters involved in a joint model.

Wang and Elston (2007) fully recognized the two possibly conflicting goals of capturing information contained in multiloci in LD and reducing the cost of multiple testing or large DF, and thus proposed a weighted score test (WST) to achieve the two goals at the same time: although multiloci are involved, the DF of the WST is only 1. They used simulated data under various scenarios, including mimicking real LD patterns as drawn from HapMap data, to demonstrate improved power of the WST over other existing methods. Here we first study an alternative approach called the sum test in the same context: it is based on the familiar treatment of the trait as the response variable and modeling its mean (e.g. probability of having disease) in terms of multiple SNPs in LD in logistic regression or any generalized linear model (GLM). The basic idea is to test on only one parameter under a working assumption of a common association strength between each of multiple SNPs and the trait. Although in general the association strengths may vary with SNPs, the proposed methods can be more powerful than the WST and several other existing methods while controlling the test size at a specified nominal significance level; similar to the WST, the key is to utilize multiloci while controlling the DF at the minimum 1. Simply put, rather than testing on multiple parameters either individually (followed by a multiple test adjustment as in single-locus approaches) or jointly (with large DF as in most multilocus approaches), the sum test focuses on a scalar function of the multiple parameters with the resulting DF=1.

An original motivation of this work was to explore how the WST works. The numerical results of Chapman and Whittaker (2008) indicate that, if some of the coded SNPs are positively correlated while others are negatively correlated, the WST and the sum test may have low power. Here we exploit theoretical properties of the sum test. In particular, we provide a theoretical explanation for the possible failure of the sum test: if the association strengths between the outcome and the SNPs are not close, especially if some are in opposite directions, the sum test may have low power; this explanation differs from the heuristic argument given by Chapman and Whittaker (2008) that is based on the correlations among the coded SNPs. In addition, the theoretical result motivated the development of four new tests: two of them are asymptotically equivalent to an estimated most powerful test, and are weighted versions of the other two that are closely related to Goeman’s test (Goeman et al 2006), which was found to perform well across a wide range of scenarios by Chapman and Whittaker (2008). The proposed tests are all based on a sum of squared marginal regression coefficient estimates or of squared score statistics, possibly weighted inversely by the variances of the coefficient estimates or score statistics. In contrast to the use of computationally intensive permutations or simulations as for Goeman’s test, we derive asymptotic distributions of the proposed tests, easily yielding p-values. In spite of their simplicity, our numerical studies showed that the proposed tests performed similarly to Goeman’s test. A distinguishing feature of the new tests and Goeman’s test is their use of a diagonal working covariance matrix, rather than a full covariance matrix as used in the usual Wald or score test. We offer some intuitive explanations on the possibly improved power of the new tests over the usual Wald or score test, illustrating the well known fact that there is no uniformly most powerful test in the current context.

Suppose that we have *m* independent observations (*Y _{i}*,

$$h[E({Y}_{i})]={\beta}_{0}+\sum _{j=1}^{k}{X}_{ij}{\beta}_{j}={\beta}_{0}+{X}_{i}\beta ,$$

where *E*(*Y _{i}*) is the mean of

$$\text{Logit}Pr({Y}_{i}=1)={\beta}_{0}+\sum _{j=1}^{k}{X}_{ij}{\beta}_{j},$$

(1)

where *Y _{i}* = 0 or 1 indicates whether subject

In contrast to the global test, another extreme is to conduct individual SNP-by-SNP tests: rather than including all the *k* SNPs, we include only one SNP in

$$\text{Logit}Pr({Y}_{i}=1)={\beta}_{M,0j}+{X}_{ij}{\beta}_{M,j},$$

(2)

where we explicitly distinguish *β _{M}* = (

Most of the existing methods fall into one of the above two extreme categories. For example, the generalized Hotelling’s *T*^{2} test (Fan and Knapp 2003; Xiong et al 2002) and haplotype analyses (e.g., Chapman et al 2003; Schaid et al 2002; Stephens et al 2001; Zhao et al 2003a, b, and references therein) belong to the first category for their joint testing on multiple parameters simultaneously, while other univariate approaches build on SNP-by-SNP tests (Roeder et al 2004 and references therein). The limitations of the two extremes have been well recognized and motivated the development of new methods. A general strategy is to capture full information in all *β _{j}*’s while reducing the cost of large DF or adjustment for multiple testing, but the key is how to reconcile the two possibly conflicting goals. Wang and Elston (2007) proposed such an approach, called weighted score test (WST): the WST puts higher weights on more important components of Fourier-transformed genotypes and minimizes the DF at one. Below we first study a simple alternative to WST, the sum test; a theoretical analysis clearly indicates a possible limitation of the sum test, which further motivates the development of our new tests.

We first formulate the sum test from a new perspective of striking a compromise between joint modeling and its resulting large DF: while all the SNPs in LD are to be used, we make a key and possibly incorrect working assumption that the SNPs are all equally associated with the trait; that is, rather than unnecessarily aiming to estimate separate *β _{j}*’s in (1), we use a common

$$\text{Logit}Pr({Y}_{i}=1)={\beta}_{c0}+\sum _{j=1}^{k}{X}_{ij}{\beta}_{c}.$$

(3)

To address the question of whether there is any association between the trait *Y _{i}* and any SNP, we test

Below we explore the advantages and limitations of the sum test. Intuitively, because it tests on only one parameter *β _{c}*, there will be no power loss due to large DF (or multiple testing adjustment). Generally, the common association parameter

$${\widehat{\beta}}_{c}=\frac{{\sum}_{j=1}^{k}{\sum}_{i=1}^{m}{X}_{ij}^{2}{\widehat{\beta}}_{M,j}}{{\sum}_{i=1}^{m}{\left({\sum}_{j=1}^{k}{X}_{ij}\right)}^{2}},$$

(4)

where * _{c}* is the MLE of

Under the global *H*_{0}: *β _{M}* = 0, it is trivial to see that

The problem with the sum test is that, as shown in equation (4), because the estimated common effect * _{c}* is a linear combination of the components of

$$\mathit{SumSqB}={\widehat{\beta}}_{M}^{\prime}{\widehat{\beta}}_{M}=\sum _{j=1}^{k}{\widehat{\beta}}_{M,j}^{2},$$

or, its weighted version with a weight assigned to each component of * _{M}* based on its variance:

$$\mathit{SumSqB}w={\widehat{\beta}}_{M}^{\prime}\text{Diag}{({V}_{M})}^{-1}{\widehat{\beta}}_{M}=\sum _{j=1}^{k}{\widehat{\beta}}_{M,j}^{2}/{v}_{M,j},$$

where Diag(*V _{M}*) =

Below we describe the asymptotic distributions of the above two test statistics. Based on the theory of generalized estimating equations as shown in Appendix D, under *H*_{0}: (*β _{M}* = 0,

$$a=\frac{{\sum}_{j=1}^{k}{c}_{j}^{3}}{{\sum}_{j=1}^{k}{c}_{j}^{2}},\phantom{\rule{0.38889em}{0ex}}b=\sum _{j=1}^{k}{c}_{j}-\frac{{\left({\sum}_{j=1}^{k}{c}_{j}^{2}\right)}^{2}}{{\sum}_{j=1}^{k}{c}_{j}^{3}},\phantom{\rule{0.38889em}{0ex}}d=\frac{{\left({\sum}_{j=1}^{k}{c}_{j}^{2}\right)}^{3}}{{\left({\sum}_{j=1}^{k}{c}_{j}^{3}\right)}^{2}}.$$

To calculate a p-value, for example for the SumSqB test,

$$Pr(\mathit{SumSqB}>s\mid {H}_{0})\approx Pr({\chi}_{d}^{2}>(s-b)/a).$$

Note that in the above two new tests, we used * _{M}* from marginal models (2), not from joint model (1). Intuitively, because of collinearity, the components of have much larger variances than their counter parts of

By the asymptotic connection between the MLE and the score statistic (Cox and Hinkley 1974, p.315), mimicking SumSqB and SumSqBw, we can construct a modified score test based on the marginal model (2). It is easy to verify that the marginal score statistic for *β _{M,j}* under

$${U}_{M,j}=\sum _{i=1}^{m}{X}_{ij}({Y}_{i}-\overline{Y})={X}_{.j}^{\prime}(Y-\overline{Y}),$$

where
$\overline{Y}={\sum}_{i=1}^{m}{Y}_{i}/m$ and 1 is a column vector with elements all 1’s. Denote *U _{M}* = (

$$\mathit{SumSqU}={U}_{M}^{\prime}{U}_{M}=(Y-\overline{Y})\prime X{X}^{\prime}(Y-\overline{Y}),$$

where *Y* and *X* are the response vector and the design matrix respectively. A weighted form is

$$\mathit{SumSqU}w={U}_{M}^{\prime}\text{Diag}{({I}_{f})}^{-1}{U}_{M},$$

where *I _{f}* =

As for SumSqB and SumSqBw, the null distributions of the two test statistics have quadratic forms and can be approximated by
$a{\chi}_{d}^{2}+b$. Furthermore, as for SumSqB and SumSqBw, the above two test statistics differ from the usual score test statistic
${U}_{M}^{\prime}{I}_{f}^{-1}{U}_{M}$ with *I _{f}* =

Note that the SumSqU test itself can be motivated as a modification to the sum test. It is easy to derive that the score statistic of the sum test for *H*_{0}: *β _{c}* = 0 in model (3) is
${U}_{c}={\sum}_{j=1}^{k}{U}_{M,j}$, which may be sensitive to varying signs of the components of

Below we propose a new test by estimating the most powerful test, which is to be proved to be asymptotically equivalent to the SumSqBw and SumSqUw tests, serving to unify the five new tests proposed here.

It is well known that there is no uniformly most powerful unbiased (UMPU) test on multiple parameters (Cox and Hinkley 1974). On the other hand, under local alternatives
$\beta =0+\delta b/\sqrt{m}$ with *b* as a fixed vector and *δ* a scalar, the most powerful test is based on the test statistic *T _{MP}* =

$${T}_{\mathit{EMP}}={\widehat{\beta}}_{M}^{\prime}{U}_{M}.$$

By a standard Taylor expansion of the marginal score equation *U _{M,j}*(

$${\widehat{\beta}}_{M}={I}_{M,d}^{-1}{U}_{M}+{O}_{p}({m}^{-1}),$$

(5)

where *I _{M,d}* is a diagonal matrix with −

$${T}_{\mathit{EMP}}\approx {U}_{M}^{\prime}{I}_{M,d}{U}_{M}.$$

As before we approximate the asymptotic distribution of *T _{EMP}* using a scaled and shifted chi-squared distribution.

By the relationship between an observed and an expected information matrix, we have *I _{M,d}* ≈ Diag(

$${U}_{M}^{\prime}{I}_{M,d}{U}_{M}\approx {U}_{M}^{\prime}\text{Diag}{({I}_{f})}^{-1}{U}_{M}=\mathit{SumSqU}w,$$

which suggests that the SumSqUw is asymptotically equivalent to the estimated most powerful (EMP) test. Similarly, using (5), it is easy to verify that SumSqBw, SumSqUw and EMP are all asymptotically equivalent.

Equation (5) also directly suggests an estimator for *Cov*(* _{M}*) as used in deriving the asymptotic distributions of SumSqB and SumSqBw:

$$\mathit{Cov}({\widehat{\beta}}_{M})={I}_{M,d}^{-1}\mathit{Cov}({U}_{M}){I}_{M,d}^{-1}={I}_{M,d}^{-1}{I}_{f}{I}_{M,d}^{-1},$$

which is the sandwich (also called robust or empirical) covariance estimator used in the GEE for * _{M}* (Appendix D).

Interestingly, because the score statistic of the sum test for *H*_{0}: *β _{c}* = 0 in model (3) is
${U}_{c}={\sum}_{j=1}^{k}{U}_{M,j}={1}^{\prime}{U}_{M}$, the sum test can be viewed as an estimated most powerful test by estimating

Finally, we note that the usual joint score test, and thus its asymptotically equivalent LRT and Wald test, can be also regarded as estimated most powerful tests. If we use , instead of * _{M}* of

Goeman et al (2006) proposed a general empirical Bayes method to test on a large number of parameters, as applicable to the *β _{j}*’s in the joint logistic regression model (1). A key idea is to treat

$$\begin{array}{l}{T}_{Go}=\frac{1}{2}({U}^{\prime}U-\text{Trace}({I}_{f}))\\ =\frac{1}{2}((Y-\overline{Y})\prime X{X}^{\prime}(Y-\overline{Y})-\overline{Y}(1-\overline{Y})\text{Trace}((X-\overline{X})\prime (X-\overline{X}))),\end{array}$$

where *U* is the score statistic for *β* under *H*_{0}: *β* = 0 in joint model (1), and *I _{f}* is the corresponding Fisher information matrix:

The null distribution of *T _{GO}* is unknown and has to be estimated by permutation or simulation, which is computationally intensive.

As noted by Chapman and Whittaker (2008), given *Y* (as under permutation), the second term of *T _{GO}* is non-random while the first term has a quadratic form of asymptotic Normal variates. Noting that

The Goeman test was originally proposed for a *large* number of parameters with a relatively small sample size, which is not true in the current context: we have a sample size of hundreds to thousands while the number of parameters is less than 100, e.g. only around 20 in our examples. Hence, the good performance of the Goeman test in the current context as obtained by Chapman and Whittaker (2008) and confirmed here later is somewhat puzzling. Here we offer an explanation. By the specified prior distribution of *β _{j}*’s as iid with mean 0 and variance

We performed simulation studies by largely following the set-ups given in Wang and Elston (2007) with *k* = 4, 10 and 20 marker SNPs, and sample size *n* = 100, 200 and 500. The disease-causing SNP was assumed to be in the center of the marker SNPs, but was removed from the data. First, we generated a latent vector from a multivariate normal distribution with one of three covariance structures: a compound symmetry (CS) with an equal pairwise correlation *ρ* = 0.4, an AR-1 with the correlation *ρ _{ij}* = 0.8

$$\text{Logit}Pr({Y}_{i}=1)={\beta}_{0}+log(OR){X}_{i0},$$

(6)

where we chose *β*_{0} = −log 4 to give a background (i.e. not caused by the SNP) disease probability of 0.2, and the odds ratio (OR) ranged from 1 (i.e. no association) to 2. Finally, following the case-control design, we sampled *n* cases (with *Y _{i}* = 1) and

In addition to the sum test and the weighted score test, we also considered a few other ones: the first (denoted as L-G) was a global LRT on *H*_{0}: *β* = 0 based on the joint model (1); the second was the generalized Hotelling’s *T*^{2} test; third, we included the univariate test and Goeman’s test based on permutations, U-P and Go-P respectively; finally, we had our five asymptotic tests based on the sum of (weighted) squares of the components of * _{M}* or of the score statistics.

We considered *k* = 4, 10 and 20 marker SNPs, sample size *n* = 100, 200 and 500, MAF=0.2, 0.3 and 0.4 for the disease-causing SNP, and used nominal significance levels *α* = 0.05 and *α* = 0.01; because the conclusions were similar, to save space, we only gave results for *k* = 10, *n* = 500, MAF=0.2 and *α* = 0.05.

First, with the CS correlation structure suggesting all the marker SNPs contained the same amount of information about the disease-causing SNP, as shown in Table 1, it is obvious that the sum test and the WST were the winners, closely followed by the Goeman test and our proposed five tests. Second, for the AR-1 correlation structure (Table 2), the sum test, the Goeman test and our proposed five tests performed similarly and better than other methods. Third, with random pairwise SNP correlations (Table 3), the conclusion was similar to that for the CS: the sum test and WST performed best, having a slight edge over the Goeman test and our proposed five tests. Fourth, across all the scenarios, our proposed five tests performed similarly to each other and to the Goeman test, which however depends on the use of computationally intensive permutations or simulations. Note that for the AR-1 or the random correlation structure, pairwise correlations *ρ _{ij}* were not constant, representing a situation in which the working assumption of “a common association strength” underlying the sum test is violated; nevertheless, the key for the good performance of the sum test (and WST) is that the associations between the response and SNPs were (mostly) in the same direction.

Empirical sizes and powers of various tests with nominal significance level *α* = 0.05 for simulated data based on the real LD pattern of gene CHI3L2 with MAF> 0.2 (#SNP=16).

As in Wang and Elston, we conducted a simulation study based on real LD patterns within the CHI3L2 gene as observed in HapMap data. We downloaded the SNPs of the CHI3L2 gene for the 90 CEU (Utah residents with ancestry from northern and western Europe) individuals from the HapMap site in June 2008. As in Wang and Elston (2007), first, we excluded SNPs with MAF ≤ 0.2, leaving 23 SNPs. Second, we did a single imputation for each of the missing genotypes by randomly drawing an observed genotype of the same SNP. Third, we used the dosage coding for the SNPs and tried to minimize the number of negative correlations among them. Fourth, we deleted redundant SNPs that were perfectly correlated with other SNPs. There was still substantial LD among the remaining 17 SNPs as indicated by the distribution of their pairwise Pearson’s correlation coefficients, which ranged from −0.388 to 0.989 with the three quartiles *Q*_{1}, *Q*_{2} and *Q*_{3} as 0.364, 0.544 and 0.738. Fifth, we repeatedly sampled (with replacement) subjects from the 90 CEU individuals. Finally, as Wang and Elston, we chose the SNP rs2182114 as disease-causing and generated disease indicators from the logistic regression model (6) with the same *β*_{0} and four possible values of OR. The results for two sample sizes are shown in Table 2. It is clear that the sum test, Goeman’s test and our proposed five tests performed similarly and were the winners; in particular, they could have an extra 3% power over that of WTS or the univariate test. The global LRT and the Hotelling’s *T*^{2} test had low power, and even possibly inflated test sizes.

For the CHI3L2 gene, we also considered including all the SNPs with MAF> 0.05, rather than MAF> 0.2, and followed exactly the same steps, including choosing SNP rs2182114 as disease-causing. In this way, we ended up with 23 SNPs. Among the 253 SNP pairs, we had 42 pairs with negative correlation coefficients. The distribution of the correlation coefficients for all the pairs was summarized below: min= −0.388, the first quartile *Q*_{1} = 0.142, *Q*_{2} = 0.375, *Q*_{3} = 0.602, and max= 0.989. Hence, compared to that with only 17 SNPs, now some SNPs had weaker correlations, possibly leading to the power loss for the sum test (Table 3). The Goeman test and our proposed tests seemed to be the winners, slightly more powerful than the univariate test, then closely followed by the WST. The other two tests again had much lower power. It seems that the weighted version SumSqBw had a slight edge over SumSqB, albeit not necessarily significant; in contrast, the power difference between SumSqUw and SumSqU was hardly visible.

As in Chapman and Whittaker (2008), we also considered the region of gene IL21R, in which LD was low. We followed exactly the same steps as for gene CHI3L2 except that as in Chapman and Whittaker (2008), the disease-causing SNP was selected *randomly* and then excluded from the data in each simulation run. At the end, we had 28 SNPs. There were 54 SNP pairs with negative correlation coefficients. A summary of the correlation coefficients among all the SNP pairs was the following: min= −0.564, *Q*_{1} = 0.097, *Q*_{2} = 0.212, *Q*_{3} = 0.459, and max= 0.991. Hence, the LD pattern in this region was more extreme with some negative or low pairwise correlations, which might explain why the power of the sum test (and WST) was much lower than Goeman’s test, our proposed five tests and the univariate test (Table 4). It is interesting to note that the univariate test appeared to be the winner, more powerful than Goeman’s test and our proposed five tests in this case.

We have two immediate observations based on the above numerical studies. First, there is no uniform winner, though often Goeman’s test and our proposed five tests have high power and sometimes, the sum test is most powerful. In fact, when testing on multiple parameters, it is well known that no (asymptotically) uniformly most powerful unbiased (UMPU) test exists; if and only if the true value of the parameter vector is known, then an asymptotically UMPU test can be constructed (Cox and Hinkley 1974, p.319). In practice, of course, we do not know the true values of the parameters and hence have no UMPU test, suggesting that, without much prior knowledge, we perhaps should try more than one test. Second, it may be surprising that our proposed five tests can be more powerful than the traditional Wald, score or likelihood ratio test. To be concrete, we contrast the SumSqB test and the standard Wald test; the former uses an identity matrix, rather than a covariance estimate as used in the latter. Intuitively it seems that using a good covariance matrix as in the Wald test should be *always* more productive, but our numerical results suggest otherwise. We first thought that it might be due to bad covariance estimates, which however was not likely given that we had a large sample size and relatively a much smaller number of parameters. Below we offer an explanation under a simplified scenario, which in addition vividly illustrates the above point that the power of a test depends on the true parameter values.

We consider *k* = 2 parameters, *β* = (*β*_{1}, *β*_{2})′, and its estimate ~ *N*(*β*, *V*) with *V* known: *Var*(_{1}) = *Var*(_{2}) = 1/500 and *corr*(_{1}, _{2}) = *ρ.* To test *H*_{0}: *β* = 0, we can compare the following four tests: the Wald test with a test statistic ′*V*^{−l}, the SumSqB test with *SumSqB* = ′, a univariate test with *Max* = max(|_{1}|, |_{2}|), and a simplified sum test with *Sum* = _{1} + _{2}. The Max test is similar to the U-P test except that the multiple testing on the components of *β* is analytically adjusted, rather than by permutation. The previous sum test (4) is simplified to the current special case if each covariate is standardized to have the same sample variance 1. For any of the above four test with test statistic *T*(*β*), the rejection region of the test is simply *R*(*β*) = {*β*: |*T*(*β*)| > *c*}. The constant *c* can be obtained either analytically or numerically (Conneely and Boehnke 2007) to guarantee that ∫_{R}_{(}_{β}_{)}*f*_{0}(*β*)*dβ*_{1}*dβ*_{2} = 1−*α*=0.95, where *f*_{0}(*β*) is the density function of under *H*_{0}, i.e., that of a bivariate *N*(0, *V*).

Figure 1 plots the rejection regions of the four tests with two different values of *ρ* = 0.3 and *ρ* = 0.7. In addition, 1000 random variates drawn from *N*(*β*, *V*) are also plotted on each panel. Three non-zero *β* values were considered: *β* = (0, 0.05)′, (0.05, 0.05) or (−0.05, 0.05). Table 5 gives the empirical power of each test under each set-up. It is most illustrative to notice the varying rejection regions of the tests, which determine the power of the tests. For example, along the identity (i.e. 45-degree) line, the rejection region of the Wald test covers that of the other tests while the sum test gives the smallest rejection region, suggesting that among the four tests, the Wald and sum tests are respectively the least and most powerful when *β*_{1} ≈ *β*_{2} ≠ 0; on the other hand, if *β*_{1} ≈ − *β*_{2} ≠ 0, the roles of the Wald and sum tests are reversed, as confirmed by Table 5. In particular, it is clear that the use of the (true!) covariance matrix of in the Wald test may lose power as compared to the SumSqB test (and sum test) if the components of the true *β* are indeed close to each other; furthermore, the power difference increases with the correlation between the components.

We have studied association mapping with multiple SNPs, possibly in LD. These SNPs may form an LD block, or more generally fall within a sliding window in a GWA scan. All the tests can be equally applied when other covariates are included in a regression model. We have mainly studied two ways of boosting statistical power: one is through dimension reduction as done in the sum test, while the other is to ignore possible correlations among the parameter estimates or score statistics as in the class of the SumSq tests. Interestingly, both the sum test and the class of the SumSq tests, along with the standard joint score test (and its asymptotically equivalent Wald test and LRT), can be all regarded as *estimated* most powerful tests with various estimates of the true association parameters.

A surprising result of this study is the confirmed good performance of the sum test in detecting possible associations between a trait and multiple SNPs in LD under some situations. The basic idea underlying the sum test is to adopt a working assumption on the existence of a common association strength between the trait and each of multiple SNPs. Intuitively, if any of the SNPs is associated with the trait, due to LD, other correlated SNPs should also be associated with the trait; even if their association strengths vary, their “average” (or more generally, some function of them) is likely to differ from 0, thus motivating the sum test with DF=1 on this “average” effect. Under situations with both positive and negative SNP associations with the trait, resulting in that the targeted “average” effect is much weaker than some individual effects, the sum test, as WST, may not be powerful. Largely for this reason, Chapman and Whittaker (2008) did not recommend the use of the sum test. However, because the sum test, as any SumSq test, is so easy and general to use with applicability to either continuous or discrete traits with or without other covariates, and to other study designs as long as a GLM can be employed, we do not dismiss its use. The good performance of the sum test is not limited to simulated data: Zhong and Pan (2008) applied it to a GAW16 Rheumatoid Arthritis (RA) dataset (Plenge et al 2007); for the four LD blocks containing well known RA-associated loci, among the sum test, Goeman’s test and univariate test, each test won in one case while all three performed similarly well in the fourth case (with empirical power close 1 due to too strong signals). Nevertheless, as shown here and in Zhong and Pan (2008), the performance of the sum test depends on the coding of SNPs in an unknown way; further studies are needed to identify optimal SNP codings for the sum test for any given dataset.

Our proposed five tests all have a quadratic form as a sum of squares of (weighted) marginal coefficient estimates or score statistics. It is perhaps surprising that, though the covariance matrix of the marginal coefficient estimates or score statistics can be consistently estimated, the new tests use only an identity matrix or only the diagonal elements of the corresponding covariance matrix, deviating from the standard Wald test and score test. We tried with the use of the full covariance matrices, and the results (not shown) were not good; in fact, the resulting Wald or score test was asymptotically equivalent to the LRT (L-G) for the joint model (1), which did not perform well as shown in our numerical examples. We emphasize that the possibly improved (or degraded) performance of our proposed tests over the standard LRT, Wald or score test should generally hold for any regression models, even with the presence of other covariates. Rather than using a sum of squares, we also tried a sum of absolute values of (weighted) marginal coefficient estimates; it performed no better than the sum of squares (results not shown). Importantly, all of the proposed five tests can be conducted by recoursing to their asymptotic null distributions, avoiding the use of time-consuming permutations or simulations as for Goeman’s test. We also compared the use of the asymptotic null distributions of the proposed tests to that of their permutational distributions, leading to similar results (not shown). Because of the consistently good and similar performance of Goeman’s test and our proposed tests, and computational advantages of our proposed tests, especially with the sum of the squared score statistics, over Goeman’s test, we recommend the routine use of SumSqU and SumSqUw, along with the sum test, the global LRT (or Wald or score test) and the univariate test (U-P) for the lack of any uniformly most powerful test.

This research was partially supported by NIH grants GM081535 and HL65462. The author is grateful to Drs Tao Wang and Rob Elston for sharing their data and computer program, to Fang Han for help with Figure 1, and to Dr Melanie Wall for helpful discussions that motivated the technical development in Appendices A–C. The author thanks the reviewers for helpful comments.

Denote *Y* = (*Y*_{1}, …, *Y _{m}*)′,

$${\widehat{\beta}}_{c}={({X}_{c}^{\prime}{X}_{c})}^{-1}{X}_{c}^{\prime}Y={({X}_{c}^{\prime}{X}_{c})}^{-1}{1}^{\prime}{X}^{\prime}Y={({X}_{c}^{\prime}{X}_{c})}^{-1}{1}^{\prime}({X}^{\prime}X)\widehat{\beta},$$

where ${({X}_{c}^{\prime}{X}_{c})}^{-1}{1}^{\prime}({X}^{\prime}X)$ is a row vector with the sum of its elements being

$${({X}_{c}^{\prime}{X}_{c})}^{-1}{1}^{\prime}({X}^{\prime}X)1={({X}_{c}^{\prime}{X}_{c})}^{-1}({X}_{c}^{\prime}{X}_{c})=1.$$

Thus * _{c}* is a linear combination of

With collinearity, it is well known that the components of may have large variances, which will not be able to explain the good performance of * _{c}*. In fact, by

$$\mathit{Var}({\widehat{\beta}}_{c})={({X}_{c}^{\prime}{X}_{c})}^{-1}{1}^{\prime}({X}^{\prime}X)\mathit{Cov}(\widehat{\beta})({X}^{\prime}X)1{({X}_{c}^{\prime}{X}_{c})}^{-1}={\sigma}^{2}{({X}_{c}^{\prime}{X}_{c})}^{-1},$$

which is not directly related to possibly nearly singular (*X*′*X*)^{−1}. Furthermore, because the MLE from the marginal models (2) is
${\widehat{\beta}}_{M}={({X}^{\prime}X)}_{d}^{-1}{X}^{\prime}Y$, where (*X*′*X*)* _{d}* =

$${\widehat{\beta}}_{c}={({X}_{c}^{\prime}{X}_{c})}^{-1}{X}_{c}^{\prime}Y={({X}_{c}^{\prime}{X}_{c})}^{-1}{X}_{c}^{\prime}{({X}^{\prime}X)}_{d}{\widehat{\beta}}_{M}=\frac{({\sum}_{i=1}^{m}{X}_{i1}^{2},\dots ,{\sum}_{i=1}^{m}{X}_{ik}^{2}){\widehat{\beta}}_{M}}{{\sum}_{i=1}^{m}{\left({\sum}_{j=1}^{k}{X}_{ij}\right)}^{2}}.$$

For simplicity, we assume that there is no intercept term; otherwise, it can be handled similarly (but with more complicated notations). In general, there is no closed form solution for MLEs and * _{c}*, it is difficult to characterize the relationship between the two. However, with the canonical link function

$$\begin{array}{l}\sum _{i=1}^{m}{X}_{i}^{\prime}\left({Y}_{i}-{h}^{-1}({X}_{i}\widehat{\beta})\right)=0,\\ \sum _{i=1}^{m}{X}_{ij}\left({Y}_{i}-{h}^{-1}({X}_{ij}{\widehat{\beta}}_{M,j})\right)=0\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\text{for}\phantom{\rule{0.16667em}{0ex}}j=1,\dots ,k,\\ \sum _{i=1}^{m}\sum _{j=1}^{k}{X}_{ij}\left({Y}_{i}-{h}^{-1}(\sum _{j=1}^{k}{X}_{ij}{\widehat{\beta}}_{c})\right)=0,\end{array}$$

where *h*^{−1}( ) is the inverse of the link function and its operation on a vector or matrix is element-wise. Hence,

$$\begin{array}{l}\sum _{i=1}^{m}\sum _{j=1}^{k}{X}_{ij}{h}^{-1}(\sum _{j=1}^{k}{X}_{ij}{\widehat{\beta}}_{c})=\sum _{i=1}^{m}\sum _{j=1}^{k}{X}_{ij}{Y}_{i}\\ =\sum _{i=1}^{m}\sum _{j=1}^{k}{X}_{ij}{h}^{-1}({X}_{i}\widehat{\beta})\\ =\sum _{i=1}^{m}\sum _{j=1}^{k}{X}_{ij}{h}^{-1}({X}_{ij}{\widehat{\beta}}_{M,j}),\end{array}$$

which implicitly determines a relationship between _{c} and , and * _{c}* and

Now we consider the following question: given that the joint model (1) is the true model, and thus (3) is possibly a misspecified model, what does * _{c}* converges to as the sample size goes to infinity? First, obviously, by the consistency of MLE under a correct model, converges to its true value *. Second, based on the property of MLE under a mis-specified model (Huber 1967; White 1982),

Because of the relationship between and _{c}, there is also an implied relation between *β** and
${\beta}_{c}^{\ast}$, say
${\widehat{\beta}}_{c}^{\ast}=r({\widehat{\beta}}^{\ast})$; for example, *r*( ) is a linear function in linear regression. Therefore, testing *H*_{0}:
${\beta}_{c}^{\ast}=0$ in model (3) is equivalent to testing *H _{0}*:

We formulate fitting separate marginal models as joint modeling in the context of generalized estimating equations (GEE) to derive the asymptotic distribution of * _{M}*. In GEE, for “cluster”

$$\text{Logit}\phantom{\rule{0.16667em}{0ex}}Pr({y}_{ij}=1)=\sum _{j=1}^{k}I({z}_{ij}=j){\beta}_{M,0j}+\sum _{j=1}^{k}I({z}_{ij}=j){X}_{ij}{\beta}_{M,j},$$

with a working independence correlation structure. By the GEE theory (Liang and Zeger 1986), * _{M}* has an asymptotic Normal distribution with mean

- Chapman JM, Whittaker J. Analysis of multiple SNPs in a candidate gene or region. Genetic Epidemiology. 2008;32:560–566. [PMC free article] [PubMed]
- Chapman JM, Cooper JD, Todd JA, Clayton DG. Detecting disease associations due to linkage disequilibrium using haplotype tags: a class of tests and the determinants of statistical power. Hum Hered. 2003;56:18–31. [PubMed]
- Clayton D, Chapman J, Cooper J. Use of unphased multilocus genotype data in indirect association studies. Genet Epidemiol. 2004;27:415–428. [PubMed]
- Conneely KN, Boehnke M. So many correlated tests, so little time! Rapid adjustment of p values for multiple correlated tests. Am J Hum Genet. 2007;81:1158–1168. [PubMed]
- Cox DR, Hinkley DV. Theoretical Statistics. Chapman and Hall; London: 1974.
- Daly MJ, Rioux JD, Schaffner SF, Hudson TJ, Lander ES. High-resolution haplotype structure in the human genome. Nat Genet. 2001;29:229–232. [PubMed]
- Fan R, Knapp M. Genome association studies of complex diseases by case-control designs. Am J Hum Genet. 2003;72:850–868. [PubMed]
- Gabriel SB, Schaffner SF, Nguyen H, Moore JM, Roy J, Blumenstiel B, Higgins J, DeFelice M, Lochner A, Faggart M, et al. The structure of haplotype blocks in the human genome. Science. 2002;296:2225–2229. [PubMed]
- Goeman JJ, van de Geer S, van Houwelingen HC. Testing against a high dimensional alternative. J Royal Stat Soc B. 2006;68:477–493.
- Huber PJ. Proceedings if the Fifth Berkeley Symposium in Mathematical Statistics and Probability. Berkeley: University of California Press; 1967. The behavior of maximum likelihood estimates under nonstandard conditions.
- Johnson NL, Kotz S. Distributions in Statistics, Continuous Univariate Distributions. 2. Boston: Houghton-Mifflin; 1970.
- Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73:13–22.
- McCullagh P, Nelder JA. Generalized linear models. Chapman and Hall; London: 1983.
- Plenge RM, et al. TRAF1-C5 as a risk locus for rheumatoid arthritis–a genomewide study. N Engl J Med. 2007;357:1199–209. [PMC free article] [PubMed]
- Qin ZS, Niu T, Liu JS. Partition-ligation EM algorithm for haplotype inference with single nucleotide polymorphisms. Am J Hum Genet. 2002;71:1242–1247. [PubMed]
- Roeder K, Bacanu SA, Sonpar V, Zhang X, Devlin B. Analysis of single-locus tests to detect gene/disease associations. Genet Epidemiol. 2005;28:207–219. [PubMed]
- Schaid DJ, Rowland CM, Tines DE, Jacobson RM, Poland GA. Score tests for association between traits and haplotypes when linkage phase is ambiguous. Am J Hum Genet. 2002;70:425–434. [PubMed]
- Stephens M, Smith NJ, Donnelly P. A new statistical method for haplotype reconstruction from population data. Am J Hum Genet. 2001;68:978–989. [PubMed]
- Wang T, Elston RC. Improved power by use of a weighted score test for linkage disequilibrium mapping. Am J Hum Genet. 2007;80:353–360. [PubMed]
- Wei Z, Li M, Rebbeck T, Li H. U-statistics-based tests for multiple genes in genetic association studies. Annals of Human Genetics. 2008;72:821–833. [PMC free article] [PubMed]
- White H. Maximum likelihood estimation of misspecified models. Econometrica. 1982;50:1–25.
- Xiong M, Zhao J, Boerwinkle E. Generalized
*T*^{2}test for genome association studies. Am J Hum Genet. 2002;70:1257–1268. [PubMed] - Zhang J-T. Approximate and asymptotic distributions of Chi-squared-type mixtures with applications. Journal of the American Statistical Association. 2005;100:273–285.
- Zhao H, Pfiffer R, Gail MH. Haplotype analysis in population genetics and association studies. Pharmacogenomics. 2003a;4:171–178. [PubMed]
- Zhao LP, Li S, Khalid N. Assessing haplotype-based association with multiple SNPs in case-control studies. Am J Hum Genet. 2003b;72:1231–1250. [PubMed]
- Zhong W, Pan W. Power comparison of statistical tests of association with multiple SNPs with the GAW16 rheumatoid arthritis data. Research Report 2008–015, Division of Biostatistics, University of Minnesota. 2008. Available at www.biostat.umn.edu/rrs.php.

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |