Genet Epidemiol. Author manuscript; available in PMC 2010 April 1.

Published in final edited form as:

Genet Epidemiol. 2009 April; 33(3): 183–197.

doi: 10.1002/gepi.20364

PMCID: PMC2674317

NIHMSID: NIHMS106122


Recently, a genomic distance-based regression for multilocus associations was proposed (Wessel and Schork [2006] *Am. J. Hum. Genet.* 79:792–806) in which either locus or haplotype scoring can be used to measure genetic distance. Although it allows various measures of genomic similarity and simultaneous analyses of multiple phenotypes, its power relative to other methods for case-control analyses is not well known. We compare the power of traditional methods with this new distance-based approach, for both locus-scoring and haplotype-scoring strategies. We discuss the relative power of these association methods with respect to five properties: (1) the marker informativity; (2) the number of markers; (3) the causal allele frequency; (4) the preponderance of the most common high-risk haplotype; (5) the correlation between the causal single-nucleotide polymorphism (SNP) and its flanking markers. We found that locus-based logistic regression and the global score test for haplotypes suffered from power loss when many markers were included in the analyses, due to many degrees of freedom. In contrast, the distance-based approach was not as vulnerable to more markers or more haplotypes. A genotype counting measure was more sensitive to the marker informativity and the correlation between the causal SNP and its flanking markers. After examining the impact of the five properties on power, we found that on average, the genomic distance-based regression that uses a matching measure for diplotypes was the most powerful and robust method among the seven methods we compared.

With the advent of high-throughput single-nucleotide polymorphism (SNP) data, using an appropriate statistical method to test the association between a large number of genetic markers and disease is of major interest. For case-control data, a multilocus association analysis is commonly used to evaluate the simultaneous association of several genetic markers with a trait. There are two broad categories of test statistics: locus-scoring methods that test only the main effects of markers, and haplotype-scoring methods that test haplotype effects. With binary traits for case-control data, logistic regression is widely used for modeling the association between markers and disease [Cordell and Clayton, 2002; Chapman et al., 2003]. Cordell and Clayton [2002] proposed a stepwise regression approach applicable to either case-control data or nuclear-family data, with case-control data modeled via unconditional and family data via conditional logistic regression. Chapman et al. [2003] derived a multilocus score test and showed that the power of association tests is determined by two coefficients of determination: that between markers and the unobserved causal locus and that between the causal locus and the trait. They also found that in most cases, the simplest method of scoring (locus coding), which does not require haplotype phase of the markers, is generally more powerful than scoring analyses that include haplotype information. The locus-scoring analysis has the merits of lower complexity and possibly fewer degrees of freedom for either the likelihood ratio test or the score test, potentially providing greater power.

For a haplotype-scoring approach, Schaid et al. [2002] developed a score statistic for generalized linear models that tests the association between haplotypes and a wide variety of traits, including binary, ordinal, and quantitative traits. This haplotype-scoring approach can provide critical information regarding the function of a gene and can increase power when several mutations within a single gene interact to create a “super allele” that has a large effect on a disease. However, the phase ambiguity and the larger number of degrees of freedom usually compromise the potential gain in power. A similar haplotype approach, proposed by Zaykin et al. [2002], uses a likelihood ratio test to detect the association of haplotypes with discrete and continuous traits in samples of unrelated individuals. Schaid et al. [2002] and Zaykin et al. [2002] utilized the expectation-maximization (EM) algorithm to estimate the posterior distribution of pairs of haplotypes for each subject, yet power loss can occur due to haplotype phase uncertainty [Zaykin et al., 2002]. To design studies to test associations of haplotypes with traits, Schaid [2006] provided guidance to determine the sample size needed to achieve the desired power for a study. His derivations covered both phase-known and phase-unknown haplotypes, allowing evaluation of the loss of efficiency due to unknown phase.

Although haplotype-scoring approaches can be more powerful in some situations, Clayton et al. [2004] found that locus scoring of multiple tag-SNPs, not based on haplotypes, can be more powerful than analyses based on haplotypes. To investigate the ability of detecting association by locus-scoring and haplotype-scoring analyses, North et al. [2006] simulated data based on the patterns of linkage disequilibrium (LD) from the HapMap project [The International HapMap Consortium, 2003]. They used one to four markers surrounding the putative susceptibility locus as the marker sets and found that there was little difference in the performance of locus-scoring and haplotype-scoring methods.

In addition to the above methods, another branch of multilocus association methods uses genotype similarity for pairs of subjects, where genotype can span unphased markers as well as haplotype information. Here, the analyzed unit is a pair of subjects instead of a single subject. Several researchers have proposed statistics based on the excess in similarities among haplotypes in affected individuals [van der Meulen and Te Meerman, 1997; Cheung and Nelson, 1998; Grant et al., 1999; Devlin et al., 2000; Bourgain et al., 2001]. Tzeng et al. [2003a,b] and Schaid et al. [2005] proposed statistics to detect association by contrasting within-group similarities of cases and controls. Simulation studies have shown that neither Pearson's χ^{2} statistic nor the similarity-based method is uniformly more powerful than the other [Tzeng et al., 2003a]. The power of these two methods can be very different, depending on the population evolutionary history [Yuan et al., 2006]. The contrast of within-group similarities relies on a test statistic, *T* = δ/√Var(δ), where the numerator contrasts the within-group similarities, δ = **p**′Π_{S}**p** − **q**′Π_{S}**q**, where **p** and **q** are the haplotype frequency vectors of cases and controls, respectively, and Π_{S} is the matrix of pairwise haplotype similarities.

A limitation of contrasting within-group similarities [Tzeng et al., 2003a,b; Schaid et al., 2005], however, is that there can exist large differences between **p** and **q** for which δ is nevertheless zero, so that the test has a blind area.

Recognizing the blind area of the existing similarity-based test [Tzeng et al., 2003a,b], Sha et al. [2007] provided a new association test using haplotype similarity, to eliminate the potential blind spot. They added the average within-group similarity of cases to that of controls, and compared this total within-group similarity with the average between-group similarity for cases and controls. Their statistic is *S* = *U*_{1}/√Var(*U*_{1}), where *U*_{1} = **p**′Π_{S}**p** + **q**′Π_{S}**q** − 2**p**′Π_{S}**q** contrasts the total within-group similarity with the between-group similarity.

In addition to the above similarity-based tests, a genomic distance-based regression for multilocus association analysis was recently proposed [Wessel and Schork, 2006], in which either locus or haplotype scoring can be applied. Wessel and Schork [2006] provided seven measures of genomic similarity, some based on genotypes and some based on haplotypes. The term “distance,” or “dissimilarity,” is dual with “similarity,” and so similarity can be transformed into genomic distance. A regression-based analysis was proposed to test the association of phenotype similarity with genotype similarity. This method allows various measures of genomic similarity and simultaneous analyses of multiple phenotypes. It can be used for gene expression data and can accommodate population stratification by simply including relevant covariates in the regression. Despite these merits and flexibility, the relative power of this approach for case-control data, versus some standard methods (e.g., logistic regression [Cordell and Clayton, 2002; Chapman et al., 2003] or haplotype score tests [Schaid et al., 2002]), is unknown. The power of this new method will surely vary with the choice of similarity measure, and it would be helpful to have some a priori knowledge about the optimal choice of similarity measure.

In this work, we first show that the genomic distance-based regression reduces to a diplotype dissimilarity test when a specific similarity measure is applied to analyses of case-control data, linking this method with the class of haplotype similarity/dissimilarity tests [Tzeng et al., 2003a,b; Yuan et al., 2006; Sha et al., 2007; Klei and Roeder, 2007]. Although there are some power studies, previous work focused on comparing the haplotype similarity tests with frequency-based statistics, such as Pearson's χ^{2} test. The relative power of the class of haplotype similarity tests with some standard association methods is not clear. Here, we compare the power of the genomic distance-based regression with that of logistic regression and score tests for haplotypes [Schaid et al., 2002] under the scenario of one causal locus simulated on genotypes from the HapMap data [The International HapMap Consortium, 2005]. From simulations, we explore the limitations and the relative power of a variety of multilocus association methods.

Suppose that there are *L* diallelic markers, and *x*_{L×1} is a vector of length *L* that codes the number of copies of an allele at each locus, with possible values of 0, 1, or 2. Let *Y* be the disease status, with 1 for affected and 0 for unaffected. The logistic regression model is

$$\mathrm{log}\frac{P(Y=1\mid \mathit{x})}{1-P(Y=1\mid \mathit{x})}={\beta}_{0}+{\beta}^{\prime}\mathit{x}.$$

(1)

To test whether any of the *L* markers are associated with the disease, we test the hypothesis *H*_{0} : **β** = **0** versus *H*_{1} : **β** ≠ **0**. The log-likelihood for **Y** is

$$l(\beta ;\mathit{Y})=\underset{i=1}{\overset{N}{\Sigma}}[{y}_{i}({\beta}_{0}+{\beta}^{\prime}{\mathit{x}}_{i})-\mathrm{log}(1+\mathrm{exp}({\beta}_{0}+{\beta}^{\prime}{\mathit{x}}_{i}))],$$

(2)

where *N* is the number of subjects. We can test *H*_{0} versus *H*_{1} using the difference of the deviance statistics [Dobson, 2002],

$$\Delta D={D}_{0}-{D}_{1}=2\left[l(\widehat{\beta};\mathit{y})-l(\tilde{\beta};\mathit{y})\right]\sim {\chi}_{L}^{2},$$

(3)

where *D*_{0} is the deviance for the null model, *D*_{1} is the deviance for the alternative model, and $\widehat{\beta}$ and $\tilde{\beta}$ are the MLEs under *H*_{1} and *H*_{0}, respectively. Asymptotically, Δ*D* has a χ^{2} distribution with degrees of freedom equal to *L*. Rejection of the null hypothesis suggests that at least one of the *L* markers is associated with the disease. This method is denoted “geno-LRT.” Model (1) does not directly capture the correlation structure of SNPs as haplotypes would. Although interactions between SNPs can be included, this generally reduces power to detect an association [Chapman et al., 2003; Balding, 2006]. Thus, we only model the main effects of markers in the method “geno-LRT.”
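As an illustration of the “geno-LRT” computation, the following minimal Python sketch maximizes the log-likelihood of equation (2) by Newton-Raphson and returns the deviance difference of equation (3). The function names (`solve`, `fit_logistic`, `geno_lrt`), the step-halving safeguard, and the toy data are assumptions made for this sketch and are not taken from the original analysis. A *p* value would follow by referring Δ*D* to a χ^{2} distribution with *L* degrees of freedom.

```python
import math

def solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        if abs(m[pivot][c]) < 1e-12:
            raise ValueError("singular matrix (collinear markers or separated data)")
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (m[r][n] - sum(m[r][k] * z[k] for k in range(r + 1, n))) / m[r][r]
    return z

def sigmoid(eta):
    """Overflow-safe inverse logit."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    t = math.exp(eta)
    return t / (1.0 + t)

def log_lik(beta, X, y):
    """Log-likelihood of equation (2); each row of X starts with 1 for the intercept."""
    total = 0.0
    for row, yi in zip(X, y):
        eta = sum(b * x for b, x in zip(beta, row))
        # log(1 + exp(eta)) written in an overflow-safe form
        total += yi * eta - (max(eta, 0.0) + math.log1p(math.exp(-abs(eta))))
    return total

def fit_logistic(X, y, max_iter=50, tol=1e-10):
    """Maximize equation (2) by Newton-Raphson with step halving (an assumed safeguard)."""
    k = len(X[0])
    beta = [0.0] * k
    current = log_lik(beta, X, y)
    for _ in range(max_iter):
        mu = [sigmoid(sum(b * x for b, x in zip(beta, row))) for row in X]
        grad = [sum((yi - m) * row[j] for row, yi, m in zip(X, y, mu)) for j in range(k)]
        hess = [[sum(m * (1.0 - m) * row[a] * row[b] for row, m in zip(X, mu))
                 for b in range(k)] for a in range(k)]
        step = solve(hess, grad)
        t = 1.0
        while True:
            candidate = [b + t * s for b, s in zip(beta, step)]
            new = log_lik(candidate, X, y)
            if new >= current or t < 1e-8:
                break
            t /= 2.0
        improved = new - current
        beta, current = candidate, new
        if improved < tol:
            break
    return beta, current

def geno_lrt(genotypes, y):
    """Return (delta_D, df) for H0: all marker coefficients are zero, as in equation (3)."""
    full = [[1.0] + [float(g) for g in row] for row in genotypes]
    null = [[1.0] for _ in genotypes]
    _, l1 = fit_logistic(full, y)
    _, l0 = fit_logistic(null, y)
    return 2.0 * (l1 - l0), len(genotypes[0])

# Hypothetical toy data: 8 subjects, L = 2 markers coded as allele counts.
genotypes = [[0, 1], [0, 0], [1, 2], [1, 0], [2, 1], [2, 2], [0, 2], [1, 1]]
status = [0, 0, 0, 1, 1, 1, 1, 0]
delta_d, df = geno_lrt(genotypes, status)
```

With real data a routine from an established statistics package would normally be preferred; the hand-written solver here only keeps the sketch free of external dependencies.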

Let ${g}_{i}^{l}$ and ${g}_{j}^{l}$ denote the genotypes of the *l*th locus for the *i*th and *j*th subjects, respectively. A similarity measure of genotypes is the average of identity-by-state (IBS) for the *L* loci, i.e.,

$${S}_{\mathit{ij}}^{G}=\frac{{\Sigma}_{l}s({g}_{i}^{l},{g}_{j}^{l})}{2L},$$

(4)

where $s({g}_{i}^{l},{g}_{j}^{l})$ is the IBS of the *i*th and *j*th subjects for the *l*th locus. Since the possible values for $s({g}_{i}^{l},{g}_{j}^{l})$ are 0, 1, or 2, ${S}_{\mathit{ij}}^{G}$ ranges from 0 to 1. We call the method based on equation (4) “geno-sim.”
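Equation (4) can be sketched in a few lines of Python. The sketch assumes genotypes are coded as allele counts (0, 1, or 2) and uses the usual identity-by-state score for unphased genotypes, 2 − |*g*_{i} − *g*_{j}|; the explicit form of *s* is not given in the text, so that formula and the function names are assumptions.

```python
def ibs(g_i, g_j):
    """IBS score at one locus for genotypes coded as allele counts 0, 1, or 2 (assumed form)."""
    return 2 - abs(g_i - g_j)

def geno_sim(x_i, x_j):
    """Average IBS similarity of equation (4); the result lies between 0 and 1."""
    if len(x_i) != len(x_j):
        raise ValueError("both subjects must be typed at the same L loci")
    n_loci = len(x_i)
    return sum(ibs(a, b) for a, b in zip(x_i, x_j)) / (2 * n_loci)

# Hypothetical subjects typed at L = 4 loci.
subject_i = [0, 1, 2, 1]
subject_j = [0, 2, 0, 1]
similarity = geno_sim(subject_i, subject_j)  # (2 + 1 + 0 + 2) / 8
```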

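Let *h*_{iu} = (*h*_{iu1}, *h*_{iu2}) denote the *u*th possible diplotype (pair of haplotypes) of the *i*th subject, and let *P*(*h*_{iu} | *g*_{i}) denote its posterior probability given the unphased genotypes *g*_{i}. A similarity measure for haplotypes is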

$${S}_{\mathit{ij}}^{{H}_{1}}=\frac{{\displaystyle \underset{u}{\Sigma}}{\displaystyle \underset{v}{\Sigma}}P({h}_{\mathit{iu}}\mid {g}_{i})P({h}_{\mathit{jv}}\mid {g}_{j})\times \mathrm{max}\left\{{\displaystyle \underset{l}{\Sigma}}s({h}_{\mathit{iu}1}^{l},{h}_{\mathit{jv}1}^{l})+s({h}_{\mathit{iu}2}^{l},{h}_{\mathit{jv}2}^{l}),{\displaystyle \underset{l}{\Sigma}}s({h}_{\mathit{iu}1}^{l},{h}_{\mathit{jv}2}^{l})+s({h}_{\mathit{iu}2}^{l},{h}_{\mathit{jv}1}^{l})\right\}}{2L},$$

(5)

where ${h}_{\mathit{iuc}}^{l}$ refers to the allele at position *l* on one of the two chromosomes (*c* = 1 or 2) for the *u*th possible diplotype of the *i*th subject. The score $s({h}_{\mathit{iu}1}^{l},{h}_{\mathit{jv}1}^{l})$ equals 1 if the alleles at the *l*th locus match for subjects *i* and *j*, for the first haplotype, given a particular diplotype, otherwise this score equals 0. Equation (5) is the expected haplotype-similarity over the posterior distribution of pairs of haplotypes, given the observed unphased genotypes. In this measure, max{} ensures that the similarity will not depend on the order of the two haplotypes in each possible haplotype pair.

Another measure that treats haplotypes as “super alleles” is

$${S}_{\mathit{ij}}^{{H}_{2}}=\frac{{\displaystyle \underset{u}{\Sigma}}{\displaystyle \underset{v}{\Sigma}}P({h}_{\mathit{iu}}\mid {g}_{i})P({h}_{\mathit{jv}}\mid {g}_{j})\mathrm{max}\{s({h}_{\mathit{iu}1},{h}_{\mathit{jv}1})+s({h}_{\mathit{iu}2},{h}_{\mathit{jv}2}),s({h}_{\mathit{iu}1},{h}_{\mathit{jv}2})+s({h}_{\mathit{iu}2},{h}_{\mathit{jv}1})\}}{2},$$

(6)

where *s*(*h*_{iu1}, *h*_{jv1}) equals 1 only when *all* alleles on the first chromosome for subject *i* are the same as those on the first chromosome for subject *j*, otherwise *s*(*h*_{iu1}, *h*_{jv1}) equals 0. With the appropriate standardization in the denominator, ${S}_{\mathit{ij}}^{{H}_{1}}$ and ${S}_{\mathit{ij}}^{{H}_{2}}$ range from 0 to 1. To implement the haplotype-scoring analysis based on the distance-based regression, we first used the function “haplo.em” in the package haplo.stats [Schaid et al., 2002] to infer haplotype phase by the EM algorithm. Then, the haplotype similarity between any two subjects can be calculated by equations (5) or (6). We call the method based on equation (5) “haplo-sim” and that based on equation (6) “haplo-match.” Equation (5) corresponds to the “counting measure” in Tzeng et al. [2003a], while equation (6) corresponds to their “matching measure.”
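The two haplotype-based measures can be sketched as follows. The sketch assumes haplotypes are stored as equal-length strings of allele labels and that the posterior diplotype probabilities (for example, from an EM phasing step) are supplied as lists of (diplotype, probability) pairs; these data structures and all function names are illustrative assumptions, not the interface of haplo.stats.

```python
def allele_matches(h_a, h_b):
    """Number of loci at which two haplotypes carry the same allele."""
    return sum(1 for a, b in zip(h_a, h_b) if a == b)

def haplo_sim(dip_i, dip_j):
    """Counting measure inside equation (5) for one pair of phase-known diplotypes."""
    (i1, i2), (j1, j2) = dip_i, dip_j
    same_order = allele_matches(i1, j1) + allele_matches(i2, j2)
    swapped = allele_matches(i1, j2) + allele_matches(i2, j1)
    # max{} removes the dependence on the arbitrary order of the two haplotypes
    return max(same_order, swapped) / (2 * len(i1))

def haplo_match(dip_i, dip_j):
    """Matching measure inside equation (6): whole haplotypes act as 'super alleles'."""
    (i1, i2), (j1, j2) = dip_i, dip_j
    same_order = (i1 == j1) + (i2 == j2)
    swapped = (i1 == j2) + (i2 == j1)
    return max(same_order, swapped) / 2

def expected_similarity(post_i, post_j, measure):
    """Average a measure over the posterior diplotype distributions of two subjects."""
    return sum(p_u * p_v * measure(d_u, d_v)
               for d_u, p_u in post_i
               for d_v, p_v in post_j)

# Hypothetical diplotypes over L = 4 diallelic loci.
dip_i = ("0101", "1100")
dip_j = ("1100", "0111")
counting = haplo_sim(dip_i, dip_j)    # best pairing matches 3 + 4 of 8 alleles
matching = haplo_match(dip_i, dip_j)  # one shared haplotype out of two
```

When phase is known with certainty, each posterior list contains a single diplotype with probability 1 and `expected_similarity` reduces to the plain measure.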

Tzeng et al. [2003a] have shown that the numerator of their haplotype similarity test statistic for the counting measure can be computed directly from unphased genotype data [see also Schaid, 2004a, for an intuitive explanation]. To simplify notation, we assume there is no phase ambiguity. When measuring the similarity of the *i*th and *j*th subjects, each with two haplotypes, the counting measure provided by Tzeng et al. [2003a] is

$$\underset{m=1}{\overset{2}{\Sigma}}\underset{n=1}{\overset{2}{\Sigma}}\underset{l=1}{\overset{L}{\Sigma}}s({h}_{\mathit{im}}^{l},{h}_{\mathit{jn}}^{l})=\underset{l=1}{\overset{L}{\Sigma}}\underset{m=1}{\overset{2}{\Sigma}}\underset{n=1}{\overset{2}{\Sigma}}s({h}_{\mathit{im}}^{l},{h}_{\mathit{jn}}^{l}).$$

This measure is the total count of allele matches between two subjects, which does not depend on the linkage phase and so can be computed directly from unphased genotype data. However, for “haplo-sim,” the numerator of ${S}_{\mathit{ij}}^{{H}_{1}}$ is

$$\mathrm{max}\left\{\underset{l=1}{\overset{L}{\Sigma}}s({h}_{i1}^{l},{h}_{j1}^{l})+s({h}_{i2}^{l},{h}_{j2}^{l}),\underset{l=1}{\overset{L}{\Sigma}}s({h}_{i1}^{l},{h}_{j2}^{l})+s({h}_{i2}^{l},{h}_{j1}^{l})\right\},$$

which depends on the linkage phase.
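The contrast between the two numerators can be checked on a small constructed example. In the sketch below, subject *i* is a double heterozygote whose two possible phasings give the same unphased genotype; the total count of allele matches is the same for both phasings, while the “haplo-sim” numerator is not. The haplotype strings and function names are hypothetical.

```python
def allele_matches(h_a, h_b):
    """Number of loci at which two haplotypes carry the same allele."""
    return sum(1 for a, b in zip(h_a, h_b) if a == b)

def total_match_count(dip_i, dip_j):
    """Sum of allele matches over all four haplotype pairs (counting measure of Tzeng et al.)."""
    return sum(allele_matches(h_m, h_n) for h_m in dip_i for h_n in dip_j)

def haplo_sim_numerator(dip_i, dip_j):
    """Numerator of the 'haplo-sim' measure: best one-to-one pairing of haplotypes."""
    (i1, i2), (j1, j2) = dip_i, dip_j
    return max(allele_matches(i1, j1) + allele_matches(i2, j2),
               allele_matches(i1, j2) + allele_matches(i2, j1))

# Two phasings of the same unphased genotype (heterozygous at both loci).
phase_a = ("00", "11")
phase_b = ("01", "10")
other = ("00", "11")

counts = (total_match_count(phase_a, other), total_match_count(phase_b, other))
numerators = (haplo_sim_numerator(phase_a, other), haplo_sim_numerator(phase_b, other))
```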

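If we have *N* subjects, a similarity matrix, **S** = [*S*_{ij}]_{N×N}, can be formed from the pairwise similarities. The genomic distance-based regression [Wessel and Schork, 2006] then proceeds as follows.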

Genetic part:

- (1) Calculate a distance matrix for all pairs of *N* subjects by **D** = [*D*_{ij}]_{N×N} = [1 − *S*_{ij}]_{N×N}, with elements *D*_{ij} ranging from 0 to 1.
- (2) Compute $\mathit{A}={\left[{A}_{\mathit{ij}}\right]}_{N\times N}={[-\frac{1}{2}{D}_{\mathit{ij}}^{2}]}_{N\times N}$.
- (3) Center **A** according to ${\mathit{G}}_{N\times N}=(\mathit{I}-\frac{1}{N}\mathbf{1}{\mathbf{1}}^{\prime})\mathit{A}(\mathit{I}-\frac{1}{N}\mathbf{1}{\mathbf{1}}^{\prime})$, where **I** is the identity matrix and **1** is a vector with all elements 1 [see Wessel and Schork, 2006 for more details].

Phenotype part:

- (4) Denote the *N*-vector **y** with elements “−1” for controls and “1” for cases.
- (5) Compute the projection matrix **H**_{N×N} = **y**(**y**′**y**)^{−1}**y**′. Note that **H** is a similarity matrix for the phenotype.

The pseudo-*F* statistic:

- (6) Compute the pseudo-*F* statistic by

$$F=\frac{\mathrm{tr}\left(\mathit{HGH}\right)}{\mathrm{tr}\left[(\mathit{I}-\mathit{H})\mathit{G}(\mathit{I}-\mathit{H})\right]},$$

(7)

where tr() denotes the trace of a matrix. This statistic follows the conceptual framework of multivariate analysis of variance (MANOVA) [Anderson, 2001]. Because the distribution of *F* is unknown, permutations of traits are used to obtain empirical *p* values. Note that coding 0 for controls is not appropriate for step 4 because that would obscure the phenotype similarity with the multiplication in step 5. With similarity measured between any two subjects and *p* values obtained by permutations, the computational burden for this method is heavier than other approaches. However, this method is attractive because it flexibly allows various similarity measures, as well as simultaneous analysis of multiple phenotypes, either binary or continuous traits, by use of the projection matrix in step 5. This approach differs from those proposed by Tzeng et al. [2003a,b] and Schaid et al. [2005], which measure the difference of within-group similarities among haplotypes or genotypes between cases and controls. Because Wessel and Schork's [2006] approach uses both the between-group distance and the within-group distances, there is no blind area.
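Steps 1 to 6 and the permutation test can be sketched in plain Python as follows. The sketch uses two identities that hold because **H** is a symmetric idempotent projection: tr(**HGH**) = **y**′**Gy**/(**y**′**y**) and tr[(**I** − **H**)**G**(**I** − **H**)] = tr(**G**) − tr(**HGH**), which avoids forming **H** explicitly. The function names, the toy genotypes, and the (hits + 1)/(permutations + 1) convention for the empirical *p* value are assumptions of this sketch.

```python
import random

def pseudo_f(S, y):
    """Pseudo-F of equation (7) for an N x N similarity matrix S and labels y coded -1/1."""
    n = len(y)
    # Steps 1-2: D_ij = 1 - S_ij and A_ij = -D_ij^2 / 2.
    A = [[-0.5 * (1.0 - S[i][j]) ** 2 for j in range(n)] for i in range(n)]
    # Step 3: double-centre A to obtain G = (I - 11'/N) A (I - 11'/N).
    row = [sum(A[i]) / n for i in range(n)]
    col = [sum(A[i][j] for i in range(n)) / n for j in range(n)]
    grand = sum(row) / n
    G = [[A[i][j] - row[i] - col[j] + grand for j in range(n)] for i in range(n)]
    # Steps 4-6: traces computed through the projection identities.
    y_g_y = sum(y[i] * G[i][j] * y[j] for i in range(n) for j in range(n))
    explained = y_g_y / sum(v * v for v in y)        # tr(HGH)
    total = sum(G[i][i] for i in range(n))           # tr(G)
    return explained / (total - explained)

def permutation_p_value(S, y, n_perm=1000, seed=1):
    """Empirical p value obtained by permuting the trait labels."""
    rng = random.Random(seed)
    observed = pseudo_f(S, y)
    labels = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if pseudo_f(S, labels) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical toy data: 3 cases and 3 controls typed at 3 loci (allele counts).
genotypes = [[2, 2, 1], [2, 1, 2], [2, 2, 2], [0, 0, 1], [0, 1, 0], [0, 0, 0]]
y = [1, 1, 1, -1, -1, -1]
S = [[sum(2 - abs(a - b) for a, b in zip(g_i, g_j)) / (2 * len(g_i))
      for g_j in genotypes] for g_i in genotypes]
f_obs = pseudo_f(S, y)
p_val = permutation_p_value(S, y, n_perm=200)
```

For larger samples, **G** would be computed once and reused across permutations, since only **y** changes.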

When applying the genomic distance-based regression to a balanced case-control study (equal numbers of cases and controls), the pseudo-*F* statistic can be simplified to

$$F=\frac{{\mathit{p}}^{\prime}{\Pi}_{{D}^{2}}\mathit{q}}{{\mathit{p}}^{\prime}{\Pi}_{{D}^{2}}\mathit{p}+{\mathit{q}}^{\prime}{\Pi}_{{D}^{2}}\mathit{q}}-\frac{1}{2},$$

(8)

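where **p** and **q** are the vectors of genotype (or diplotype) frequencies in cases and controls, respectively, and Π_{D^{2}} is the matrix of squared distances between the distinct genotypes (or diplotypes).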

When computing genotype distances (ignoring phase) for *L* diallelic markers, **p** and

When computing distances between diplotypes (pairs of haplotypes), suppose that there are *n*_{H} distinct haplotypes. Then,

Computing the pseudo-*F* statistic with equation (8) can be time consuming when many markers or haplotypes are involved, because the size of the matrix Π_{D^{2}} can be quite large. While the computational intensity of equation (8) increases with the number of markers or the number of distinct haplotypes, that of equation (7) increases with the number of subjects. From equation (8), the pseudo-*F* statistic is determined by the ratio of the between-group distance to the total within-group distance. Under the null hypothesis, *H*_{0} : **p** = **q**, the between-group distance equals each of the within-group distances, so that *F* = 0. Equation (8) can be rewritten as

$$F=\frac{2{\mathit{p}}^{\prime}{\Pi}_{{D}^{2}}\mathit{q}-({\mathit{p}}^{\prime}{\Pi}_{{D}^{2}}\mathit{p}+{\mathit{q}}^{\prime}{\Pi}_{{D}^{2}}\mathit{q})}{2({\mathit{p}}^{\prime}{\Pi}_{{D}^{2}}\mathit{p}+{\mathit{q}}^{\prime}{\Pi}_{{D}^{2}}\mathit{q})},$$

(9)

with the numerator illustrating the contrast of between-group distance with the total within-group distance. The statistic proposed by Sha et al. [2007] can be expressed as

$$\begin{array}{cc}\hfill {U}_{1}& ={\mathit{p}}^{\prime}{\Pi}_{S}\mathit{p}+{\mathit{q}}^{\prime}{\Pi}_{S}\mathit{q}-2{\mathit{p}}^{\prime}{\Pi}_{S}\mathit{q}\hfill \\ \hfill & =(1-{\mathit{p}}^{\prime}{\Pi}_{D}\mathit{p})+(1-{\mathit{q}}^{\prime}{\Pi}_{D}\mathit{q})-2(1-{\mathit{p}}^{\prime}{\Pi}_{D}\mathit{q})\hfill \\ \hfill & =2{\mathit{p}}^{\prime}{\Pi}_{D}\mathit{q}-({\mathit{p}}^{\prime}{\Pi}_{D}\mathit{p}+{\mathit{q}}^{\prime}{\Pi}_{D}\mathit{q}),\hfill \end{array}$$

(10)

which is similar to the numerator of the pseudo-*F* statistic.
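The agreement between the trace form of equation (7) and the simplified form of equation (8) can be checked numerically for a balanced sample. The sketch below assumes that **p** and **q** are the empirical frequency vectors of the distinct genotypes in cases and controls, so that **p**′Π_{D^{2}}**q** is the mean squared distance over all case-control pairs and **p**′Π_{D^{2}}**p** is the mean over all ordered case-case pairs (including each subject with itself). The function names and toy data are hypothetical.

```python
def pseudo_f_trace(D2, y):
    """Equation (7), computed from a matrix of squared distances D2 and labels -1/1."""
    n = len(y)
    A = [[-0.5 * D2[i][j] for j in range(n)] for i in range(n)]
    row = [sum(A[i]) / n for i in range(n)]
    col = [sum(A[i][j] for i in range(n)) / n for j in range(n)]
    grand = sum(row) / n
    G = [[A[i][j] - row[i] - col[j] + grand for j in range(n)] for i in range(n)]
    explained = sum(y[i] * G[i][j] * y[j] for i in range(n) for j in range(n)) / sum(v * v for v in y)
    return explained / (sum(G[i][i] for i in range(n)) - explained)

def pseudo_f_balanced(D2, y):
    """Equation (8): between-group over total within-group mean squared distance, minus 1/2."""
    cases = [i for i, v in enumerate(y) if v == 1]
    controls = [i for i, v in enumerate(y) if v == -1]
    if len(cases) != len(controls):
        raise ValueError("equation (8) assumes equal numbers of cases and controls")

    def mean_d2(group_a, group_b):
        return sum(D2[i][j] for i in group_a for j in group_b) / (len(group_a) * len(group_b))

    between = mean_d2(cases, controls)
    within = mean_d2(cases, cases) + mean_d2(controls, controls)
    return between / within - 0.5

# Hypothetical toy data: 3 cases and 3 controls typed at 3 loci (allele counts).
genotypes = [[2, 2, 1], [2, 1, 2], [2, 2, 2], [0, 0, 1], [0, 1, 0], [0, 0, 0]]
y = [1, 1, 1, -1, -1, -1]
D2 = [[(sum(abs(a - b) for a, b in zip(g_i, g_j)) / (2 * len(g_i))) ** 2
       for g_j in genotypes] for g_i in genotypes]
f_trace = pseudo_f_trace(D2, y)
f_simple = pseudo_f_balanced(D2, y)
```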

Canonical correlation can be used to measure the correlation between genotypes and multiple phenotypes. We applied it to case-control data because of its simplicity and ability to analyze multiple phenotypes. Conceptually, canonical correlation creates two new variables for two sets of variables such that the correlation of the two new variables is maximized. Let the genetic set, **x** = (*x*_{1}, …, *x*_{L})′, contain the allele counts at the *L* markers, and let *y* be the phenotype. The sample covariance matrix of *y* and **x** is partitioned as

$$\Sigma =\left[\begin{array}{cc}\hfill {\Sigma}_{\mathit{yy}}\hfill & \hfill {\Sigma}_{\mathit{yx}}^{\prime}\hfill \\ \hfill {\Sigma}_{\mathit{yx}}\hfill & \hfill {\Sigma}_{\mathit{xx}}\hfill \end{array}\right],$$

(11)

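where ${\Sigma}_{\mathit{yx}}^{\prime}=({\widehat{\sigma}}_{{\mathit{yx}}_{1}},\dots ,{\widehat{\sigma}}_{{\mathit{yx}}_{L}})$ contains the sample covariances of **y** with *x*_{1}, …, *x*_{L}, and Σ_{xx} is the sample covariance matrix of **x**. For a single phenotype, the squared multiple correlation between *y* and **x** is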

$${R}^{2}=\frac{{\Sigma}_{\mathit{yx}}^{\prime}{\Sigma}_{\mathit{xx}}^{-1}{\Sigma}_{\mathit{yx}}}{{\widehat{\sigma}}_{y}^{2}},$$

(12)

where ${\widehat{\sigma}}_{y}^{2}={\Sigma}_{\mathit{yy}}$ is the sample variance of **y**. The multiple correlation

$${R}^{2}=\mathrm{det}\left({\Sigma}_{\mathit{yy}}^{-1}{\Sigma}_{\mathit{yx}}^{\prime}{\Sigma}_{\mathit{xx}}^{-1}{\Sigma}_{\mathit{yx}}\right),$$

(13)

where det() denotes the determinant of a matrix. This provides a simultaneous test of association for more than one phenotype, and it is implemented by the function “cancor” in R. A permutation *p* value is used to test whether the *L* markers are associated with the phenotype. An advantage of canonical correlation is that the computational speed is much faster than that for the genomic distance-based regression because comparisons between all pairs of subjects are not required. This method is denoted “can-cor.”
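For a single phenotype, equation (12) can be evaluated directly from sample covariances. The following Python sketch is not the R function “cancor”; it is a stand-alone illustration in which the function names, the small Gaussian-elimination solver, and the toy data are all assumptions. A permutation *p* value could be obtained by recomputing *R*^{2} after shuffling the phenotype.

```python
def solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        if abs(m[pivot][c]) < 1e-12:
            raise ValueError("singular covariance matrix (collinear markers)")
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (m[r][n] - sum(m[r][k] * z[k] for k in range(r + 1, n))) / m[r][r]
    return z

def sample_cov(u, v):
    """Unbiased sample covariance of two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    return sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v)) / (n - 1)

def r_squared(X, y):
    """Equation (12): X is an N x L matrix of allele counts, y a length-N phenotype."""
    n_markers = len(X[0])
    columns = [[row[l] for row in X] for l in range(n_markers)]
    s_yx = [sample_cov(y, c) for c in columns]
    s_xx = [[sample_cov(a, b) for b in columns] for a in columns]
    weights = solve(s_xx, s_yx)                     # Sigma_xx^{-1} Sigma_yx
    return sum(a * b for a, b in zip(s_yx, weights)) / sample_cov(y, y)

# Hypothetical single-marker example: R^2 equals the squared Pearson correlation.
r2_single = r_squared([[0], [1], [2], [1]], [0, 0, 1, 1])
```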

Our simulation compared the five methods described in the previous section (denoted “geno-LRT,” “geno-sim,” “haplo-sim,” “haplo-match,” and “can-cor”), as well as “haplo-score,” a global score test for haplotypes, and “haplo-max,” the maximum score statistic over all haplotype scores. The last two tests were proposed by Schaid et al. [2002] and provided by their package haplo.stats. When computing the haplotype-scoring distance-based statistics (“haplo-sim” and “haplo-match”), we first inferred haplotype phases by the EM algorithm. All the haplotypes with frequencies less than 0.01, a cutoff value suggested by Sha et al. [2007], were considered to be rare haplotypes and were merged with their most similar common haplotype. When more than one common haplotype had the same counting similarity with a rare haplotype, the rare haplotype was merged with the common haplotype with the highest frequency. When computing the score statistics for haplotypes (“haplo-score” and “haplo-max”), only haplotypes with frequencies larger than 0.01 were scored. We evaluated these seven methods in the scenario of one causal locus simulated on genotypes from the HapMap data. Power comparisons were based on 1,000 repetitions; in each repetition, *p* values were calculated by 1,000 permutations.
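The rare-haplotype merging rule can be sketched as follows. The sketch assumes haplotypes are equal-length strings, uses the count of matching alleles as the counting similarity, and breaks ties with the frequency of the common haplotype before any merging; the function names and example frequencies are hypothetical.

```python
def allele_matches(h_a, h_b):
    """Number of loci at which two haplotypes carry the same allele."""
    return sum(1 for a, b in zip(h_a, h_b) if a == b)

def merge_rare_haplotypes(freqs, cutoff=0.01):
    """Merge haplotypes with frequency below the cutoff into their most similar common haplotype.

    freqs maps haplotype -> estimated frequency. Returns (merged_freqs, mapping),
    where mapping sends every original haplotype to the haplotype that represents it.
    """
    common = {h: f for h, f in freqs.items() if f >= cutoff}
    if not common:
        raise ValueError("no haplotype reaches the frequency cutoff")
    merged = dict(common)
    mapping = {h: h for h in common}
    for rare, f in freqs.items():
        if rare in common:
            continue
        # Highest counting similarity first; ties go to the most frequent common haplotype.
        target = max(common, key=lambda c: (allele_matches(rare, c), common[c]))
        merged[target] += f
        mapping[rare] = target
    return merged, mapping

# Hypothetical haplotype frequencies over 4 SNPs.
freqs = {"0000": 0.50, "1111": 0.30, "0011": 0.192, "0001": 0.005, "1110": 0.003}
merged, mapping = merge_rare_haplotypes(freqs)
```

In this example “0001” is equally similar to “0000” and “0011” and is therefore assigned to the more frequent “0000”.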

To study association methods under real human LD structure, we downloaded SNP genotype data on a chromosomal region from the HapMap website (www.hapmap.org). We used data from CEU - CEPH (Utah residents with ancestry from northern and western Europe), and kept the genotypes of 60 unrelated subjects—parents in the 30 trios comprising the CEPH data set. A total of 25 SNPs on chromosome 17 were selected, chosen to have minor allele frequency >5% and without missing genotypes, spanning 68.2 kb (40,248,321–40,316,535), yielding an average (median) distance between SNPs of 2.84 kb (1.65 kb). These 25 SNPs were highly correlated, with an average *D*′ of 0.97 for all adjacent loci. Following North et al. [2006], to model the effects of the susceptibility locus for a complex disease, we assigned the probabilities of being affected, conditional on possessing 0, 1, or 2 copies of the causal allele, as 0.029, 0.076, and 0.214, respectively. These penetrances mimic Alzheimer's disease for the *APOE-4* genotype [Kuusisto et al., 1994]. The odds ratios for genotypes *Aa* and *AA*, relative to genotype *aa*, were 2.75 and 9.12, respectively. To simulate one causal locus, we let the rarer allele be the causal allele and we let each SNP locus be the disease susceptibility locus, with the remaining 24 SNPs serving as markers. The causal SNP was assumed not to be genotyped and so was not contained in the analysis. According to the genotype frequencies for each SNP, the prevalence of the disease in the population would vary from 3.84 to 7.87%, with an average of 5.35%. The total sample size was set at 100 subjects, of which half were cases and half were controls. In each repetition, 50 cases and 50 controls were sampled with replacement from the CEPH data composed of 60 unrelated subjects, and the disease status was generated according to the genotypes of the causal SNP and the disease model.
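The disease model can be checked with a short calculation: the stated penetrances reproduce the reported odds ratios, and the prevalence is the penetrance averaged over the genotype frequencies of the causal SNP. The sampling function in the sketch is only one plausible reading of the scheme (draw subjects with replacement, generate disease status, and keep a draw only while the corresponding quota is open); it and all names used are assumptions.

```python
import random

# P(affected | number of copies of the causal allele), as given in the text.
PENETRANCE = {0: 0.029, 1: 0.076, 2: 0.214}

def odds(p):
    return p / (1.0 - p)

def odds_ratio(copies):
    """Odds ratio relative to carrying no copy of the causal allele."""
    return odds(PENETRANCE[copies]) / odds(PENETRANCE[0])

def prevalence(genotype_freqs):
    """Population prevalence for given frequencies of the 0-, 1-, and 2-copy genotypes."""
    return sum(genotype_freqs[g] * PENETRANCE[g] for g in (0, 1, 2))

def sample_case_control(causal_genotypes, n_cases, n_controls, rng):
    """Resample subject indices with replacement until both quotas are filled (assumed scheme)."""
    cases, controls = [], []
    while len(cases) < n_cases or len(controls) < n_controls:
        subject = rng.randrange(len(causal_genotypes))
        affected = rng.random() < PENETRANCE[causal_genotypes[subject]]
        if affected and len(cases) < n_cases:
            cases.append(subject)
        elif not affected and len(controls) < n_controls:
            controls.append(subject)
    return cases, controls

or_het = round(odds_ratio(1), 2)   # genotype Aa versus aa
or_hom = round(odds_ratio(2), 2)   # genotype AA versus aa

# Hypothetical pool of 60 subjects with causal-allele counts 0, 1, or 2.
pool = [0] * 30 + [1] * 25 + [2] * 5
cases, controls = sample_case_control(pool, 5, 5, random.Random(2009))
```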

We further extended the above simulation scenario to fewer markers, with more HapMap data on different chromosomes. For chromosome 17, we selected eight SNPs from 25 SNPs, by the clustering method [Tzeng et al., 2003a]. With the clustering method, six common haplotypes were retained, and the other five rare haplotypes were clustered into one of the six categories through a one-step or two-step mutation. The retained haplotypes were constructed by eight SNPs. In addition, we randomly selected regions along chromosomes, to collect more genotype data. The background information of these chromosomal regions is listed in Table I. Every studied chromosomal region spanned within 100 kb, and we list the average pairwise LD for all loci within regions in Table I. We did not study the power under the situation of weak LD between markers, because multilocus association analyses would not be preferred in that situation. For diplotype data, haplotypes are treated as partially missing when haplotype frequencies are estimated among unrelated subjects. This ambiguity can increase the variance of the estimated haplotype frequencies, reducing the statistical efficiency especially when LD is weak [Schaid, 2002]. Furthermore, with weak LD, there are many more distinct haplotypes, leading to weak power of haplotype methods.

We evaluated the Type-I error rate of each method under the 89 simulation scenarios described above, but now generated case-control status independent of genotype. That is, we allowed the SNP allele frequencies to vary, as well as the LD structure, according to that in the CEPH data. The total sample size was set at 100 subjects, of which half were cases and half were controls. Simulation results were based on 1,000 repetitions; in each repetition, *p* values were calculated by 1,000 permutations. We also allowed a reasonable range of genotyping errors in the data, described later in the section “Power Comparisons in the Presence of Genotyping Errors.” The Type-I error rates of each method under three nominal significance levels are presented in Table II. All seven methods were quite conservative with our sampling from the small CEPH diplotype pool. This conservativeness is likely because the CEPH is a small pool to sample from, causing many duplicates in the case-control samples. These duplicates cause ties in the resulting test statistics, leading to conservative test results. Indeed, when we simulated from a larger haplotype pool using the coalescent-based program ms [Hudson, 2002], in most situations, the over-conservativeness faded away and the Type-I error rates were close to the nominal significance levels.

Figure 1(a) presents the overall power performance of the seven methods, showing that the best method was “haplo-match.” The methods “geno-sim” and “haplo-sim” have similar power, because the diplotype similarity based on the counting measure largely depends on genotypes. These two methods can be viewed as a group. Another group includes the methods “geno-LRT” and “can-cor.” They are asymptotically equivalent, and are expected to have similar power under large sample sizes. We discuss the relative power of the seven association methods with respect to five properties: (1) the marker informativity; (2) the number of markers; (3) the causal allele frequency; (4) the preponderance of the most common high-risk haplotype; (5) the LD pattern between the causal SNP and its flanking markers.

Power according to five properties. The *x*-axis gives a number for each method and the *y*-axis is the average power under the significance level 0.05. (a) The overall power performance—the average power over all 89 scenarios. (b) The power performance **...**

Figure 1(b) presents the power performance stratified by the marker informativity, showing notable power loss of “geno-sim” and “haplo-sim” due to low marker informativity. Here we used the average of minor allele frequencies of markers as an index for marker informativity, and a threshold of 0.215 was set to classify the marker informativity into two categories. Note that the threshold is not an absolute criterion for high or low marker informativity; we chose it for convenience of explanation. Power of “geno-sim” and “haplo-sim” relies on higher marker informativity. To obtain better power with these two methods, intermediate marker allele frequencies (close to 0.5, representing higher marker informativity) are required, a result similar to that reported by Klei and Roeder [2007]. The intuition is that subjects possessing common alleles are not easily distinguished, and because we use markers to detect the unobserved causal SNP, high marker informativity will help to distinguish subjects. In contrast, “haplo-match” does not suffer from such a great loss in power due to low marker informativity. The haplotypes constructed by SNPs can serve as many alleles on a highly informative marker, and “haplo-match” is like “geno-sim” working on an informative marker. Thus, “haplo-match” is more powerful than other distance-based approaches, especially when the marker informativity is not high.
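The informativity index described above is simple to compute from a genotype matrix. In the sketch below, genotypes are assumed to be coded as allele counts, and a mean minor allele frequency equal to the threshold is placed in the “High” group; the handling of that boundary case and the function names are assumptions.

```python
def minor_allele_frequency(marker_genotypes):
    """Minor allele frequency at one marker from allele counts (0, 1, or 2) per subject."""
    freq = sum(marker_genotypes) / (2 * len(marker_genotypes))
    return min(freq, 1.0 - freq)

def marker_informativity(genotype_matrix, threshold=0.215):
    """Mean minor allele frequency over markers and its High/Low classification."""
    n_markers = len(genotype_matrix[0])
    mafs = [minor_allele_frequency([row[l] for row in genotype_matrix])
            for l in range(n_markers)]
    mean_maf = sum(mafs) / n_markers
    return mean_maf, ("High" if mean_maf >= threshold else "Low")

# Hypothetical data: 4 subjects typed at 2 markers (minor allele frequencies 0.25 and 0.5).
index, group = marker_informativity([[0, 1], [1, 2], [0, 1], [1, 0]])
```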

The number of markers that should be considered simultaneously in multilocus association analyses remains a difficult balance between degrees of freedom and power. As illustrated in Figure 1(c), no general power trend can be deciphered for this factor. Because of the dependence between the marker informativity and the number of markers, we examined the power performance stratified by the marker informativity and the number of markers. Due to coincidence, all the 25 scenarios using 24 markers were classified into the “High” group for the marker informativity. Figure 2(a) shows that locus-based logistic regression (i.e., “geno-LRT” and “can-cor”) was relatively less powerful when many markers were included in the analyses, because more markers diluted the association signal and this method does not directly capture the haplotype LD structure of the SNPs.

The causal allele frequency and the preponderance of the most common high-risk haplotype are unknown when conducting association analyses, so studying their impact on the power performance might be limiting. Nonetheless, it is worthwhile to know the relative power of different methods. Figure 1(d) presents the power performance for different levels of the causal allele frequencies, which have been classified into three groups: Rare [0.05, 0.15); Intermediate [0.15, 0.25); Common [0.25, 0.5). For all levels of causal allele frequencies, “haplo-match” was the most powerful method. When the causal allele was rare, “haplo-max” had comparable power with “haplo-match.” Because lower causal allele frequency was confounded with fewer high-risk haplotypes (when the causal allele was rarer, it tended to occur on one or few haplotypes), “haplo-max” was also powerful when the causal allele was rare. In general, the causal allele frequency did not seem to make a crucial difference to the relative power performances of these seven methods.

To aid the interpretation of our simulations, we created a haplotype preponderance index. Suppose that there are *H* high-risk haplotypes, *h*_{1}, *h*_{2}, …, *h*_{H} (from the most common to the least common), and the haplotype frequencies for them are

The correlation between the causal SNP and its neighboring markers plays an important role in multilocus association studies. We expect power loss due to low LD between the unobserved causal SNP and the neighboring markers, because low LD implies that the markers carry too little information for correct inference. To investigate this, we categorized the squared correlation coefficient (*r*^{2}) between the causal SNP and its adjacent markers into three groups. The LD of a simulation scenario was labeled as “High” if the causal SNP was in high correlation with at least one adjacent marker (*r*^{2}>0.6); “Moderate” if 0.15<*r*^{2} ≤ 0.6; “Low” if *r*^{2} ≤ 0.15. Figure 1(f) shows the power performance stratified by this pattern. When the causal SNP had high LD with at least one of its flanking markers, the three methods derived from the genomic distance-based regression attained higher power than the conventional methods. However, when LD was not high, the power loss was substantial for “geno-sim” and “haplo-sim” (i.e., the genomic distance-based regression with the counting measures). In contrast, the conventional methods and the genomic distance-based regression with the matching measure did not suffer from such a dramatic loss in power, though there was some power loss due to reduced information from markers.
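To make the LD grouping concrete, the following Python sketch computes *r*^{2} between two biallelic loci from phased haplotypes and applies the thresholds quoted above. It is a sketch under stated assumptions: the 0/1 allele coding, the function names, the toy haplotypes, and the use of the largest *r*^{2} over the flanking markers to decide between “Moderate” and “Low” are choices made for this example.

```python
def r_squared(haplotypes, i, j):
    """Squared correlation between loci i and j from phased 0/1 haplotypes."""
    n = len(haplotypes)
    p_i = sum(h[i] for h in haplotypes) / n
    p_j = sum(h[j] for h in haplotypes) / n
    p_ij = sum(h[i] * h[j] for h in haplotypes) / n
    d = p_ij - p_i * p_j  # LD coefficient D
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    return d * d / denom

def ld_category(haplotypes, causal, flanking):
    """Label LD using the largest r^2 between the causal SNP and its flanking markers.

    Using the maximum for the Moderate/Low split is an assumption of this sketch.
    """
    best = max(r_squared(haplotypes, causal, m) for m in flanking)
    if best > 0.6:
        return "High", best
    if best > 0.15:
        return "Moderate", best
    return "Low", best

# Hypothetical haplotypes; columns are (left marker, causal SNP, right marker).
haps = [
    (1, 1, 0), (1, 1, 1), (1, 1, 0), (0, 0, 1),
    (0, 0, 0), (0, 0, 1), (0, 0, 0), (0, 0, 1),
]
both_flanks = ld_category(haps, causal=1, flanking=[0, 2])
right_only = ld_category(haps, causal=1, flanking=[2])
print(both_flanks, right_only)
```

Here the left marker is a perfect proxy for the causal SNP, so the scenario is “High” when both flanking markers are available, whereas the right marker alone carries little information about the causal SNP.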

Our results for the influence of LD pattern on power of the genomic distance-based regression that used the counting measures were similar to those for a new regression-based multimarker test that uses haplotype similarity [Tzeng et al., 2007]. Tzeng et al. [2007] proposed a gene-trait similarity regression analytically united with the variance-component approaches [Tzeng and Zhang, 2007]; see similar derivations by Goeman et al. [2004]. Although Tzeng et al. [2007] reversed the roles of genetic similarity and trait similarity compared with the regression system of Wessel and Schork, the two methods should have similar power because both methods measure correlations between these two measures of similarity. Tzeng et al. [2007] observed that their gene-trait similarity regression with the counting measure can be more powerful than the standard regression [Schaid et al., 2002] when the causal SNP was tagged (with *r*^{2}>0.7 defining tagged) by at least one nearby marker, but suffered from a greater loss in power when the causal SNP was not tagged (*r*^{2} ≤ 0.7). In our work, we found that using the matching measure in the similarity regression can reduce the loss in power when the causal SNP was not tagged, probably because the whole haplotype provided greater information for the unobserved causal SNP.

For the properties of marker informativity and LD pattern, we observed that “geno-sim” and “haplo-sim” performed well when the marker informativity and the correlation between the causal SNP and markers were high. We further examined the power performance stratified by the marker informativity and the LD pattern. Figure 2(b) shows that both factors have a significant influence on the power of association methods, especially for “geno-sim” and “haplo-sim.”

Overall, our results suggest that “haplo-match” was a better method because it performed well under a variety of situations. The methods “geno-sim” and “haplo-sim” can be adopted if high marker informativity and a strong correlation between causal SNP and markers can be assured. The locus-based logistic regression, “geno-LRT” and “can-cor,” had relatively low power when many unassociated SNPs were involved in the analyses. The power of “haplo-max” was comparable to that of “haplo-match,” when the preponderance of the most common high-risk haplotype was high.

To evaluate which factors had the strongest independent effects on the power of each method, we regressed the simulated power of each method on six variables, and used a stringent level of significance, 0.01. Note that some factors with a *p* value >0.01 might still influence power, but we wanted to screen for the most influential factors. The parameter estimates of significant variables are listed in Table III, with *p* values listed in parentheses. We found that the causal allele frequency and the LD pattern between the causal SNP and its flanking markers played key roles in predicting the power of all seven methods. The small *p* values suggested their significant influences, and the positive parameter estimates suggested positive associations. These results are similar to those found in the stratified analyses. Higher causal allele frequency and stronger correlation between the causal SNP and markers increase the power of all methods. In addition to “geno-sim” and “haplo-sim,” the power of “geno-LRT,” “can-cor,” and “haplo-score” can also be improved by enhancing the marker informativity. On the other hand, most of the haplotype-scoring methods (“haplo-match” and “haplo-max”) are less influenced by low per-SNP marker informativity, because haplotypes constructed from SNPs can be viewed as the many alleles of a highly informative marker.

Regression of simulated power on six properties for each association statistic, with most significant factors listed (*p* value <0.01)

Because of many degrees of freedom, the locus-based logistic regression (“geno-LRT” and “can-cor”) and the global score test (“haplo-score”) suffered from power loss when 24 markers were included in the analyses. As a strategy to reduce the degrees of freedom, the similarity-based methods (“geno-sim,” “haplo-sim,” and “haplo-match”) were less vulnerable to more markers or more haplotypes. Finally, when one high-risk haplotype was most frequent, the maximum score statistic over all haplotype scores (“haplo-max”) had good power.

As discussed by Tzeng et al. [2003a] and Sha et al. [2007], the matching measure is not robust to genotyping errors, missing data, and recent marker mutation. We further compared the power of the seven methods in the presence of genotyping errors. The error model considered in our simulation was the asymmetric allele dropout model [Morris and Kaplan, 2004], where heterozygotes were misclassified twice as frequently as homozygotes. We considered error rates *γ*_{0→1} = 0.025, *γ*_{0→2} = 0, *γ*_{1→0} = 0.05, *γ*_{1→2} = 0.05, *γ*_{2→0} = 0, *γ*_{2→1} = 0.025, where *γ*_{i→j} was the conditional probability that true genotype *g _{i}* was identified as genotype
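For illustration, the error rates quoted above can be arranged as a 3 × 3 misclassification matrix and applied to a vector of true genotypes, as in the Python sketch below. The 0/1/2 genotype coding, the function names, and the random seed are assumptions of this example; the diagonal entries are simply one minus the off-diagonal rates in each row.

```python
import random

# Rows: true genotype (0, 1, 2); columns: observed genotype (0, 1, 2).
# Off-diagonal entries are the rates quoted in the text; each diagonal
# entry is the remaining probability so that the row sums to one.
ERROR_MATRIX = [
    [0.975, 0.025, 0.0],
    [0.05, 0.90, 0.05],
    [0.0, 0.025, 0.975],
]

def observe_genotype(true_genotype, rng):
    """Draw one observed genotype given the true genotype."""
    return rng.choices([0, 1, 2], weights=ERROR_MATRIX[true_genotype], k=1)[0]

def add_genotyping_errors(genotypes, seed=0):
    """Return a copy of `genotypes` with errors drawn independently per call."""
    rng = random.Random(seed)
    return [observe_genotype(g, rng) for g in genotypes]

# Hypothetical true genotypes for 1,000 subjects at one SNP.
true_genotypes = [0] * 500 + [2] * 500
observed = add_genotyping_errors(true_genotypes, seed=1)
n_errors = sum(t != o for t, o in zip(true_genotypes, observed))
print(n_errors)
```

Because *γ*_{0→2} and *γ*_{2→0} are zero, a homozygote can only be misread as a heterozygote in this sketch, never as the opposite homozygote.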

Similar to previous presentations, the power results in the presence of genotyping errors are shown in Table IV and Figures 3 and 4. The genotyping errors did not plague the method “haplo-match” very much, which remained a better method in the sense that it was more robust to the varying marker informativity and the low LD between the causal SNP and its neighboring markers. This suggested that the matching measure was still a desirable similarity measure regarding a reasonable range of genotyping errors. Among the seven methods, although the distance-based approach that used the counting measure (“geno-sim” and “haplo-sim”) seemed to be less vulnerable to genotyping errors, it suffered from a large drop in power when the marker informativity or the LD between the causal SNP and its neighboring markers was low.

Power according to marker informativity, number of markers, and the linkage disequilibrium (LD) pattern (in the presence of genotyping errors).

Genetic epidemiologists have struggled to identify genetic polymorphisms involved in the development of diseases. Many have collected data on large numbers of markers, but the most powerful analysis methods are not always obvious [Heidema et al., 2006]. For multilocus association studies, it is important to choose powerful and appropriate statistical methods that are designed to relate genotype or haplotype information to phenotypes of interest. The most powerful method for multilocus association studies, however, changes under different genetic architectures, such as the patterns of LD, and the number of causal loci located in a chromosomal region. There has been a considerable debate about whether one should use a locus-scoring approach or a haplotype-scoring approach [Chapman et al., 2003; Clayton et al., 2004; North et al., 2006; Humphreys and Iles, 2005; Bardel et al., 2006]. A locus-scoring approach is appealing because it does not require haplotype phase resolution. A haplotype-scoring approach can be more powerful in some cases, such as when several mutations within a single gene interact to create a “super allele” that has a large effect on a disease [Schaid et al., 2002]. The HapMap project [The International HapMap Consortium, 2005] characterizes patterns of LD in the human genome, and ideas like haplotype blocks [Cardon and Abecasis, 2003] might prove useful to map complex human trait loci. If all SNPs within a haplotype block are highly correlated among themselves, then any SNP within the block should capture sufficient information to fully interrogate a particular region of the genome, allowing one to economically detect an association, with large samples, possibly also allowing one to rule out the role of a region if no significant association is detected [Schaid 2004a,b]. 
Because haplotypes are difficult to directly measure over long stretches of DNA in diploid organisms, ambiguous phase can lead to a loss of statistical efficiency [Douglas et al., 2001; Schaid, 2002].

To evaluate previously proposed association methods under various genetic situations, we compared the power for three locus-scoring approaches (“geno-sim,” “geno-LRT,” “can-cor”) and four haplotype-scoring approaches (“haplo-sim,” “haplo-score,” “haplo-match,” “haplo-max”). The method “geno-LRT” is a likelihood ratio test for logistic regression that models the main effects of loci, while “can-cor” is equivalent to a score test for logistic regression. These two tests are asymptotically equivalent. The methods “geno-sim,” “haplo-sim,” and “haplo-match” are based on the distance-based regression [Wessel and Schork, 2006], for locus coding and two ways of haplotype coding, respectively. They are derived from an approach that involves similarity for pairs of subjects based on their diploid genotypes at multiple loci in the region of interest, and relates variation in a measure of genotype similarity to variation in a measure of trait similarity. The method “haplo-score” is a global score test that includes phase resolution by the EM algorithm, while “haplo-max” is the maximum score statistic among all haplotype-specific scores [Schaid et al., 2002].

We have shown that the distance-based regression can be included in the class of haplotype similarity tests [Yuan et al., 2006; Sha et al., 2007] when a specific similarity measure is used. In fact, Wessel and Schork's method is more general because it allows various similarity measures and multiple phenotypes. We evaluated the power of seven association methods under the scenario of one causal locus simulated on genotypes from the HapMap data [The International HapMap Consortium, 2005]. Based on our simulation results, the distance-based regression that uses the matching measure of diplotypes had better power under a variety of situations. The maximum score statistic over all haplotype scores can have comparable power; however, it suffered from power loss when there were several high-risk haplotypes of equal frequency.

With similarity measured between any two subjects and the need for permutation *p* values, the distance-based regression is computationally intensive. Because of this, it was difficult to study more similarity measures. Nonetheless, because this procedure depends critically on the choice of similarity measure, it is expected to have different power performance for other genetic architectures. Different from testing the association between disease and the haplotypes [Sha et al., 2007], the distance-based regression, using equations (5) or (6) as the similarity measure, tests the association between disease and the diplotypes. We have shown that the distance-based regression, using the similarity measure in equation (6), performs well under a variety of situations.
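The computational burden mentioned above comes from recomputing the statistic for every permutation of the case-control labels. The Python sketch below shows one common way to obtain such a permutation *p* value, using the balanced case-control form of the pseudo-*F* statistic derived in the Appendix (the sum of the between-group block of the association matrix divided by the sum of the two within-group blocks, minus one half). The function names, the toy one-dimensional data, the number of permutations, and the (hits + 1)/(permutations + 1) estimator are assumptions of this example and need not match the procedure used in our simulations.

```python
import random

def pseudo_f(a, cases, controls):
    """Balanced-design pseudo-F: sum(A12) / (sum(A11) + sum(A22)) - 1/2."""
    s11 = sum(a[i][j] for i in cases for j in cases)
    s22 = sum(a[i][j] for i in controls for j in controls)
    s12 = sum(a[i][j] for i in cases for j in controls)
    return s12 / (s11 + s22) - 0.5

def permutation_p_value(a, cases, controls, n_perm=999, seed=0):
    """Shuffle group labels and count statistics at least as large as observed."""
    rng = random.Random(seed)
    observed = pseudo_f(a, cases, controls)
    subjects = list(cases) + list(controls)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(subjects)
        perm_cases = subjects[:len(cases)]
        perm_controls = subjects[len(cases):]
        # Small tolerance so that ties are not lost to rounding error.
        if pseudo_f(a, perm_cases, perm_controls) >= observed - 1e-9:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical data: six subjects placed on a line, three per group.
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
# Association matrix A with entries -1/2 times the squared distance.
a = [[-0.5 * (p - q) ** 2 for q in points] for p in points]
p_value = permutation_p_value(a, cases=[0, 1, 2], controls=[3, 4, 5])
print(pseudo_f(a, [0, 1, 2], [3, 4, 5]), p_value)
```

Each permutation requires a pass over all pairs of subjects, so the cost grows with the square of the sample size times the number of permutations.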

Following Sha et al. [2007], we merged rare haplotypes with similar common haplotypes, which may provide the advantage of robustness to genotyping errors. According to previous studies [Tzeng et al., 2003a; Sha et al., 2007], the matching measure is not robust to genotyping errors, missing data, and recent marker mutation. We also evaluated the power of the seven methods under these situations. Our results showed that genotyping errors did compromise the power of the method “haplo-match,” but not much. Although “geno-sim” and “haplo-sim” were more robust to genotyping errors, they were less desirable than “haplo-match,” when the marker informativity or the correlation between the causal SNP and markers was low. Generally speaking, the power trends were quite similar to those for the situation of no genotyping errors.

The statistic of the distance-based regression is similar to that proposed by Sha et al. [2007]. The simulation results of Sha et al. [2007] showed that on average their tests are more powerful than the χ^{2} test and the tests proposed by Tzeng et al. [2003a], and that which test and similarity measure to use depends upon the nature of the markers. Our study further discussed the impact of marker properties on the power of several prevailing association methods. Results of Sha et al. [2007] also showed that the matching measure is better than other measures when there is only one high-risk haplotype. On the other hand, when there are several high-risk haplotypes, the counting measure is better. However, the average performances of the various similarity measures do not differ greatly. Our results showed that the key factors in choosing between the counting measure and the matching measure are the marker informativity and the correlation between the causal SNP and markers. The preponderance of the most common high-risk haplotype did not influence the power of the two similarity measures very much.

Finally, although canonical correlation can also measure the correlation between genotypes and multiple phenotypes, we did not compare the power between it and the distance-based regression for analyses of multiple phenotypes. Here, we focused on a single phenotype, disease status, for case-control studies. Exploring powerful association approaches for multiple phenotypes deserves further research.

We are grateful for the constructive comments from the anonymous reviewers that improved this work. We also thank the investigators and participants in the International HapMap Project for making the data available to the scientific community. This research was supported by the US Public Health Service, National Institutes of Health, contract grant number GM065450 (D.J.S.), and the *Graduate Students Visiting Abroad Scholarship* awarded by the National Science Council of Taiwan (W.-Y.L).

Contract grant sponsor: US Public Health Service; Contract grant sponsor: National Institutes of Health; Contract grant number: GM065450; Contract grant sponsor: National Science Council of Taiwan.

If there are *N* subjects, of which half are controls and half are cases, the numerator of the pseudo-*F* statistic is

$$\begin{array}{cc}\hfill & \mathbf{tr}\left(\mathit{HGH}\right)=\mathbf{tr}\left[\mathit{H}(\mathit{I}-\frac{1}{N}{11}^{\prime})\mathit{A}(\mathit{I}-\frac{1}{N}{11}^{\prime})\mathit{H}\right]=\mathbf{tr}\left[\mathit{HAH}\right]\hfill \\ \hfill & \phantom{\rule{thinmathspace}{0ex}}=\frac{1}{{N}^{2}}\mathbf{tr}\left[\left[\begin{array}{cc}\hfill +{11}^{\prime}\hfill & \hfill -{11}^{\prime}\hfill \\ \hfill -{11}^{\prime}\hfill & \hfill +{11}^{\prime}\hfill \end{array}\right]\left[\begin{array}{cc}\hfill {\mathit{A}}_{11}\hfill & \hfill {\mathit{A}}_{12}\hfill \\ \hfill {\mathit{A}}_{12}^{\prime}\hfill & \hfill {\mathit{A}}_{22}\hfill \end{array}\right]\left[\begin{array}{cc}\hfill +{11}^{\prime}\hfill & \hfill -{11}^{\prime}\hfill \\ \hfill -{11}^{\prime}\hfill & \hfill +{11}^{\prime}\hfill \end{array}\right]\right]\hfill \\ \hfill & \phantom{\rule{thinmathspace}{0ex}}=\frac{1}{{N}^{2}}\mathbf{tr}{\left[\begin{array}{cc}\hfill {11}^{\prime}{\mathit{A}}_{11}{11}^{\prime}+{11}^{\prime}{\mathit{A}}_{22}{11}^{\prime}-{11}^{\prime}{\mathit{A}}_{12}{11}^{\prime}-{11}^{\prime}{\mathit{A}}_{12}^{\prime}{11}^{\prime}\hfill & \hfill -{11}^{\prime}{\mathit{A}}_{11}{11}^{\prime}-{11}^{\prime}{\mathit{A}}_{22}{11}^{\prime}+{11}^{\prime}{\mathit{A}}_{12}{11}^{\prime}+{11}^{\prime}{\mathit{A}}_{12}^{\prime}{11}^{\prime}\hfill \\ \hfill -{11}^{\prime}{\mathit{A}}_{11}{11}^{\prime}-{11}^{\prime}{\mathit{A}}_{22}{11}^{\prime}+{11}^{\prime}{\mathit{A}}_{12}{11}^{\prime}+{11}^{\prime}{\mathit{A}}_{12}^{\prime}{11}^{\prime}\hfill & \hfill {11}^{\prime}{\mathit{A}}_{11}{11}^{\prime}+{11}^{\prime}{\mathit{A}}_{22}{11}^{\prime}-{11}^{\prime}{\mathit{A}}_{12}{11}^{\prime}-{11}^{\prime}{\mathit{A}}_{12}^{\prime}{11}^{\prime}\hfill \end{array}\right]}^{\prime}\hfill \\ \hfill & 
=\frac{2}{{N}^{2}}\mathbf{tr}\left[{11}^{\prime}{\mathit{A}}_{11}{11}^{\prime}+{11}^{\prime}{\mathit{A}}_{22}{11}^{\prime}-{11}^{\prime}{\mathit{A}}_{12}{11}^{\prime}-{11}^{\prime}{\mathit{A}}_{12}^{\prime}{11}^{\prime}\right]\hfill \\ \hfill & \phantom{\rule{thinmathspace}{0ex}}=\frac{1}{N}\left[\Sigma {\mathit{A}}_{11}+\Sigma {\mathit{A}}_{22}-2\Sigma {\mathit{A}}_{12}\right]\hfill \\ \hfill & \phantom{\rule{thinmathspace}{0ex}}=\frac{N}{4}\left[{\mathit{p}}^{\prime}{\Pi}_{A}\mathit{p}+{\mathit{q}}^{\prime}{\Pi}_{A}\mathit{q}-2{\mathit{p}}^{\prime}{\Pi}_{A}\mathit{q}\right],\hfill \end{array}$$

where **A** is the association matrix from step 2,

$$\begin{aligned}
&\mathbf{tr}\left[(I-H)G(I-H)\right] \\
&\quad= \mathbf{tr}\left[(I-H)\left(I-\frac{1}{N}\mathbf{1}\mathbf{1}^{\prime}\right)A\left(I-\frac{1}{N}\mathbf{1}\mathbf{1}^{\prime}\right)(I-H)\right] \\
&\quad= \mathbf{tr}\left[\begin{bmatrix} I-\frac{2}{N}\mathbf{1}\mathbf{1}^{\prime} & O \\ O & I-\frac{2}{N}\mathbf{1}\mathbf{1}^{\prime} \end{bmatrix}\begin{bmatrix} A_{11} & A_{12} \\ A_{12}^{\prime} & A_{22} \end{bmatrix}\begin{bmatrix} I-\frac{2}{N}\mathbf{1}\mathbf{1}^{\prime} & O \\ O & I-\frac{2}{N}\mathbf{1}\mathbf{1}^{\prime} \end{bmatrix}\right] \\
&\quad= \mathbf{tr}\begin{bmatrix} \left(I-\frac{2}{N}\mathbf{1}\mathbf{1}^{\prime}\right)A_{11}\left(I-\frac{2}{N}\mathbf{1}\mathbf{1}^{\prime}\right) & \left(I-\frac{2}{N}\mathbf{1}\mathbf{1}^{\prime}\right)A_{12}\left(I-\frac{2}{N}\mathbf{1}\mathbf{1}^{\prime}\right) \\ \left(I-\frac{2}{N}\mathbf{1}\mathbf{1}^{\prime}\right)A_{12}^{\prime}\left(I-\frac{2}{N}\mathbf{1}\mathbf{1}^{\prime}\right) & \left(I-\frac{2}{N}\mathbf{1}\mathbf{1}^{\prime}\right)A_{22}\left(I-\frac{2}{N}\mathbf{1}\mathbf{1}^{\prime}\right) \end{bmatrix} \\
&\quad= \mathbf{tr}\left[\left(I-\frac{2}{N}\mathbf{1}\mathbf{1}^{\prime}\right)A_{11}\left(I-\frac{2}{N}\mathbf{1}\mathbf{1}^{\prime}\right)\right]+\mathbf{tr}\left[\left(I-\frac{2}{N}\mathbf{1}\mathbf{1}^{\prime}\right)A_{22}\left(I-\frac{2}{N}\mathbf{1}\mathbf{1}^{\prime}\right)\right] \\
&\quad= -\frac{2}{N}\times\left[\Sigma A_{11}+\Sigma A_{22}\right] \\
&\quad= -\frac{N}{2}\left(p^{\prime}\Pi_{A}p+q^{\prime}\Pi_{A}q\right),
\end{aligned}$$

where **O** is a zero matrix with dimension (

$$\begin{aligned}
F &= \frac{\mathbf{tr}\left(HGH\right)}{\mathbf{tr}\left[\left(I-H\right)G\left(I-H\right)\right]} \\
&= \frac{\frac{1}{N}\left[\Sigma A_{11}+\Sigma A_{22}-2\Sigma A_{12}\right]}{-\frac{2}{N}\times\left[\Sigma A_{11}+\Sigma A_{22}\right]} \\
&= \frac{\Sigma A_{12}}{\Sigma A_{11}+\Sigma A_{22}}-\frac{1}{2} \\
&= \frac{p^{\prime}\Pi_{A}q}{p^{\prime}\Pi_{A}p+q^{\prime}\Pi_{A}q}-\frac{1}{2} \\
&= \frac{p^{\prime}\Pi_{D^{2}}q}{p^{\prime}\Pi_{D^{2}}p+q^{\prime}\Pi_{D^{2}}q}-\frac{1}{2},
\end{aligned}$$

where the (*i*,*j*)th element in Π_{A} is ${\left[{\Pi}_{A}\right]}_{\mathit{ij}}=-\frac{1}{2}{\left[{\Pi}_{D}\right]}_{\mathit{ij}}^{2}$, because of step 2
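As a numerical check of the algebra above, the following Python sketch computes the pseudo-*F* statistic in two ways for a small balanced sample: directly from the traces tr(*HGH*) and tr[(*I* − *H*)*G*(*I* − *H*)], and from the closed form in terms of the block sums of the association matrix. It is a minimal sketch under stated assumptions: the helper names and the four toy subjects placed on a line are hypothetical, the distances are ordinary Euclidean distances rather than a genomic distance, and the hat matrix is written for an intercept plus a single group indicator.

```python
def matmul(x, y):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

def trace(x):
    return sum(x[i][i] for i in range(len(x)))

def association_matrix(dist):
    """A with entries -1/2 times the squared distance."""
    n = len(dist)
    return [[-0.5 * dist[i][j] ** 2 for j in range(n)] for i in range(n)]

def pseudo_f_direct(dist, n_half):
    """tr(HGH) / tr((I-H)G(I-H)); the first n_half subjects form one group."""
    n = 2 * n_half
    a = association_matrix(dist)
    ident = [[float(i == j) for j in range(n)] for i in range(n)]
    center = [[ident[i][j] - 1.0 / n for j in range(n)] for i in range(n)]
    g = matmul(matmul(center, a), center)  # centered association matrix G
    # Hat matrix for an intercept plus a group indicator in a balanced design.
    h = [[(2.0 / n) if (i < n_half) == (j < n_half) else 0.0
          for j in range(n)] for i in range(n)]
    resid = [[ident[i][j] - h[i][j] for j in range(n)] for i in range(n)]
    num = trace(matmul(matmul(h, g), h))
    den = trace(matmul(matmul(resid, g), resid))
    return num / den

def pseudo_f_closed(dist, n_half):
    """Closed form: sum(A12) / (sum(A11) + sum(A22)) - 1/2."""
    n = 2 * n_half
    a = association_matrix(dist)
    g1, g2 = range(n_half), range(n_half, n)
    s11 = sum(a[i][j] for i in g1 for j in g1)
    s22 = sum(a[i][j] for i in g2 for j in g2)
    s12 = sum(a[i][j] for i in g1 for j in g2)
    return s12 / (s11 + s22) - 0.5

# Hypothetical subjects on a line: two in each group.
points = [0.0, 1.0, 3.0, 4.0]
dist = [[abs(p - q) for q in points] for p in points]
f_direct = pseudo_f_direct(dist, 2)
f_closed = pseudo_f_closed(dist, 2)
print(f_direct, f_closed)
```

For these toy points the between-group sum of squares is 9 and the within-group sum of squares is 1, so both routes should agree on a ratio of 9 up to rounding error.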

Following the logistic regression model in equation (1) and the log-likelihood shown in equation (2), the score vector for **β** is

$${\mathit{U}}_{L\times 1}={\left[\frac{\partial l}{\partial {\beta}_{j}}\right]}_{j=1,\dots ,L}=\underset{i=1}{\overset{N}{\Sigma}}{\mathit{x}}_{i}({y}_{i}-{\pi}_{i}),$$

(B1)

and its variance is

$${\mathit{V}}_{L\times L}=\underset{i=1}{\overset{N}{\Sigma}}({\mathit{x}}_{i}-\stackrel{-}{x}){({\mathit{x}}_{i}-\stackrel{-}{x})}^{\prime}{\pi}_{i}(1-{\pi}_{i}).$$

(B2)

The score statistic is then

$$T={\stackrel{~}{U}}^{\prime}{\stackrel{~}{V}}^{-1}\stackrel{~}{U},$$

(B3)

where $\stackrel{~}{U}={\Sigma}_{i=1}^{N}{\mathit{x}}_{i}({y}_{i}-\stackrel{-}{y})=(N-1){\Sigma}_{\mathit{yx}}$ and $\stackrel{~}{V}={\widehat{\sigma}}_{y}^{2}{\Sigma}_{i=1}^{N}({\mathit{x}}_{i}-\stackrel{-}{x}){({\mathit{x}}_{i}-\stackrel{-}{x})}^{\prime}=(N-1){\Sigma}_{\mathit{yy}}{\Sigma}_{\mathit{xx}}$ are evaluated under the null hypothesis. ${\stackrel{~}{V}}^{-1}$ is the generalized inverse of $\stackrel{~}{V}$. The covariance terms Σ_{yx}, Σ_{xx}, and ${\widehat{\sigma}}_{y}^{2}={\Sigma}_{\mathit{yy}}$ are defined in the subsection on canonical correlation. The statistic *T* has an approximate χ^{2} distribution with degrees of freedom equal to the rank of the matrix $\stackrel{~}{V}$, which is usually the number of markers. The statistic *T* can be expressed as

$$\begin{array}{cc}\hfill T& ={\stackrel{~}{U}}^{\prime}{\stackrel{~}{V}}^{-1}\stackrel{~}{U}\hfill \\ \hfill & \phantom{\rule{1em}{0ex}}\phantom{\rule{thickmathspace}{0ex}}=(N-1){\Sigma}_{\mathit{yx}}^{\prime}{\left[(N-1){\Sigma}_{\mathit{yy}}{\Sigma}_{\mathit{xx}}\right]}^{-1}(N-1){\Sigma}_{\mathit{yx}}\hfill \\ \hfill & \phantom{\rule{1em}{0ex}}\phantom{\rule{thickmathspace}{0ex}}=(N-1)\frac{{\Sigma}_{\mathit{yx}}^{\prime}{\Sigma}_{\mathit{xx}}^{-1}{\Sigma}_{\mathit{yx}}}{{\widehat{\sigma}}_{y}^{2}}=(N-1){R}^{2}.\hfill \end{array}$$

(B4)

Thus, the score statistic derived from logistic regression is equivalent to the canonical correlation in equation (12), and is asymptotically equivalent to the likelihood ratio test statistic for logistic regression.
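The identity in equation (B4) can be checked numerically in the simplest setting of a single marker, where the variance term is a scalar and *R*^{2} reduces to the squared Pearson correlation between genotype and disease status. The Python sketch below is offered only as an illustration: the single-marker restriction, the function names, and the toy genotype and status vectors are assumptions of this example.

```python
def score_statistic_single_marker(x, y):
    """T = U^2 / V with U and V evaluated under the null, for one marker."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    u = sum(xi * (yi - y_bar) for xi, yi in zip(x, y))
    # Sample variance of y with an (N - 1) denominator, matching Sigma_yy.
    var_y = sum((yi - y_bar) ** 2 for yi in y) / (n - 1)
    v = var_y * sum((xi - x_bar) ** 2 for xi in x)
    return u * u / v

def n_minus_one_r_squared(x, y):
    """(N - 1) times the squared Pearson correlation between x and y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return (n - 1) * sxy * sxy / (sxx * syy)

# Hypothetical data: genotypes coded 0/1/2 and disease status coded 0/1.
genotypes = [0, 1, 2, 1, 0, 0, 1, 0]
status = [1, 1, 1, 1, 0, 0, 0, 0]
t_score = score_statistic_single_marker(genotypes, status)
t_r2 = n_minus_one_r_squared(genotypes, status)
print(t_score, t_r2)
```

With several markers the same comparison would require a generalized inverse of the *L* × *L* variance matrix, which this sketch deliberately leaves out.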

ELECTRONIC-DATABASE INFORMATION

URLs for data in this article are as follows:

R package, http://www.r-project.org/

Package haplo.stats, http://mayoresearch.mayo.edu/mayo/research/schaid_lab/software.cfm

The HapMap website, http://www.hapmap.org

The R code for simulations is available by sending an email to W-YL, wt.ude.utn@60024829d

- Abecasis GR, Cherny SS, Cardon LR. The impact of genotyping error on family-based analysis of quantitative traits. Eur J Hum Genet. 2001;9:130–134.
- Anderson MJ. A new method for non-parametric multivariate analysis of variance. Aust Ecol. 2001;26:32–46.
- Balding DJ. A tutorial on statistical methods for population association studies. Nat Rev Genet. 2006;7:781–791.
- Bardel C, Darlu P, Genin E. Clustering of haplotypes based on phylogeny: how good a strategy for association testing? Eur J Hum Genet. 2006;14:202–206.
- Bourgain C, Genin E, Holopainen P, Mustalahti K, Maki M, Partanen J, Clerget-Darpoux F. Use of closely related affected individuals for the genetic study of complex diseases in founder populations. Am J Hum Genet. 2001;68:154–159.
- Cardon LR, Abecasis GR. Using haplotype blocks to map human complex trait loci. Trends Genet. 2003;19:135–140.
- Chapman JM, Cooper JD, Todd JA, Clayton DG. Detecting disease associations due to linkage disequilibrium using haplotype tags: A class of tests and the determinants of statistical power. Hum Hered. 2003;56:18–31.
- Cheng KF, Lin WJ. Simultaneously correcting for population stratification and for genotyping error in case-control association studies. Am J Hum Genet. 2007;81:726–743.
- Cheung VG, Nelson SF. Genomic mismatch scanning identifies human genomic DNA shared identical by descent. Genomics. 1998;47:1–6.
- Clayton DG, Chapman JM, Cooper JD. Use of unphased multilocus genotype data in indirect association studies. Genet Epidemiol. 2004;27:415–428.
- Cordell HJ, Clayton DG. A unified stepwise regression procedure for evaluating the relative effects of polymorphisms within a gene using case/control or family data: application to HLA in type 1 diabetes. Am J Hum Genet. 2002;70:124–141.
- Devlin B, Roeder K, Wasserman L. Genomic control for association studies: a semiparametric test to detect excess-haplotype sharing. Biostatistics. 2000;1:369–387.
- Dobson AJ. An Introduction to Generalized Linear Models. 2nd edition. Chapman & Hall/CRC; New York: 2002.
- Douglas JA, Boehnke M, Gillanders E, Trent JM, Gruber SB. Experimentally derived haplotypes substantially increase the efficiency of linkage disequilibrium studies. Nat Genet. 2001;28:361–364.
- Goeman JJ, van de Geer SA, de Kort F, van Houwelingen HC. A global test for groups of genes: testing association with a clinical outcome. Bioinformatics. 2004;20:93–99.
- Gower JC. Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika. 1966;53:325–338.
- Grant GR, Manduchi E, Cheung VG, Ewens WJ. Significance testing for direct identity-by-descent mapping. Ann Hum Genet. 1999;63:441–454.
- Heidema AG, Boer JMA, Nagelkerke N, Mariman ECM, van der A DL, Feskens EJM. The challenge for genetic epidemiologists: how to analyze large numbers of SNPs in relation to complex diseases. BMC Genet. 2006;7:23–37.
- Hudson RR. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics. 2002;18:337–338.
- Humphreys K, Iles MM. Fine-scale mapping in case-control samples using locus scoring and haplotype-sharing methods. BMC Genet. 2005;6:S74.
- Klei L, Roeder K. Testing for association based on excess allele sharing in a sample of related cases and controls. Hum Genet. 2007;121:549–557.
- Kuusisto J, Koivisto K, Kervinen K, Mykkanen L, Helkala E-L, Vanhanen M, Hanninen T, Pyorala K, Kesaniemi YA, Piekkinen P, Laasko M. Association of apolipoprotein E phenotypes with late onset Alzheimer's disease: population based study. Br Med J. 1994;309:636–638.
- McArdle BH, Anderson MJ. Fitting multivariate models to community data: a comment on distance-based redundancy analysis. Ecology. 2001;82:290–297.
- Morris RW, Kaplan NL. Testing for association with a case-parents design in the presence of genotyping errors. Genet Epidemiol. 2004;26:142–154.
- North BV, Sham PC, Knight J, Martin ER, Curtis D. Investigation of the ability of haplotype association and logistic regression to identify associated susceptibility loci. Ann Hum Genet. 2006;70:893–906.
- Rencher AC. Methods of Multivariate Analysis. 2nd edition. Wiley; New York: 2002.
- Schaid DJ. Relative efficiency of ambiguous vs. directly measured haplotype frequencies. Genet Epidemiol. 2002;23:426–443.
- Schaid DJ. Evaluating associations of haplotypes with traits. Genet Epidemiol. 2004a;27:348–364.
- Schaid DJ. The complex genetic epidemiology of prostate cancer. Hum Mol Genet. 2004b;13:R103–R121.
- Schaid DJ. Power and sample size for testing associations of haplotypes with complex traits. Ann Hum Genet. 2006;70:116–130.
- Schaid DJ, Rowland CM, Tines DE, Jacobson RM, Poland GA. Score tests for association between traits and haplotypes when linkage phase is ambiguous. Am J Hum Genet. 2002;70:425–434.
- Schaid DJ, McDonnell SK, Hebbring SJ, Cunningham JM, Thibodeau SN. Nonparametric tests of association of multiple genes with human disease. Am J Hum Genet. 2005;76:780–793.
- Sha Q, Chen H-S, Zhang S. A new association test using haplotype similarity. Genet Epidemiol. 2007;31:577–593.
- The International HapMap Consortium. The International HapMap Project. Nature. 2003;426:789–796.
- The International HapMap Consortium. A haplotype map of the human genome. Nature. 2005;437:1299–1320.
- Tintle NL, Ahn K, Mendell NR, Gordon D, Finch SJ. Characteristics of replicated single-nucleotide polymorphism genotypes from COGA: Affymetrix and Center for Inherited Disease Research. BMC Genet. 2005;6:S154.
- Tzeng J-Y, Zhang D. Haplotype-based association analysis via variance-components score test. Am J Hum Genet. 2007;81:927–938.
- Tzeng J-Y, Devlin B, Wasserman L, Roeder K. On the identification of disease mutations by the analysis of haplotype similarity and goodness of fit. Am J Hum Genet. 2003a;72:891–902.
- Tzeng J-Y, Byerley W, Devlin B, Roeder K, Wasserman L. Outlier detection and false discovery rates for whole-genome DNA matching. J Am Stat Assoc. 2003b;98:236–246.
- Tzeng J-Y, Chang S-M, Zhang D, Thomas DC, Davidian M. Regression-based multi-marker analysis for genome-wide association studies using haplotype similarity. 2007 submitted for publication.
- Van der Meulen MA, Te Meerman GJ. Haplotype sharing analysis in affected individuals from nuclear families with at least one affected offspring. Genet Epidemiol. 1997;14:915–920.
- Wessel J, Schork NJ. Generalized genomic distance-based regression methodology for multilocus association analysis. Am J Hum Genet. 2006;79:792–806.
- Yuan A, Yue Q, Apprey V, Bonney G. Detecting disease gene in DNA haplotype sequences by nonparametric dissimilarity test. Hum Genet. 2006;120:253–261.
- Zaykin DV, Westfall PH, Young SS, Karnoub MA, Wagner MJ, Ehm MG. Testing association of statistically inferred haplotypes with discrete and continuous traits in samples of unrelated individuals. Hum Hered. 2002;53:79–91.
