Home | About | Journals | Submit | Contact Us | Français |

**|**HHS Author Manuscripts**|**PMC2860453

Formats

Article sections

- Abstract
- 1 INTRODUCTION
- 2 METHODS
- 3 APPLICATION TO HAPLOTYPE ANALYSIS
- 4 SIMULATION STUDIES
- 5 DATA ANALYSIS
- 6 DISCUSSION
- REFERENCES

Authors

Related links

J Am Stat Assoc. Author manuscript; available in PMC 2010 April 27.

Published in final edited form as:

J Am Stat Assoc. 2009 September 1; 104(487): 1251–1260.

doi: 10.1198/jasa.2009.tm08507PMCID: PMC2860453

NIHMSID: NIHMS112179

Zuoheng Wang, Department of Statistics, University of Chicago, Chicago, IL 60637 (E-mail: ude.ogacihcu.notlag@gnawz).

See other articles in PMC that cite the published article.

We propose an incomplete-data, quasi-likelihood framework, for estimation and score tests, which accommodates both dependent and partially-observed data. The motivation comes from genetic association studies, where we address the problems of estimating haplotype frequencies and testing association between a disease and haplotypes of multiple tightly-linked genetic markers, using case-control samples containing related individuals. We consider a more general setting in which the complete data are dependent with marginal distributions following a generalized linear model. We form a vector **Z** whose elements are conditional expectations of the elements of the complete-data vector, given selected functions of the incomplete data. Assuming that the covariance matrix of **Z** is available, we form an optimal linear estimating function based on **Z**, which we solve by an iterative method. This approach addresses key difficulties in the haplotype frequency estimation and testing problems in related individuals: (1) dependence that is known but can be complicated; (2) data that are incomplete for structural reasons, as well as possibly missing, with different amounts of information for different observations; (3) the need for computational speed in order to analyze large numbers of markers; (4) a well-established null model, but an alternative model that is unknown and is problematic to fully specify in related individuals. For haplotype analysis, we give sufficient conditions for consistency and asymptotic normality of the estimator and asymptotic χ^{2} null distribution of the score test. We apply the method to test for association of haplotypes with alcoholism in the GAW 14 COGA data set.

Genetic association analysis is an important tool for searching for genetic variants that play a role in human complex disease. Typical approaches compare the frequency distribution of marker alleles, genotypes or haplotypes between cases and controls. A genetic *marker* is a segment of DNA sequence with an identifiable physical location, and it is said to be *polymorphic* if its sequence varies across the population. Each of the variant forms of a polymorphic marker is called an *allele*. For autosomal genetic markers, a human inherits two copies of the marker, one from each parent. An individual’s *genotype* at a marker is the observation of the two alleles of the individual at that marker. The *kinship coefficient* between a pair of related individuals is determined by their relationship and is the probability that, at any given marker, a randomly chosen pair of alleles, one from each individual, is identical by descent, i.e. is an inherited copy of the same founder allele. For example, the kinship coefficient for a parent-offspring or sibling pair is 1/4, while that for a grandparent-grandchild, avuncular, or half-sibling pair is 1/8. In genetic association analysis, the most widely-used type of marker is currently the *single nucleotide polymorphism* (SNP), which is a DNA sequence variation that occurs at a single nucleotide and generally has two alleles. When *q* markers are considered simultaneously, the ordered *q*-tuple of alleles inherited from the same parent defines a *haplotype*. SNP haplotypes are more informative for disease mapping than are single SNPs. By capturing the dependence structure among multiple markers, haplotypes potentially provide information on untyped variants as well as on interactions among tightly-linked, typed variants. However, standard genotyping methods do not provide the information of the individual’s two haplotypes. Only the unphased genotype (i.e. the observation of the two alleles at each marker, but not their parental origin) is generated. As a consequence, we are faced with an incomplete-data problem for haplotype analysis.

There is a vast literature on statistical methods for analysis of haplotype association with genetic traits in samples of unrelated individuals; a relatively recent review paper is Schaid (2004). For related individuals, the approaches can roughly be divided into two types: “case-control,” which is the type of approach we pursue, and “family-based.” Family-based haplotype association tests, such as FBAT (Horvath et al. 2004), condition on both trait values and parental genotypes. When some or all parental genotype data are missing, the analysis is conditional on sufficient statistics for parental genotypes. Generally, this conditioning approach has the advantage of making the analysis more robust to population structure and the disadvantage of losing power compared to case-control studies (Risch and Teng 1998). Family-based methods also typically require genotype data for family members of an affected individual. Thus they do not allow unrelated individuals in the analysis. In addition to being more powerful, case-control analyses are less restrictive than family-based approaches, because they can allow but do not require genotype data for relatives of affected individuals. For case-control studies that include related individuals, where general relationship types are allowed, methods for single-marker allele frequency estimation (McPeek, Wu, and Ober 2004) and for single-marker association testing (Slager and Schaid 2001; Bourgain et al. 2003; Thornton and McPeek 2007) have been proposed. Haplotype-based association analysis for case-control samples consisting of related individuals has been considered by Browning et al. (2005) and Zhang, Schneider, Ober, and McPeek (2005).

Even for single-marker genetic analysis, sampling of related individuals can introduce complicated dependence structure. In addition, current genetic association analyses typically consider on the order of 10^{4} to 10^{6} markers throughout the genome, putting a premium on computational speed of inference. Furthermore, although the genetic model of the disease is typically unknown, the first and second moment structure of genotype data, under the null hypothesis of no association and no linkage, is a known function of the pedigree information and the unknown allele frequency. For these reasons, quasi-likelihood estimation and quasi-likelihood score tests have proved useful for single-marker, case-control, genetic association studies with related individuals (Bourgain et al. 2003; McPeek et al. 2004; Thornton and McPeek 2007). Quasi-likelihood inference (reviewed in McCullagh and Nelder 1989, chap. 9; Heyde 1997) was introduced by Wedderburn (1974) and is closely connected to the theory of optimal estimating functions (Godambe 1960). To extend previous quasi-likelihood methods to haplotype-based case-control association testing in related individuals, we need to allow for both incomplete data and dependence of observations. Recent work on estimating equations in the context of incomplete data includes Heyde and Morton (1996) and Elashoff and Ryan (2004).

We propose an incomplete-data quasi-likelihood (IQL) framework to address both dependent and partially-observed data. Under a generalized linear model for the marginal distributions of the complete data, we form an optimal linear estimating function, which we call an incomplete-data quasi-likelihood score (IQLS) function, based on certain conditional expectations of the elements of the complete-data vector. In doing this, we take advantage of the availability of a correctly-specified covariance matrix in the genetics applications we consider. We provide detailed development of the method for the specific contexts of haplotype frequency estimation and testing for association of haplotypes with a genetic disease, and we apply it to test for association of haplotypes with alcoholism in the GAW 14 COGA data set.

Our genetics applications involve genotypes and haplotypes, which are modeled using dependent binomial random variables whose parameters depend on covariates. In this section, we develop our IQLS methods for a more general setting of dependent observations, in which the marginal distribution of each observation is a generalized linear model.

Suppose that the complete data consist of pairs (*Y _{i}*,

We emphasize the following key features of the inference problems we consider: (1) The effects of interest can be described by a model for **μ** as a known function of the unknown parameter **β**. (2) The matrix **Σ** is also a known function of **β**. This latter point distinguishes our setting from, for example, that assumed in the GEE approach (Liang and Zeger 1986; Zeger and Liang 1986) in which **Σ** is generally unknown. The assumption that **Σ** is a known function of **β** is applicable to practical applications in genetics such as allele and haplotype frequency estimation and association mapping.

In our setting, the quasi-likelihood score function is

$$\mathbf{U}(\mathit{\beta})={\mathbf{D}}^{T}{\mathbf{\Sigma}}^{-1}(\mathbf{Y}-\mathit{\mu}(\mathit{\beta})).$$

(1)

Here **D** is the derivative matrix **μ**/**β**. Even when the marginal log-likelihood *l _{Yi}* is not correctly specified, provided that

We are particularly interested in the case when the complete data **Y** are partially observed. In that case, let **I** be the observed data (incomplete data). In some contexts, it might be natural to express **I** as **I** = (*I*_{1}, …, *I _{n}*), where

We use the following approach to form incomplete-data estimating functions. We define **Z** = (*Z*_{1}, …, *Z _{n}*)

$${\mathbf{U}}_{I}={\mathbf{F}}^{T}{\mathbf{\Omega}}^{-1}(\mathbf{Z}-\mathit{\mu}),$$

(2)

where **Ω**^{−1} is replaced by the Moore-Penrose generalized inverse if **Ω** is singular. An IQL estimator for **β** is a solution of **U**_{I}(**β**) = 0. We can typically obtain a solution by an iterative algorithm that is analogous to Newton-Raphson with Fisher scoring (Wedderburn 1974). Starting at **β**^{(0)} sufficiently close to , we iterate according to

$${\mathit{\beta}}^{\left(t+1\right)}={\mathit{\beta}}^{\left(t\right)}+{[{({\mathbf{F}}^{T}{\mathbf{\Omega}}^{-1}\mathbf{F})}^{-1}{\mathbf{F}}^{T}{\mathbf{\Omega}}^{-1}(\mathbf{Z}-\mathit{\mu})]}_{\mathit{\beta}={\mathit{\beta}}^{\left(t\right)}},$$

(3)

where **β**^{(t)} is the parameter estimate at the *t*th iteration.

The IQLS function has information

$$\mathbf{\Lambda}\left({\mathbf{U}}_{I}\right)=\text{Cov}\left({\mathbf{U}}_{I}\right)=-E\left(\partial {\mathbf{U}}_{I}/\partial \mathit{\beta}\right)={\mathbf{F}}^{T}{\mathbf{\Omega}}^{-1}\mathbf{F}$$

(4)

and is the optimal estimating function within the class of linear, unbiased, estimating functions *H _{I}* = {

Consider the special case when **I** can be expressed as **I** = (*I*_{1}, …, *I _{n}*), where

One would expect that the optimal choice of *f _{i}*(

Heyde and Morton (1996) consider the problem of quasi-likelihood estimation with incomplete data. Suppose that **Q** is the complete-data quasi-score within some convex class *H _{Y}* of unbiased, square-integrable, complete-data, estimating equations. Given a convex class

Elashoff and Ryan (2004) give an approach to constructing an estimating function for incomplete data based on an estimating function for complete data, with a practical computational method for obtaining the estimate. Applying the Elashoff-Ryan approach to our setting would yield estimating function **U**_{ER} = *E*(**U**|**I**) = **D**^{T}**Σ**^{−1}[*E*(**Y**|**I**)−**μ**], with information **Λ**(**U**_{ER}) = **F**^{T}**Σ**^{−1}**D**(**D**^{T}**Σ**^{−1}**Ω****Σ**^{−1}**D**)^{−1}**D**^{T}**Σ**^{−1}**F**. One drawback of applying the Elashoff-Ryan approach to the haplotype inference problems we consider is that it is not always practical to obtain *E*(**Y**|**I**). Another consideration is that **U**_{ER} is, in general, sub-optimal in the class of linear unbiased estimating functions {**H**^{T}(**β**)[*E*(**Y**|**I**)−**μ**(**β**)] : **H**(**β**) is an *n* × *r* matrix}. In our setting, when it is practical to obtain *E*(**Y**|**I**), the IQLS function will be optimal in that class. Thus, for example, if is the IQL estimator and solves **U**_{ER} = 0, then the relative efficiency of **c**^{T} compared to the **c**^{T} (or equivalently, the ratio of non-centrality parameters for testing in the direction specified by **c**), is [**c**^{T}**Λ**(**U**_{I})**c**]/[**c**^{T}**Λ**(**U**_{ER})**c**] ≥ 1, for every choice of real vector **c** of length *r*, where **Λ**(**U**_{I}) is given by Equation (4).

We also consider the problem of hypothesis testing in the case when the unknown parameters **β** can be partitioned as **β** = (**γ**^{T},**α**^{T})^{T}, where **γ** are the parameters of interest and **α** are nuisance parameters. Suppose we want to test the null hypothesis that **γ** = **γ**_{0}. We partition the vector **U**_{I} into ${\mathbf{U}}_{I\gamma}={\mathbf{F}}_{\gamma}^{T}{\mathbf{\Omega}}^{-1}(\mathbf{Z}-\mathit{\mu})$ and ${\mathbf{U}}_{{I}_{\alpha}}={\mathbf{F}}_{\alpha}^{T}{\mathbf{\Omega}}^{-1}(\mathbf{Z}-\mathit{\mu})$, where **F**_{γ} = −*E*((**Z** − **μ**)/**γ**) and **F**_{α} = −*E*((**Z** − **μ**)/**α**). Under the null hypothesis, the nuisance parameter **α** is estimated by _{0} which solves **U**_{Iα} = **0** when **γ** = **γ**_{0}. Consider the information matrix **Λ**(**U**_{I}) of Equation (4), which we assume is positive definite. We can think of **Λ** as being composed of four blocks: **Λ**_{γγ} which is the (**γ,γ**)th diagonal block of **Λ**, **Λ**_{αα} which is the (**α,α**)th diagonal block of **Λ**, and two off-diagonal blocks: **Λ**_{γα} and **Λ**_{αγ}, which are transposes of each other. Similarly, we partition **Λ**^{−1} into **Λ**^{γγ}, **Λ**^{αα}, and **Λ**^{γα}, **Λ**^{αγ}. Then, in analogue to the classical likelihood-based score statistic in the presence of nuisance parameters, the IQLS statistic has the following form

$$W={\mathbf{U}}_{I\mathrm{\gamma}}^{T}({\mathbf{\gamma}}_{0},{\widehat{\mathit{\alpha}}}_{0}){\mathbf{\Lambda}}^{\mathrm{\gamma}\mathrm{\gamma}}({\mathbf{\gamma}}_{0},{\widehat{\mathbf{\alpha}}}_{0}){\mathbf{U}}_{I\mathrm{\gamma}}({\mathbf{\gamma}}_{0},{\widehat{\mathit{\alpha}}}_{0}),$$

(5)

where **Λ**^{γγ} can be expressed as ${\mathbf{\Lambda}}^{\mathrm{\gamma}\mathrm{\gamma}}={({\mathbf{\Lambda}}_{\mathrm{\gamma}\mathrm{\gamma}}-{\mathbf{\Lambda}}_{\mathrm{\gamma}\alpha}{\mathbf{\Lambda}}_{\alpha \alpha}^{-1}{\mathbf{\Lambda}}_{\alpha \mathrm{\gamma}})}^{-1}$. More explicitly, we obtain the following equivalent form of the IQLS test statistic

$$W={\{{(\mathbf{Z}-\mathit{\mu})}^{T}{\mathbf{\Omega}}^{-1}{\mathbf{F}}_{\mathrm{\gamma}}{[{\mathbf{F}}_{\gamma}^{\mathbf{T}}{\mathbf{\Omega}}^{-1}{\mathbf{F}}_{\mathrm{\gamma}}-{\mathbf{F}}_{\mathrm{\gamma}}^{T}{\mathbf{\Omega}}^{-1}{\mathbf{F}}_{\alpha}{({\mathbf{F}}_{\alpha}^{T}{\mathbf{\Omega}}^{-1}{\mathbf{F}}_{\alpha})}^{-1}{\mathbf{F}}_{\alpha}^{T}{\mathbf{\Omega}}^{-1}{\mathbf{F}}_{\mathrm{\gamma}}]}^{-1}{\mathbf{F}}_{\gamma}^{\mathrm{T}}{\mathbf{\Omega}}^{-1}(\mathbf{Z}-\mathit{\mu})\}}_{(\mathbf{\gamma},\mathbf{\alpha})=({\mathbf{\gamma}}_{0},{\widehat{\mathit{\alpha}}}_{0})},$$

where the entire right-hand side is evaluated at (**γ, α**) = (**γ**_{0}, _{0}. Under regularity conditions, the asymptotic null distribution of *W* is χ^{2} with degrees of freedom equal to the dimension of **γ**. (For the haplotype association testing problem, sufficient conditions are given in the Appendix.) Note that in order to obtain *W*, it is not necessary to have a fully specified joint alternative distribution. It is sufficient to provide **μ** under the alternative and (**Z**/**β**)_{(γ,α)=(γ0,0)}, which, in the haplotype testing application, can be derived under relatively weak assumptions on the alternative model.

We apply the IQLS method to the problems of haplotype frequency estimation (sub- section 3.1) and testing for association between haplotypes and a binary genetic trait (subsection 3.2) in samples that contain related individuals. The following general set-up applies to both problems.

Suppose we have a sample of *m* outbred individuals, some of whom may be related with relationships specified by known pedigrees. (We assume a random sampling scheme under which the event that any particular set of individuals in the population is sampled is conditionally independent of their haplotypes, given the pedigree information and covariate values **x**.) Consider a set of *q* ordered, tightly-linked SNPs, where each SNP is a binary marker with alleles labeled 0 and 1. (Here, *tightly-linked* means that recombination among the SNPs is observed only over large time scales of many generations and is not observed within families.) Let = {**h**_{1}, …, **h**_{a+1}} {0, 1}^{q} be the set of *a* + 1 distinct possible observed haplotypes for the *q* SNPs. In practice, we can have *a* + 1 < 2^{q} when certain haplotypes do not occur in nature. For simplicity, we assume that is known. The complete-data vector of length *n* = *ma* is taken to be **Y** = (*Y*_{11}, …, *Y*_{1a}, …, *Y*_{m1}, …, *Y _{ma}*)

For individual *i*, 1 ≤ *i* ≤ *m*, let **H**_{i} = (**H**_{i1}, **H**_{i2}) denote *i*’s ordered pair of haplotypes. Instead of observing **H**_{i} for each *i*, we observe only the *q*-vector of SNP genotypes **G**_{i} = **H**_{i1}+**H**_{i2}, provided *i*’s genotype is not missing at any marker. If *i*’s genotype is missing at one or more markers, then the corresponding entries of the vector **G**_{i} are not directly observed. Let **M** be the *mq*-vector **M** = (*M*_{11}, …, *M*_{1q}, …, *M*_{m1}, …, *M _{mq}*), where

We first consider the problem of estimating the haplotype frequency distribution **p** = (*p*_{1}, …, *p _{a}*)

With incomplete data, we form **Z** = (*Z*_{11}, …, *Z*_{1a}, …, *Z*_{m1}, …,*Z _{ma}*)

We consider the problem of testing for association between a binary trait, which we call the *phenotype* (typically a disease with some genetic component), and haplotypes for a set of tightly-linked SNPs. We use the IQLS approach to extend the single-marker, complete-data *M _{QLS}* test of Thornton and McPeek (2007) to the

For association studies of complex traits in samples of related individuals, there is an “enrichment effect,” namely, that affected individuals with affected relatives have a higher expected frequency of the alleles that increase susceptibility for a complex disease than do affected individuals who do not have affected relatives. We allow for this enrichment effect by including **Ã** = **KA** instead of **A** as a covariate in our generalized linear model, where **K** is the kinship matrix. We take *Y _{ij}* ~ .5 × Bin(2,

$$\text{logit}({p}_{ij})={\alpha}_{j}+{\tilde{A}}_{i}{\mathrm{\gamma}}_{j}={\alpha}_{j}+{\mathrm{\gamma}}_{j}{\displaystyle \sum _{k=1}^{m}2{\varphi}_{ik}{A}_{k}.}$$

(6)

The 2*a* × 1 parameter vector is **β** = (**γ**^{T}, **α**^{T})^{T} where **γ** = (γ_{1}, …, γ_{a})^{T} is the parameter of interest, which measures trait-haplotype association, and **α** = (α_{1}, …, α_{a})^{T} is a nuisance parameter whose entries are the logits of the haplotype frequencies unconditional on phenotype. The form of Equation (6) is justified by an asymptotic argument applied to a single-gene model with effect size tending to zero (see Thornton and McPeek 2007). Note that the definition of **Ã** can be easily extended to allow for contributions from relatives who are phenotyped but not genotyped, by extending the summation over *k* in Equation (6) to include these additional relatives. The null hypothesis of no trait-haplotype association is *H*_{0} : **γ** = **0**, while the alternative is **γ** ≠ **0**. Under *H*_{0}, assuming no linkage, the first and second moment structure for **Y** is the same as in the previous subsection. Applying Equation (5) to the special case of complete data (**Z** = **Y**), we obtain a quasi-likelihood score statistic that is equal to the *M _{QLS}* test statistic of Thornton and McPeek (2007).

To form the *M _{IQLS}* test statistic with incomplete data, we consider the same two choices of

When expected counts of some haplotypes are small, one can usually increase power by pooling haplotypes to reduce the degrees of freedom, *a*. In the analysis of 5-SNP haplotypes in section 5, we combine haplotypes by a variation on the “clustering” method of Tzeng, Devlin, Wasserman, and Roeder (2003). This involves grouping rare haplotypes with more common ones that differ at a single site until every category has estimated frequency ≥ .05, while rare haplotypes that differ at two or more sites from every other sampled haplotype are pooled to form a new category.

To clarify the relationship of IQL to likelihood-based haplotype methods (e.g. Lin and Zeng (2006), developed for unrelated individuals), we describe a close connection between *M _{IQLS}* and the score test based on a retrospective likelihood. Assume without loss of generality that the first

$$P\left(\mathbf{A}|\mathbf{Y}\right)=P\left(\mathbf{X}|\mathbf{Y}\right)={\displaystyle \prod _{i=1}^{\rho}\frac{{e}^{\left({\beta}_{0}+{\pi}_{i}\right){X}_{i}}}{1+{e}^{\left({\beta}_{0}+{\pi}_{i}\right)}},}$$

(7)

where ${\pi}_{i}={\displaystyle {\sum}_{j=1}^{a}{r}_{j}{Y}_{ij}}$, with the parameter **r** = (*r*_{1}, …, *r _{a}*)

$$P\left(\mathbf{A}|\mathbf{Y}\right)=P\left(\mathbf{X}|\mathbf{Y}\right)={P}_{0}\left(\mathbf{X}\right)c\left(\mathbf{Y},\mathbf{r}\right){\displaystyle \prod _{i=1}^{\rho}{e}^{{\pi}_{i}{X}_{i}}},$$

(8)

where $c(\mathbf{Y},\mathbf{r})={[{\sum}_{\mathbf{x}\in {\{0,1\}}^{\rho}}{P}_{0}\left(\mathbf{x}\right){\prod}_{i=1}^{\rho}\text{exp}\left({\pi}_{i}{x}_{i}\right)]}^{-1}$ is the normalizing constant and *P*_{0}(**X**) is the joint distribution of **X** when **r** = **0**. Here *P*_{0}(**X**) allows for dependence among the *X _{i}*’s and need not be specified. We assume that

$${L}_{I}=P\left(\mathbf{I}|\mathbf{A}\right)={\displaystyle \sum _{\mathbf{Y}}P\left(\mathbf{Y}|\mathbf{A}\right)P(\mathbf{I}|\mathbf{Y})=}{\displaystyle \sum _{\mathbf{Y}\subset \mathbf{I}}P\left(\mathbf{Y}|\mathbf{A}\right)=}{\displaystyle \sum _{\mathbf{Y}\subset \mathbf{I}}{L}_{Y}},$$

(9)

where the second and third summations are over all complete-data vectors **Y** that are compatible with the observed data **I**. As a consequence, the incomplete haplotype data retrospective likelihood score function, * _{I}* satisfies

$${\stackrel{.}{l}}_{I}=E({\stackrel{.}{l}}_{Y}|\mathbf{I}),$$

(10)

where * _{Y}* is the complete haplotype data retrospective likelihood score function. Using the above observations, it is easy to verify that when the nuisance parameter,

In section 4, we perform simulation studies to assess our IQL haplotype frequency estimators and *M _{IQLS}* disease-haplotype association tests, and we compare them to a few other tests and estimators, which are described in this subsection.

In our IQLS haplotype tests and estimators, calculation of the covariance matrix **Ω** is one of the more challenging aspects. Particularly in the estimation context, it is natural to wonder how much is gained by accounting for relatedness of individuals in the calculation of **Ω**. Therefore, we also consider what we call the *equal-weight* (EW) estimating function, **U**_{EW}. Let **Z**_{i} = (*Z*_{i1}, …, *Z _{ia}*)

Because of its close connection to prior work (Elashoff and Ryan 2004; Browning et al. 2005; Zhang et al. 2005), the estimating function, **U**_{CDW}, which we call the *complete-data-weight* (CDW) estimating function, is also considered. We define ${\mathbf{U}}_{CDW}={\displaystyle {\sum}_{i=1}^{m}{w}_{i}{\mathbf{F}}_{i}^{T}{\mathbf{\Omega}}_{i}^{-1}({\mathbf{Z}}_{i}-{\mathit{\mu}}_{i})}$, where ${w}_{i}\phantom{\rule{thinmathspace}{0ex}}={({1}_{m}^{T}{\mathbf{K}}^{-1})}_{i}$. In the special case of complete data, we have **U**_{CDW} = **U**, where **U** is the complete-data, quasi-likelihood score function given in subsection 3.1. The CDW estimator is closely related to the estimator of Zhang et al. (2005), who applied essentially the same estimating function to a somewhat different context of control haplotype-frequency estimation based on the non-transmitted haplotypes of parent/affected-child trios, under the further constraint that the haplotype frequencies follow a Markov model. Furthermore, **U**_{CDW} is equivalent to the estimating function, **U**_{ER}, of Elashoff and Ryan (2004) in the special case when both of the following hold: (1) *f _{ij}*(

For complete-data, case-control association testing with related individuals, Bourgain et al. (2003) formed a quasi-likelihood score test statistic (*W _{QLS}*) using Equation (5), but instead of modeling the mean by Equation (6), they used the simpler model

$$\text{logit}({p}_{ij})={\alpha}_{j}+{X}_{i}{\mathrm{\gamma}}_{j},$$

(11)

where *X _{i}* = 1 if

To assess MSE of the IQL haplotype frequency estimators and type I error and power of the IQLS haplotype association tests, we consider three study designs, which are similar to those of Thornton and McPeek (2007). In the first, 60 outbred, three-generation, 16-person pedigrees are sampled, where 20 of the pedigrees have 4 affected individuals, 20 have 5, and 20 have 6. The pattern of affected and unaffected individuals varies randomly according to the trait models described in the next paragraph. In each sampled pedigree, phenotypes for all 16 individuals are observed. An individual’s genotypes are observed if and only if at least 30% of the individual’s siblings, parents, and offspring are affected. The second study design is similar to the first with two differences: (1) an additional 200 unrelated, unaffected controls are included in the study; and (2) for each individual in a sampled pedigree, the individual’s genotypes are observed if and only if at least half of the individual’s siblings, parents, and offspring are affected. The third study design consists of 120 pedigrees of the above structure, where exactly 40 have 1 affected individual, 40 have 2, and 40 have 3. Phenotypes are observed for all individuals, and genotypes are observed for all affected individuals plus two unaffected individuals randomly chosen in each pedigree. To assess the impact of parental genotype information, we further simulate parental genotypes for all genotyped individuals and compare the results with and without this additional parental genotype information.

We simulate under multilocus trait models I and III of Thornton and McPeek (2007), where we replace their SNP 2 with a trio, *T*_{2}, of tightly-linked SNPs for which there are 6 possible haplotypes. We replace alleles 1 and 0 of their SNP 2 by the sets of haplotypes {101, 110} and {000, 001, 010, 100}, respectively, for *T*_{2}. Table 1 gives haplotype frequencies for *T*_{2} with resulting population prevalence, *k*, and sibling risk ratio, λ_{s} = *k _{s}*/

In Table 2, we compare 6 incomplete-data, haplotype-frequency estimators from section 3, where the subscript *a* or *b* denotes the choice for *f _{ij}*(

Ratios of MSE for estimation of haplotype frequency *p*, based on 5,000 simulated replicates, where, for each estimator, the ratio of its MSE to the MSE of *IQL*_{b} is reported

*IQL _{b}*, which incorporates parental genotypes and uses optimal weights, has the smallest MSE in all cases.

We compare score tests based on **U**_{I}, **U**_{EW}, and **U**_{CDW}, where we denote the resulting tests by *IQLS*, *EWS*, and *CDWS*, respectively. We consider each of these with options (a) and (b) for *f _{ij}*, where these are denoted by subscripts

To further investigate the effect of the choice of *f _{ij}* on the power of the IQLS tests, in addition to choices (a) and (b) described above, we would ideally like to compare the choice

To assess type I error, association is tested with haplotypes at a trio of tightly-linked SNPs that are unlinked and unassociated with any causal variant, where individuals are sampled according to design 1 with model Ia. Based on 5,000 replicates, we find that for all 14 tests, type I error is not significantly different from the nominal when *α* = .05 or .01 (results not shown).

In Table 3, we compare power to detect trait-haplotype association at the trio *T*_{2} of tightly-linked SNPs that are associated with the trait, under designs 2 and 3. In every setting, the *M _{IQLS}* tests perform best, with

Interestingly, although the *CDW* approach performs substantially better than the *EW* approach for estimation, it performs noticeably worse for association testing, probably because the *EW* approach tends to have a built-in enrichment effect in these study designs, i.e. it effectively gives extra weight to affected individuals with affected relatives in the sample. For the *IQLS* and *CDWS* tests, use of the mean model of Equation (6) always provided at least as much power, and usually more, than did use of the mean model of Equation (11), illustrating the enrichment effect. For the *EWS* test, because there is already a built-in enrichment effect, one can often have worse power with Equation (6) than with Equation (11).

We apply our IQLS score tests to study association of haplotypes with alcoholism in the GAW 14 COGA data set described in detail in Edenberg et al. (2005). Of the 1,614 individuals from 143 families, we include only those coded as “white, non-Hispanic.” After exclusion of two individuals with apparent relationship errors and five who appear to have mixed or misclassified ethnicity, 1,067 individuals remain. We classify as affected those who are “affected with ALDX1” or who have “symptoms of ALDX1.” We classify as unaffected those coded as “pure unaffected,” and we classify as missing phenotype those coded as “never drank alcohol.” By these criteria, there are 855 affected, 199 unaffected and 13 missing phenotype. Among these 1,067 individuals, there are 1,023 individuals with available SNP data. We analyze the 10,407 autosomal SNPs in the data set having minor allele frequency > .01.

We test for association with haplotypes based every adjacent pair of SNPs, using the *M _{IQLS}* with

We follow up the 2-SNP haplotype analysis with a 5-SNP haplotype analysis, based on a sliding window, across all SNPs on each of the three chromosomes on which we detected association. For each 5-SNP window, we reduce the degrees of freedom of the test by clustering haplotypes as described in subsection 3.2. In addition, we compute *p*-values using a wide range of values for the estimated population prevalence, *k*. Table 5 gives the three windows for which the resulting *M _{IQLS}* association test has

The first set of five SNPs in Table 5 is in the intron of a candidate tumor suppressor gene that encodes protein tyrosine phosphatase receptor type G (PTPRG [Affymetrix; Entrez Gene]), located at 3p21-p14. The second set of five SNPs in Table 5 is 251 kb from the gene encoding vestigial like 3 (VGLL3 [Affymetrix; Entrez Gene]), located at 3p12.1. The set of five SNPs on chromosome 2 is in the intron of the gene encoding solute carrier family 8 member 1 (SLC8A1 [Affymetrix; Entrez Gene]), located at 2p23-p22. The two SNPs on chromosome 10 are 178 kb from the gene encoding sortilin-related VPS10 domain containing receptor 1 (SORCS1 [Affymetrix; Entrez Gene]), located at 10q23-q25. To our knowledge, none of these genes are obvious candidates for alcoholism.

Genetic association analysis is widely used to identify genetic variants that underlie human complex disease. Haplotypes are more informative for disease mapping than are single markers, because haplotypes can provide additional information on untyped variants as well as on interactions among tightly-linked, typed variants. However, we generally have only partial haplotype information on each individual. Dependence among observations in an association study, due to relatedness of sampled individuals, commonly arises when families sampled for a linkage study are included in an association study. We introduce the IQLS method for assessment of disease association with haplotypes in case-control studies in which some individuals may be related. The method is applicable to completely general samples of related and unrelated individuals, including large complex pedigrees as well as samples that combine small families with unrelated individuals. The IQLS method effectively addresses the problems of dependent and partially informative observations in a way that is computationally practical and that does not require specification of a full joint distribution.

In simulation studies, the *M _{IQLS}* tests perform well for a range of complex disease models. We consider the impact of each of the following features of our

The IQLS method for parameter estimation and hypothesis testing is presented in a somewhat more general setting in which the complete data are dependent with marginal distributions following a generalized linear model. When the first and second moment structure can be correctly specified, the quasi-likelihood score method provides a way to draw inference without specifying the full joint distribution. This is advantageous when the MLE is very difficult to calculate or when there is in-sufficient information to construct a likelihood function. With incomplete data, we form an IQLS function, which is a quasi-likelihood score function based on a vector **Z** whose elements are conditional expectations of the elements of the complete-data vector, given selected functions of the incomplete data. We use an iterative algorithm to obtain the IQL parameter estimates. The computational complexity of the method will depend on the choice of conditioning events. There is wide latitude in this choice, provided that the covariance matrix and expected derivative of the conditional expectations are correctly specified. (For the testing problem, these need only be evaluated under the null hypothesis.) An estimated variance-covariance matrix for the IQL estimator, appropriately adjusted for incomplete data, is obtained as part of the iterative algorithm. Under regularity conditions, the IQL estimator is consistent and asymptotically normal with minimum asymptotic variance among estimators that are solutions to linear unbiased estimating equations based on **Z**.

This study was supported by National Institutes of Health grants R01HG001645 and U01HL084715. Data were provided by the Collaborative Study on the Genetics of Alcoholism (U10AA008401) through the Genetics Analysis Workshop (R01GM031575). The authors thank the editor, an associate editor and two anonymous referees for helpful comments.

For haplotype analysis, we give sufficient conditions for consistency and asymptotic normality of the IQL estimators and asymptotic *χ*^{2} null distribution of the IQLS score tests. Suppose that the sampled individuals form *b* independent families, where we let *b* → ∞. (Note that a single individual unrelated to anyone else in the sample is considered a special case of a family.) Define a family configuration to be a particular constellation of typed relatives with a particular pattern of missing genotype information at the given tightly-linked SNPs. Assume that there are only a finite number of possible configurations allowed, indexed by *c* = 1, …, *C*. Each configuration has an associated information **Λ**_{c} for a family having that configuration. Assume that there exist *p _{c}* ≥ 0, for 1 ≤

Zuoheng Wang, Department of Statistics, University of Chicago, Chicago, IL 60637 (E-mail: ude.ogacihcu.notlag@gnawz).

Mary Sara McPeek, Departments of Statistics and Human Genetics, University of Chicago, Chicago, IL 60637 (E-mail: ude.ogacihcu.notlag@keepcm)..

- Bourgain C, Hoffjan S, Nicolae R, Newman D, Steiner L, Walker K, Reynolds R, Ober C, McPeek MS. Novel Case-Control Test in a Founder Population Identifies P-Selectin as an Atopy-Susceptibility Locus. American Journal of Human Genetics. 2003;73:612–626. [PubMed]
- Browning SR, Briley JD, Briley LP, Chandra G, Charnecki JH, Ehm MG, Johansson KA, Jones BJ, Karter AJ, Yarnall DP, Wagner MJ. Case-Control Single-Marker and Haplotypic Association Analysis of Pedigree Data. Genetic Epidemiology. 2005;28:110–122. [PubMed]
- Edenberg HJ, Bierut LJ, Boyce P, Cao M, Cawley S, Chiles R, Doheny KF, Hansen M, Hinrichs T, Jones K, Kelleher M, Kennedy GC, Liu G, Marcus G, McBride C, et al. Description of the Data from the Collaborative Study on the Genetics of Alcoholism (COGA) and Single-Nucleotide Polymorphism Genotyping for Genetic Analysis Workshop 14. BMC Genetics. 2005;6 Suppl. 1:S2. [PMC free article] [PubMed]
- Elashoff M, Ryan L. An EM Algorithm for Estimating Equations. Journal of Computational and Graphical Statistics. 2004;13:48–65.
- Godambe VP. An Optimum Property of Regular Maximum Likelihood Estimation. The Annals of Mathematical Statistics. 1960;31:1208–1211.
- Heyde CC. Quasi-likelihood and its Application: a General Approach to Optimal parameter Estimation. New York: Springer; 1997.
- Heyde CC, Morton R. Quasi-Likelihood and Generalizing the EM Algorithm. Journal of the Royal Statistical Society, Ser. B. 1996;58:317–327.
- Horvath S, Xu X, Lake SL, Silverman EK, Weiss ST, Laird NM. Family-Based Tests for Associating Haplotypes with General Phenotype Data: Application to Asthma Genetics. Genetic Epidemiology. 2004;26:61–69. [PubMed]
- Lehmann EL, Casella G. Theory of Point Estimation. 2nd edition. New York: Springer; 1998.
- Liang KY, Zeger SL. Longitudinal Data Analysis Using Generalized Linear Models. Biometrika. 1986;73:13–22.
- Lin DY, Zeng D. Likelihood-Based Inference on Haplotype Effects in Genetic Association Studies (with discussion) Journal of the American Statistical Association. 2006;101:89–118.
- McCullagh P, Nelder JA. Generalized Linear Models. 2nd edition. New York: Chapman and Hall; 1989.
- McPeek MS, Wu X, Ober C. Best Linear Unbiased Allele-Frequency Estimation in Complex Pedigrees. Biometrics. 2004;60:359–367. [PubMed]
- Nicolae DL. Quantifying the Amount of Missing Information in Genetic Association Studies. Genetic Epidemiology. 2006;30:703–717. [PubMed]
- Risch N, Teng J. The Relative Power of Family-Based and Case-Control Designs for Linkage Disequilibrium Studies of Complex Human Diseases I: DNA Pooling. Genome Research. 1998;8:1273–1288. [PubMed]
- Schaid DJ. Evaluating Associations of Haplotypes with Traits. Genetic Epidemiology. 2004;27:348–364. [PubMed]
- Slager SL, Schaid DJ. Evaluation of Candidate Genes in Case-Control Studies: a Statistical Method to Account for Related Subjects. American Journal of Human Genetics. 2001;68:1457–1462. [PubMed]
- Thornton T, McPeek MS. Case-Control Association Testing with Related Individuals: A More Powerful Quasi-Likelihood Score Test. American Journal of Human Genetics. 2007;81:321–337. [PubMed]
- Tzeng J-Y, Devlin B, Wasserman L, Roeder K. On the Identification of Disease Mutations by the Analysis of Haplotype Similarity and Goodness of Fit. American Journal of Human Genetics. 2003;72:891–902. [PubMed]
- Wedderburn RWM. Quasi-likelihood Functions, Generalized Linear Models, and the Gauss-Newton Method. Biometrika. 1974;61:439–447.
- Xie M, Yang Y. Asymptotics for Generalized Estimating Equations with Large Cluster Sizes. The Annals of Statistics. 2003;31:310–347.
- Yuan K-H. A Theorem of Uniform Convergence of Stochastic Functions with Applications. Journal of Multivariate Analysis. 1997;62:100–109.
- Yuan K-H, Jennrich RI. Asymptotics of Estimating Equations under Natural Conditions. Journal of Multivariate Analysis. 1998;65:245–260.
- Zeger SL, Liang KY. Longitudinal Data Analysis for Discrete and Continuous Outcomes. Biometrics. 1986;42:121–130. [PubMed]
- Zhang J, Schneider D, Ober C, McPeek MS. Multilocus linkage disequilibrium mapping by the decay of haplotype sharing with samples of related individuals. Genetic Epidemiology. 2005;29:128–140. [PubMed]

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |