PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of nihpaAbout Author manuscriptsSubmit a manuscriptHHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
 
J Am Stat Assoc. Author manuscript; available in PMC 2010 April 27.
Published in final edited form as:
J Am Stat Assoc. 2009 September 1; 104(487): 1251–1260.
doi:  10.1198/jasa.2009.tm08507
PMCID: PMC2860453
NIHMSID: NIHMS112179

An Incomplete-Data Quasi-likelihood Approach to Haplotype-Based Genetic Association Studies on Related Individuals

Zuoheng Wang, Ph.D. Candidate and Mary Sara McPeek, Professor

Abstract

We propose an incomplete-data, quasi-likelihood framework, for estimation and score tests, which accommodates both dependent and partially-observed data. The motivation comes from genetic association studies, where we address the problems of estimating haplotype frequencies and testing association between a disease and haplotypes of multiple tightly-linked genetic markers, using case-control samples containing related individuals. We consider a more general setting in which the complete data are dependent with marginal distributions following a generalized linear model. We form a vector Z whose elements are conditional expectations of the elements of the complete-data vector, given selected functions of the incomplete data. Assuming that the covariance matrix of Z is available, we form an optimal linear estimating function based on Z, which we solve by an iterative method. This approach addresses key difficulties in the haplotype frequency estimation and testing problems in related individuals: (1) dependence that is known but can be complicated; (2) data that are incomplete for structural reasons, as well as possibly missing, with different amounts of information for different observations; (3) the need for computational speed in order to analyze large numbers of markers; (4) a well-established null model, but an alternative model that is unknown and is problematic to fully specify in related individuals. For haplotype analysis, we give sufficient conditions for consistency and asymptotic normality of the estimator and asymptotic χ2 null distribution of the score test. We apply the method to test for association of haplotypes with alcoholism in the GAW 14 COGA data set.

Keywords: Estimating function, Association testing, Dependent data, Missing data, Generalized linear model, Score test

1 INTRODUCTION

Genetic association analysis is an important tool for searching for genetic variants that play a role in human complex disease. Typical approaches compare the frequency distribution of marker alleles, genotypes or haplotypes between cases and controls. A genetic marker is a segment of DNA sequence with an identifiable physical location, and it is said to be polymorphic if its sequence varies across the population. Each of the variant forms of a polymorphic marker is called an allele. For autosomal genetic markers, a human inherits two copies of the marker, one from each parent. An individual’s genotype at a marker is the observation of the two alleles of the individual at that marker. The kinship coefficient between a pair of related individuals is determined by their relationship and is the probability that, at any given marker, a randomly chosen pair of alleles, one from each individual, is identical by descent, i.e. is an inherited copy of the same founder allele. For example, the kinship coefficient for a parent-offspring or sibling pair is 1/4, while that for a grandparent-grandchild, avuncular, or half-sibling pair is 1/8. In genetic association analysis, the most widely-used type of marker is currently the single nucleotide polymorphism (SNP), which is a DNA sequence variation that occurs at a single nucleotide and generally has two alleles. When q markers are considered simultaneously, the ordered q-tuple of alleles inherited from the same parent defines a haplotype. SNP haplotypes are more informative for disease mapping than are single SNPs. By capturing the dependence structure among multiple markers, haplotypes potentially provide information on untyped variants as well as on interactions among tightly-linked, typed variants. However, standard genotyping methods do not provide the information of the individual’s two haplotypes. Only the unphased genotype (i.e. the observation of the two alleles at each marker, but not their parental origin) is generated. As a consequence, we are faced with an incomplete-data problem for haplotype analysis.

There is a vast literature on statistical methods for analysis of haplotype association with genetic traits in samples of unrelated individuals; a relatively recent review paper is Schaid (2004). For related individuals, the approaches can roughly be divided into two types: “case-control,” which is the type of approach we pursue, and “family-based.” Family-based haplotype association tests, such as FBAT (Horvath et al. 2004), condition on both trait values and parental genotypes. When some or all parental genotype data are missing, the analysis is conditional on sufficient statistics for parental genotypes. Generally, this conditioning approach has the advantage of making the analysis more robust to population structure and the disadvantage of losing power compared to case-control studies (Risch and Teng 1998). Family-based methods also typically require genotype data for family members of an affected individual. Thus they do not allow unrelated individuals in the analysis. In addition to being more powerful, case-control analyses are less restrictive than family-based approaches, because they can allow but do not require genotype data for relatives of affected individuals. For case-control studies that include related individuals, where general relationship types are allowed, methods for single-marker allele frequency estimation (McPeek, Wu, and Ober 2004) and for single-marker association testing (Slager and Schaid 2001; Bourgain et al. 2003; Thornton and McPeek 2007) have been proposed. Haplotype-based association analysis for case-control samples consisting of related individuals has been considered by Browning et al. (2005) and Zhang, Schneider, Ober, and McPeek (2005).

Even for single-marker genetic analysis, sampling of related individuals can introduce complicated dependence structure. In addition, current genetic association analyses typically consider on the order of 104 to 106 markers throughout the genome, putting a premium on computational speed of inference. Furthermore, although the genetic model of the disease is typically unknown, the first and second moment structure of genotype data, under the null hypothesis of no association and no linkage, is a known function of the pedigree information and the unknown allele frequency. For these reasons, quasi-likelihood estimation and quasi-likelihood score tests have proved useful for single-marker, case-control, genetic association studies with related individuals (Bourgain et al. 2003; McPeek et al. 2004; Thornton and McPeek 2007). Quasi-likelihood inference (reviewed in McCullagh and Nelder 1989, chap. 9; Heyde 1997) was introduced by Wedderburn (1974) and is closely connected to the theory of optimal estimating functions (Godambe 1960). To extend previous quasi-likelihood methods to haplotype-based case-control association testing in related individuals, we need to allow for both incomplete data and dependence of observations. Recent work on estimating equations in the context of incomplete data includes Heyde and Morton (1996) and Elashoff and Ryan (2004).

We propose an incomplete-data quasi-likelihood (IQL) framework to address both dependent and partially-observed data. Under a generalized linear model for the marginal distributions of the complete data, we form an optimal linear estimating function, which we call an incomplete-data quasi-likelihood score (IQLS) function, based on certain conditional expectations of the elements of the complete-data vector. In doing this, we take advantage of the availability of a correctly-specified covariance matrix in the genetics applications we consider. We provide detailed development of the method for the specific contexts of haplotype frequency estimation and testing for association of haplotypes with a genetic disease, and we apply it to test for association of haplotypes with alcoholism in the GAW 14 COGA data set.

2 METHODS

2.1 Incomplete-data Quasi-likelihood Score Function

Our genetics applications involve genotypes and haplotypes, which are modeled using dependent binomial random variables whose parameters depend on covariates. In this section, we develop our IQLS methods for a more general setting of dependent observations, in which the marginal distribution of each observation is a generalized linear model.

Suppose that the complete data consist of pairs (Yi,xi), i = 1, …, n, where the Yi’s are scalar responses and the xi’s are r × 1 covariate vectors that are treated as fixed in the analysis. Suppose the Yi’s are dependent with each Yi having marginal log-likelihood given by lYi(yi|xi, β) = yiθiai) + b(yi), where θi = ui), u is a known injective function, ηi=xiTβ, and β is an r × 1 parameter vector. The marginal score function with respect to the natural parameter θi is [partial differential]lYi/[partial differential]θi = yi − μi, where μi = EYi = ai). Furthermore E(2lYi/θi2)=a¨(θi)=Var(Yi)=Var(lYi/θi). Let Y = (Y1, …, Yn)T, μ = (μ1, …, μn)T, Σ = Cov(Y), and assume that Σ can be specified in terms of β.

We emphasize the following key features of the inference problems we consider: (1) The effects of interest can be described by a model for μ as a known function of the unknown parameter β. (2) The matrix Σ is also a known function of β. This latter point distinguishes our setting from, for example, that assumed in the GEE approach (Liang and Zeger 1986; Zeger and Liang 1986) in which Σ is generally unknown. The assumption that Σ is a known function of β is applicable to practical applications in genetics such as allele and haplotype frequency estimation and association mapping.

In our setting, the quasi-likelihood score function is

U(β)=DTΣ1(Yμ(β)).
(1)

Here D is the derivative matrix [partial differential]μ/[partial differential]β. Even when the marginal log-likelihood lYi is not correctly specified, provided that μ and Σ are the correctly-specified moments of Y, the quasi-likelihood score function is optimal in the class of linear unbiased estimating functions {HT(β)(Yμ(β)) : H(β) is an n × r matrix}, in the sense that it has maximum information, where the information of an unbiased estimating function G is defined as Λ(G) = E([partial differential]G/[partial differential]β)T[E(GGT)]−1 E([partial differential]G/[partial differential]β), and the maximum is taken in the partial order of nonnegative definite matrices. In our setting, Λ(U) = DTΣ−1D. A quasi-likelihood estimator [beta] is obtained by solving U(β) = 0. Under regularity conditions, [beta] is consistent and asymptotically efficient.

We are particularly interested in the case when the complete data Y are partially observed. In that case, let I be the observed data (incomplete data). In some contexts, it might be natural to express I as I = (I1, …, In), where Ii is a deterministic function of Yi, but this is not a requirement. Indeed, we need not even require that I be a deterministic function of Y at all. For instance, we could allow I to consist of noisy observations on functions of Y.

We use the following approach to form incomplete-data estimating functions. We define Z = (Z1, …, Zn)T by Zi = E(Yi|fi(I)), where fi(I) is some function of I. As a special case, we could have fi(I) = I for all 1 ≤ in, leading to Z = E(Y|I). In the general case, we note that E(Z) = μ, and we define Ω = Cov(Z) and F = −E([partial differential](Zμ)/[partial differential]β). We then form an IQLS function,

UI=FTΩ1(Zμ),
(2)

where Ω−1 is replaced by the Moore-Penrose generalized inverse if Ω is singular. An IQL estimator for β is a solution of UI(β) = 0. We can typically obtain a solution [beta] by an iterative algorithm that is analogous to Newton-Raphson with Fisher scoring (Wedderburn 1974). Starting at β(0) sufficiently close to [beta], we iterate according to

β(t+1)=β(t)+[(FTΩ1F)1FTΩ1(Zμ)]β=β(t),
(3)

where β(t) is the parameter estimate at the tth iteration.

The IQLS function has information

Λ(UI)=Cov(UI)=E(UI/β)=FTΩ1F
(4)

and is the optimal estimating function within the class of linear, unbiased, estimating functions HI = {HT(β)(Zμ) : H is an n × r matrix}. The optimality depends on correct specification of the first and second moments of Z and does not require correct specification of other aspects of the distribution. Yuan and Jennrich (1998) give conditions for existence, consistency, and asymptotic normality of estimators from a wide class of estimating functions that includes Equation (2). The conditions of Xie and Yang (2003) for GEE estimators can also be adapted to IQL estimators. For the haplotype estimation and association testing problems in section 3, we directly verify the conditions of Yuan and Jennrich (1998).

Consider the special case when I can be expressed as I = (I1, …, In), where Ii = gi(Yi), with gi a deterministic, measurable function not depending on β. Then the following three properties follow (e.g. from Theorem 1.5.8 of Lehmann and Casella 1998), where lIi is the log-likelihood of Ii: (1) [partial differential]lIi/[partial differential]θi = E([partial differential]lYi/[partial differential]θi| Ii), (2) E([partial differential]lIi/[partial differential]θi) = 0, and (3) −E([partial differential]2lIi/[partial differential]2θi) = Var([partial differential]lIi/[partial differential]θi). Therefore, if we choose fi(I) = Ii, then Zi−μi = [partial differential]lIi/[partial differential]θi, and Zμ can be viewed as the vector of incomplete-data, marginal score functions. In this special case, UI arises naturally as the optimal, linear, un-biased, estimating function based on the vector of incomplete-data, marginal score functions, while the complete-data estimating function U was the optimal, linear, unbiased, estimating function based on the vector of complete-data, marginal score functions. If, in addition, we restrict Yi, 1 ≤ in to be independent, then UI is the incomplete-data, likelihood score function, and the IQL estimators are the maximum likelihood estimators (MLEs).

One would expect that the optimal choice of fi(I) in the IQLS function would be fi(I) = I, 1 ≤ in. However, in some problems (for example, in many instances of the haplotype inference problems we consider), E(Y|I) is not computationally practical to obtain, so the IQLS function with fi(I) = I is not practically useful in such problems. In that case, we ideally want to choose fi so that (1) E(Yi|I) is well-approximated by E(Yi|fi(I)) and (2) computation of Zi = E(Yi|fi(I)) and the resulting Ω and F are not onerous.

2.2 Comparison with Previous Approaches

Heyde and Morton (1996) consider the problem of quasi-likelihood estimation with incomplete data. Suppose that Q is the complete-data quasi-score within some convex class HY of unbiased, square-integrable, complete-data, estimating equations. Given a convex class HI of unbiased, square-integrable, incomplete-data, estimating equations, where HI is a linear subspace of HY, Heyde and Morton (1996) show that the quasi-score within HI is the projection of Q into HI. However, dependence across observations would often lead to HI [not subset] HY (e.g. in the haplotype analysis setting we consider), in which case the projection result does not hold.

Elashoff and Ryan (2004) give an approach to constructing an estimating function for incomplete data based on an estimating function for complete data, with a practical computational method for obtaining the estimate. Applying the Elashoff-Ryan approach to our setting would yield estimating function UER = E(U|I) = DTΣ−1[E(Y|I)−μ], with information Λ(UER) = FTΣ−1D(DTΣ−1ΩΣ−1D)−1DTΣ−1F. One drawback of applying the Elashoff-Ryan approach to the haplotype inference problems we consider is that it is not always practical to obtain E(Y|I). Another consideration is that UER is, in general, sub-optimal in the class of linear unbiased estimating functions {HT(β)[E(Y|I)−μ(β)] : H(β) is an n × r matrix}. In our setting, when it is practical to obtain E(Y|I), the IQLS function will be optimal in that class. Thus, for example, if [beta] is the IQL estimator and beta solves UER = 0, then the relative efficiency of cT [beta] compared to the cT beta (or equivalently, the ratio of non-centrality parameters for testing in the direction specified by c), is [cTΛ(UI)c]/[cTΛ(UER)c] ≥ 1, for every choice of real vector c of length r, where Λ(UI) is given by Equation (4).

2.3 Hypothesis Testing

We also consider the problem of hypothesis testing in the case when the unknown parameters β can be partitioned as β = (γT,αT)T, where γ are the parameters of interest and α are nuisance parameters. Suppose we want to test the null hypothesis that γ = γ0. We partition the vector UI into UIγ=FγTΩ1(Zμ) and UIα=FαTΩ1(Zμ), where Fγ = −E([partial differential](Zμ)/[partial differential]γ) and Fα = −E([partial differential](Zμ)/[partial differential]α). Under the null hypothesis, the nuisance parameter α is estimated by [alpha]0 which solves UIα = 0 when γ = γ0. Consider the information matrix Λ(UI) of Equation (4), which we assume is positive definite. We can think of Λ as being composed of four blocks: Λγγ which is the (γ,γ)th diagonal block of Λ, Λαα which is the (α,α)th diagonal block of Λ, and two off-diagonal blocks: Λγα and Λαγ, which are transposes of each other. Similarly, we partition Λ−1 into Λγγ, Λαα, and Λγα, Λαγ. Then, in analogue to the classical likelihood-based score statistic in the presence of nuisance parameters, the IQLS statistic has the following form

W=UIγT(γ0,α^0)Λγγ(γ0,α^0)UIγ(γ0,α^0),
(5)

where Λγγ can be expressed as Λγγ=(ΛγγΛγαΛαα1Λαγ)1. More explicitly, we obtain the following equivalent form of the IQLS test statistic

W={(Zμ)TΩ1Fγ[FγTΩ1FγFγTΩ1Fα(FαTΩ1Fα)1FαTΩ1Fγ]1FγTΩ1(Zμ)}(γ,α)=(γ0,α^0),

where the entire right-hand side is evaluated at (γ, α) = (γ0, [alpha]0. Under regularity conditions, the asymptotic null distribution of W is χ2 with degrees of freedom equal to the dimension of γ. (For the haplotype association testing problem, sufficient conditions are given in the Appendix.) Note that in order to obtain W, it is not necessary to have a fully specified joint alternative distribution. It is sufficient to provide μ under the alternative and ([partial differential]Z/[partial differential]β)(γ,α)=(γ0,[alpha]0), which, in the haplotype testing application, can be derived under relatively weak assumptions on the alternative model.

3 APPLICATION TO HAPLOTYPE ANALYSIS

We apply the IQLS method to the problems of haplotype frequency estimation (sub- section 3.1) and testing for association between haplotypes and a binary genetic trait (subsection 3.2) in samples that contain related individuals. The following general set-up applies to both problems.

Suppose we have a sample of m outbred individuals, some of whom may be related with relationships specified by known pedigrees. (We assume a random sampling scheme under which the event that any particular set of individuals in the population is sampled is conditionally independent of their haplotypes, given the pedigree information and covariate values x.) Consider a set of q ordered, tightly-linked SNPs, where each SNP is a binary marker with alleles labeled 0 and 1. (Here, tightly-linked means that recombination among the SNPs is observed only over large time scales of many generations and is not observed within families.) Let H = {h1, …, ha+1} [subset or is implied by] {0, 1}q be the set of a + 1 distinct possible observed haplotypes for the q SNPs. In practice, we can have a + 1 < 2q when certain haplotypes do not occur in nature. For simplicity, we assume that H is known. The complete-data vector of length n = ma is taken to be Y = (Y11, …, Y1a, …, Ym1, …, Yma)T, where Yij = .5 × (the number of copies of haplotype hj held by individual i), for 1 ≤ im, 1 ≤ ja + 1. Under the assumption of Hardy-Weinberg equilibrium, the marginal model for Yij is .5 × Bin(2,pij), where pij>0,j=1a+1pij=1. The resulting marginal log-likelihood in terms of the natural parameter, θij = 2logit(pij), is lYij (y) = yθij − 2 log (1 + exp(θij/2)) plus a term not depending on θij, and we use the canonical link function, so that θij=2xijTβ.

For individual i, 1 ≤ im, let Hi = (Hi1, Hi2) denote i’s ordered pair of haplotypes. Instead of observing Hi for each i, we observe only the q-vector of SNP genotypes Gi = Hi1+Hi2, provided i’s genotype is not missing at any marker. If i’s genotype is missing at one or more markers, then the corresponding entries of the vector Gi are not directly observed. Let M be the mq-vector M = (M11, …, M1q, …, Mm1, …, Mmq), where Mij is the indicator of missingness of i’s genotype at the jth SNP. We assume that genotype data are missing at random, i.e. conditional on the pedigree information and on x, the vector M is independent of Y, and we condition the analyses on M, x, and the pedigree information. We denote the observed data by I = (I1, …, Im), where Ii is the information of the values of the nonmissing entries of Gi.

3.1 Haplotype Frequency Estimation

We first consider the problem of estimating the haplotype frequency distribution p = (p1, …, pa)T for the population from which the m sampled individuals are drawn, where we assume P(Hik = hj) = pj, for i = 1, …, m, k = 1,2, j = 1, …, a + 1 and pa+1=1i=1api. We take xij = ej, the vector of length a having jth entry equal to 1 and all other entries 0, for 1 ≤ im, 1 ≤ ja. Then θij/2 = logit(pj) = βj, so βj is a one-to-one function of pj for 1 ≤ ja. We have μ = 1m [multiply sign in circle] p, where 1m is an m-vector of 1’s, and [multiply sign in circle] is Kronecker product. Mendelian inheritance in the pedigree implies that Σ = K [multiply sign in circle] B, where Km× m, which is the correlation matrix of the m-vector (Y1j, …, Ymj)T for each 1 ≤ ja, is given by Ki,j = 2ϕij, where ϕij is the kinship coefficient between individuals i and j, with 2ϕii = 1. (We will also refer to K as the kinship matrix.) Here Ba×a, which is the covariance matrix of the a-vector Yi = (Yi1, …, Yia)T for each 1 ≤ im, is given by Bij = pi(1 − pi)/2 if i = j and −pipj/2 if ij. We have D = 1m [multiply sign in circle] ([partial differential]p/[partial differential]β), where [partial differential]p/[partial differential]β is the a × a diagonal matrix with jth diagonal element pj(1 − pj). The complete-data quasi-likelihood score function is U(β) = ([partial differential]p/[partial differential]β)U(p), where U(p) is the quasi-likelihood score function in terms of p (used by McPeek et al. 2004) which is given by U(p)=i=1mwiB1(Yip), where wi=(1mTK1)i. U(p) and U(β) are equivalent in that they produce the same estimators of p. Here K is invertible provided the sample does not contain MZ twins (Thornton and McPeek 2007), and the Moore-Penrose generalized inverse is used when the sample does contain MZ twins. For the case when complete haplotype data are available, the quasi-likelihood estimator, [p with hat] which solves U(p) = 0, is given explicitly by McPeek et al. (2004).

With incomplete data, we form Z = (Z11, …, Z1a, …, Zm1, …,Zma)T, where Zij = E[Yij|fij(I)], for some functions fij, 1 ≤ im, 1 ≤ ja. In the special case of unrelated individuals, provided fij(I) contains at least the information of Ii, the IQL haplotype frequency estimator is the same as the MLE, and, when the iterative algorithm of Equation (3) is applied to solve UI(p) = 0, it is the same as the EM algorithm. For large complex pedigrees, the optimal choice, fij(I) = I, is not practical from a computational point of view. We consider two other choices: (a) fij(I) = Ii, the non-missing SNP genotypes of individual i, and (b) fij(I) = (Ii, Im(i), If(i)), the non-missing SNP genotypes of individuals i,m(i), and f(i), where m(i) and f(i) are the mother and father, respectively, of i. We expect choice (b) to improve efficiency of estimation over choice (a), and this is explored in the simulations of section 4. In the Appendix, we give sufficient conditions for existence, consistency, and asymptotic normality of the IQL haplotype frequency estimator.

3.2 Case-Control Association Testing: MIQLS

We consider the problem of testing for association between a binary trait, which we call the phenotype (typically a disease with some genetic component), and haplotypes for a set of tightly-linked SNPs. We use the IQLS approach to extend the single-marker, complete-data MQLS test of Thornton and McPeek (2007) to the MIQLS test based on haplotypes with incomplete data. We allow for missing phenotypes and missing genotypes. Phenotype is treated as a covariate, and the analysis is conditioned on it. Let A be the m-vector of phenotype information, with Ai = 1 − k if i is affected, −k if i is unaffected, and 0 if i’s phenotype is missing, where k is an estimate of prevalence of the disease in the relevant reference population. A can be thought of as the mean-corrected phenotype indicator with adjustment for retrospective sampling. The resulting hypothesis test will be valid for any fixed 0 < k < 1.

For association studies of complex traits in samples of related individuals, there is an “enrichment effect,” namely, that affected individuals with affected relatives have a higher expected frequency of the alleles that increase susceptibility for a complex disease than do affected individuals who do not have affected relatives. We allow for this enrichment effect by including à = KA instead of A as a covariate in our generalized linear model, where K is the kinship matrix. We take Yij ~ .5 × Bin(2,pij) as before, where we now let

logit(pij)=αj+A˜iγj=αj+γjk=1m2ϕikAk.
(6)

The 2a × 1 parameter vector is β = (γT, αT)T where γ = (γ1, …, γa)T is the parameter of interest, which measures trait-haplotype association, and α = (α1, …, αa)T is a nuisance parameter whose entries are the logits of the haplotype frequencies unconditional on phenotype. The form of Equation (6) is justified by an asymptotic argument applied to a single-gene model with effect size tending to zero (see Thornton and McPeek 2007). Note that the definition of à can be easily extended to allow for contributions from relatives who are phenotyped but not genotyped, by extending the summation over k in Equation (6) to include these additional relatives. The null hypothesis of no trait-haplotype association is H0 : γ = 0, while the alternative is γ0. Under H0, assuming no linkage, the first and second moment structure for Y is the same as in the previous subsection. Applying Equation (5) to the special case of complete data (Z = Y), we obtain a quasi-likelihood score statistic that is equal to the MQLS test statistic of Thornton and McPeek (2007).

To form the MIQLS test statistic with incomplete data, we consider the same two choices of fij as in the previous subsection. If we view Zij(γ,α) as a function of (γ, α), then the calculation of MIQLS requires Zij(0,[alpha]0), which is the same as Zij([p with hat]) in the previous subsection. We require Var(Z) only under the null hypothesis, which equals the Ω that would be obtained for the estimation problem of the previous subsection. We keep μ the same as in the complete data case (specified by inverting the logit of Equation (6)). To obtain F(0,[alpha]0) = −E([partial differential](Zμ)/[partial differential]β)(0,[alpha]0), we also need an expression for E([partial differential]Z/[partial differential]β)(0,[alpha]0). For any given alternative model, consider a power series expansion for Zij(γ, α) = E(Yij|fij(I)) around γ = 0. In fact, we only need the first-order term of the series to obtain the derivative at γ = 0. It turns out that this first-order term, which depends on Ã, will be the same for a single-gene multiplicative model and a single-gene additive model, and that is what we use to obtain E([partial differential]Z/[partial differential]β)(0,[alpha]0). The MIQLS test is the resulting score test for the null hypothesis γ = 0, obtained from Equation (5). Note that to obtain an asymptotic χa2 null distribution for MIQLS, it is sufficient to have correct specification of μ, Ω, and F under the null hypothesis (under conditions given in the Appendix).

When expected counts of some haplotypes are small, one can usually increase power by pooling haplotypes to reduce the degrees of freedom, a. In the analysis of 5-SNP haplotypes in section 5, we combine haplotypes by a variation on the “clustering” method of Tzeng, Devlin, Wasserman, and Roeder (2003). This involves grouping rare haplotypes with more common ones that differ at a single site until every category has estimated frequency ≥ .05, while rare haplotypes that differ at two or more sites from every other sampled haplotype are pooled to form a new category.

3.3 Connection between MIQLS and Retrospective Likelihood

To clarify the relationship of IQL to likelihood-based haplotype methods (e.g. Lin and Zeng (2006), developed for unrelated individuals), we describe a close connection between MIQLS and the score test based on a retrospective likelihood. Assume without loss of generality that the first ρm of the m ordered individuals have known phenotype, and for these individuals, let Xi = 1 if i is affected, and Xi = 0 if i is unaffected. Here we treat Y and A as m-vectors, as before, while X is a ρ-vector which is a function of A. We condition the analysis on the pattern of missing phenotype information among the sampled individuals. Consider the following prospective logistic regression model for phenotype given complete haplotype data:

P(A|Y)=P(X|Y)=i=1ρe(β0+πi)Xi1+e(β0+πi),
(7)

where πi=j=1arjYij, with the parameter r = (r1, …, ra)T representing association of the haplotypes with the phenotype, and with β0 = logit(k) representing logit of disease prevalence under the null hypothesis. Model (7) is not appropriate for complex traits in related individuals, because under the null hypothesis of no association, H0 : r = 0, the phenotypes of related individuals are modeled as independent, which is unrealistic. Therefore, we generalize Equation (7) to the following larger class of models, which allows dependence of the phenotypes under the null hypothesis:

P(A|Y)=P(X|Y)=P0(X)c(Y,r)i=1ρeπiXi,
(8)

where c(Y,r)=[x{0,1}ρP0(x)i=1ρexp(πixi)]1 is the normalizing constant and P0(X) is the joint distribution of X when r = 0. Here P0(X) allows for dependence among the Xi’s and need not be specified. We assume that P0(X) has the property that all the marginals are the same, i.e. that P0(Xi = 1) = ∑x[set membership]{0,1}m P0(x)xi = k for all 1 ≤ im, where this is the same prevalence, k, that we use in the definition of the phenotype vector A. Equation (8) represents exponential tilting of P0(X). Equation (7) is obtained as a special case of Equation (8) by choosing P0(X)=i=1ρexp(β0Xi)/[1+exp(β0)]. P(Y) is determined by Mendelian inheritance in pedigrees and the assumption of Hardy-Weinberg equilibrium in the general population. Combining P(Y) with Equation (8), we can obtain the retrospective likelihood for complete haplotype data, LY = P(Y|A). Under assumptions on missing genotype data stated in the beginning of section 3, the retrospective likelihood for incomplete haplotype data is modeled by

LI=P(I|A)=YP(Y|A)P(I|Y)=YIP(Y|A)=YILY,
(9)

where the second and third summations are over all complete-data vectors Y that are compatible with the observed data I. As a consequence, the incomplete haplotype data retrospective likelihood score function, lI satisfies

l.I=E(l.Y|I),
(10)

where lY is the complete haplotype data retrospective likelihood score function. Using the above observations, it is easy to verify that when the nuisance parameter, p, is known and with k chosen as above, the retrospective likelihood score test for every model in the class given by Equation (8), with related individuals and complete or incomplete data, is the same as the MIQLS with choice fij(I) = I. However, the nuisance parameter is generally not known, and in the likelihood score test, p is estimated by the MLE under the null hypothesis, while in the MIQLS, it is estimated by the computationally-simpler IQL estimator under the null hypothesis. The different estimators of the nuisance parameter, p, cause the variance portions of the two test statistics to differ as well because they account for estimation of p. For unrelated individuals, the MLE and IQL estimators of p are the same, and the retrospective likelihood score test is identical to the MIQLS, under the weak assumption that fij(I) contains at least the information of Ii for each 1 ≤ im.

3.4 Estimators and Tests to be Compared by Simulation

In section 4, we perform simulation studies to assess our IQL haplotype frequency estimators and MIQLS disease-haplotype association tests, and we compare them to a few other tests and estimators, which are described in this subsection.

In our IQLS haplotype tests and estimators, calculation of the covariance matrix Ω is one of the more challenging aspects. Particularly in the estimation context, it is natural to wonder how much is gained by accounting for relatedness of individuals in the calculation of Ω. Therefore, we also consider what we call the equal-weight (EW) estimating function, UEW. Let Zi = (Zi1, …, Zia)T be the portion of Z corresponding to individual i, μi = E(Zi), Ωi = Cov(Zi), and Fi = −E([partial differential](Ziμi)/[partial differential]β) for 1 ≤ im. Then UEW=i=1mFiTΩi1(Ziμi), which can be viewed as Equation (2) with Ω replaced by the block-diagonal matrix Ωdiag = diag(Ω1, …, Ωm). In the special case of unrelated individuals, UEW = UI, and the EW estimator coincides with the IQL estimator. When relatives are sampled, if we choose fij(I) = Ii (choice (a) of subsection 3.1), the EW estimator is the estimator obtained by applying the MLE for unrelated individuals to a sample of related individuals. We also use UEW to obtain a score test for γ = 0 by Equation (5). (Note that Λ(UEW), which is needed in Equation (5), does involve the off-diagonal blocks of Ω.)

Because of its close connection to prior work (Elashoff and Ryan 2004; Browning et al. 2005; Zhang et al. 2005), the estimating function, UCDW, which we call the complete-data-weight (CDW) estimating function, is also considered. We define UCDW=i=1mwiFiTΩi1(Ziμi), where wi=(1mTK1)i. In the special case of complete data, we have UCDW = U, where U is the complete-data, quasi-likelihood score function given in subsection 3.1. The CDW estimator is closely related to the estimator of Zhang et al. (2005), who applied essentially the same estimating function to a somewhat different context of control haplotype-frequency estimation based on the non-transmitted haplotypes of parent/affected-child trios, under the further constraint that the haplotype frequencies follow a Markov model. Furthermore, UCDW is equivalent to the estimating function, UER, of Elashoff and Ryan (2004) in the special case when both of the following hold: (1) fij(I) = I for 1 ≤ im, 1 ≤ ja and (2) FiTΩi1 is constant in 1 ≤ im and of full rank.

For complete-data, case-control association testing with related individuals, Bourgain et al. (2003) formed a quasi-likelihood score test statistic (WQLS) using Equation (5), but instead of modeling the mean by Equation (6), they used the simpler model

logit(pij)=αj+Xiγj,
(11)

where Xi = 1 if i is affected and 0 if i is unaffected, and individuals with missing phenotype are dropped from the analysis. The model of Equation (11) ignores the enrichment effect described in subsection 3.2. For any of the estimating functions, UI, UEW, or UCDW, when we form a score test for γ = 0 by Equation (5), we can use either Equation (6) or Equation (11) for the mean model. The score test constructed from UCDW with the mean model of Equation (11) is the score-test version of the composite-likelihood ratio test of Browning et al. (2005) for case-control haplotype association in related individuals.

4 SIMULATION STUDIES

To assess MSE of the IQL haplotype frequency estimators and type I error and power of the IQLS haplotype association tests, we consider three study designs, which are similar to those of Thornton and McPeek (2007). In the first, 60 outbred, three-generation, 16-person pedigrees are sampled, where 20 of the pedigrees have 4 affected individuals, 20 have 5, and 20 have 6. The pattern of affected and unaffected individuals varies randomly according to the trait models described in the next paragraph. In each sampled pedigree, phenotypes for all 16 individuals are observed. An individual’s genotypes are observed if and only if at least 30% of the individual’s siblings, parents, and offspring are affected. The second study design is similar to the first with two differences: (1) an additional 200 unrelated, unaffected controls are included in the study; and (2) for each individual in a sampled pedigree, the individual’s genotypes are observed if and only if at least half of the individual’s siblings, parents, and offspring are affected. The third study design consists of 120 pedigrees of the above structure, where exactly 40 have 1 affected individual, 40 have 2, and 40 have 3. Phenotypes are observed for all individuals, and genotypes are observed for all affected individuals plus two unaffected individuals randomly chosen in each pedigree. To assess the impact of parental genotype information, we further simulate parental genotypes for all genotyped individuals and compare the results with and without this additional parental genotype information.

We simulate under multilocus trait models I and III of Thornton and McPeek (2007), where we replace their SNP 2 with a trio, T2, of tightly-linked SNPs for which there are 6 possible haplotypes. We replace alleles 1 and 0 of their SNP 2 by the sets of haplotypes {101, 110} and {000, 001, 010, 100}, respectively, for T2. Table 1 gives haplotype frequencies for T2 with resulting population prevalence, k, and sibling risk ratio, λs = ks/k, where ks is the prevalence conditioned on having an affected sibling.

Table 1
Parameter settings for simulation models

4.1 MSE Comparison for Haplotype Frequency Estimation

In Table 2, we compare 6 incomplete-data, haplotype-frequency estimators from section 3, where the subscript a or b denotes the choice for fij(I). We generate 5,000 data sets under design 1 with model Ia. Haplotype frequencies are estimated at a trio of tightly-linked SNPs that are unlinked and unassociated with any causal variants. For each setting, we use the method of Nicolae (2006) to quantify the extent to which genotype data are informative about the given haplotype. Specifically, we measure the proportion of an individual’s haplotype-j information that is provided by his genotype as mG = Var(Zij)/[pj(1 − pj)/2], where Zij is calculated with fij(I) = Gi. The proportion of haplotype-j information on an individual obtained when the genotype data of both the individual and his parents are available is mGtrio, which equals the above ratio calculated with fij(I) = Gtrio = (Gi, Gm(i),Gf(i)).

Table 2
Ratios of MSE for estimation of haplotype frequency p, based on 5,000 simulated replicates, where, for each estimator, the ratio of its MSE to the MSE of IQLb is reported

IQLb, which incorporates parental genotypes and uses optimal weights, has the smallest MSE in all cases. CDWb performs almost as well, because Gtrio provides close to complete haplotype information (mGtrio close to 1). The use of optimal weights provides a more marked improvement over the use of complete-data weights when the amount of information is lower (IQLa vs. CDWa for which the information is mG). The EW estimators perform worse than the CDW and IQL estimators under both choices (a) and (b).

4.2 Power and Type I error of Haplotype Association Tests

We compare score tests based on UI, UEW, and UCDW, where we denote the resulting tests by IQLS, EWS, and CDWS, respectively. We consider each of these with options (a) and (b) for fij, where these are denoted by subscripts a and b, respectively, and we consider two alternative mean models, given by Equation (6) and Equation (11), which we denote by M and W, respectively. Thus, e.g., WIQLSb denotes the score test based on UI with choice (b) for fij and the mean model of Equation (11).

To further investigate the effect of the choice of fij on the power of the IQLS tests, in addition to choices (a) and (b) described above, we would ideally like to compare the choice fij(I) = I. Because this choice presents computational difficulties, we instead consider choice (c), which gives an upper bound on the power when fij(I) = I. For choice (c), if individual i’s typed relatives are a subset of {i, f(i),m(i)}, then we take fij(I) = I, and if individual i has typed relatives other than parents, we use complete haplotype data for individual i.

To assess type I error, association is tested with haplotypes at a trio of tightly-linked SNPs that are unlinked and unassociated with any causal variant, where individuals are sampled according to design 1 with model Ia. Based on 5,000 replicates, we find that for all 14 tests, type I error is not significantly different from the nominal when α = .05 or .01 (results not shown).

In Table 3, we compare power to detect trait-haplotype association at the trio T2 of tightly-linked SNPs that are associated with the trait, under designs 2 and 3. In every setting, the MIQLS tests perform best, with MIQLSc and MIQLSb having, respectively, the highest and second-highest power among all tests for all simulation settings, and with MIQLSa having highest power among the tests in which only the individual’s genotype data are conditioned on (choice (a)). This suggests that with related individuals, all three of the following are important for power: optimal weighting of individuals, accounting for the enrichment effect, and incorporating parental genotype information. The comparison of choices (a), (b), and (c) for fij in the IQLS tests indicates that little power is lost when fij(I) = (Ii, Im(i), If(i)) is used compared to the more computationally intensive or infeasible approach of using fij(I) = I. The power difference between these is no more than .05 across all simulation settings. This is not surprising, because adding parental genotype information allows one to capture most of the haplotype information for an individual, as reflected, for example, in the high values of the measure of information, mGtrio in Table 2.

Table 3
Empirical power results, based on 5,000 simulated replicates, α = .05

Interestingly, although the CDW approach performs substantially better than the EW approach for estimation, it performs noticeably worse for association testing, probably because the EW approach tends to have a built-in enrichment effect in these study designs, i.e. it effectively gives extra weight to affected individuals with affected relatives in the sample. For the IQLS and CDWS tests, use of the mean model of Equation (6) always provided at least as much power, and usually more, than did use of the mean model of Equation (11), illustrating the enrichment effect. For the EWS test, because there is already a built-in enrichment effect, one can often have worse power with Equation (6) than with Equation (11).

5 DATA ANALYSIS

We apply our IQLS score tests to study association of haplotypes with alcoholism in the GAW 14 COGA data set described in detail in Edenberg et al. (2005). Of the 1,614 individuals from 143 families, we include only those coded as “white, non-Hispanic.” After exclusion of two individuals with apparent relationship errors and five who appear to have mixed or misclassified ethnicity, 1,067 individuals remain. We classify as affected those who are “affected with ALDX1” or who have “symptoms of ALDX1.” We classify as unaffected those coded as “pure unaffected,” and we classify as missing phenotype those coded as “never drank alcohol.” By these criteria, there are 855 affected, 199 unaffected and 13 missing phenotype. Among these 1,067 individuals, there are 1,023 individuals with available SNP data. We analyze the 10,407 autosomal SNPs in the data set having minor allele frequency > .01.

We test for association with haplotypes based every adjacent pair of SNPs, using the MIQLS with k = .05. Table 4 lists three pairs of SNPs for which the MIQLS test is significant at the 5% level after Bonferroni correction. In each of these examples, the detected association is driven by a rare two-SNP haplotype that occurs only in individuals with large values of Ã. Association is not detected for any of the SNPs marginally, because the individual SNPs do not have rare alleles, and their alleles are not strongly associated with Ã. This could occur, for example, if a predisposing allele at an untyped variant were in linkage disequilibrium with the haplotype.

Table 4
Results for the COGA data based on 2-SNP haplotypes

We follow up the 2-SNP haplotype analysis with a 5-SNP haplotype analysis, based on a sliding window, across all SNPs on each of the three chromosomes on which we detected association. For each 5-SNP window, we reduce the degrees of freedom of the test by clustering haplotypes as described in subsection 3.2. In addition, we compute p-values using a wide range of values for the estimated population prevalence, k. Table 5 gives the three windows for which the resulting MIQLS association test has p-value < 10−6. The effect of k on the p-value is slight, suggesting that the results are quite robust to choice of k. If we instead use a p-value threshold of 10−5, we gain two additional significant windows, one that overlaps by 4 SNPs with the third window in Table 5 and one that overlaps by 4 SNPs with the second window in Table 5, so that the same 3 regions are still identified. Two of the regions identified by 2-SNP haplotype analysis in Table 4 are identical to regions identified by 5-SNP haplotype analysis in Table 5, with comparable p-values between the two analyses. A region on chromosome 10 was detected by 2-SNP but not 5-SNP haplotype analysis. This is a result of low linkage disequilibrium among typed SNPs in the region: r2 between SNPs tsc0041916 and tsc0048856 is .98, while r2 of either of these two with other typed SNPs is less than .16. Thus, the 3 additional SNPs, in a 5-SNP haplotype containing tsc0041916 and tsc0048856, mainly introduce noise to the analysis.

Table 5
Results for the COGA data based on 5-SNP haplotypes

The first set of five SNPs in Table 5 is in the intron of a candidate tumor suppressor gene that encodes protein tyrosine phosphatase receptor type G (PTPRG [Affymetrix; Entrez Gene]), located at 3p21-p14. The second set of five SNPs in Table 5 is 251 kb from the gene encoding vestigial like 3 (VGLL3 [Affymetrix; Entrez Gene]), located at 3p12.1. The set of five SNPs on chromosome 2 is in the intron of the gene encoding solute carrier family 8 member 1 (SLC8A1 [Affymetrix; Entrez Gene]), located at 2p23-p22. The two SNPs on chromosome 10 are 178 kb from the gene encoding sortilin-related VPS10 domain containing receptor 1 (SORCS1 [Affymetrix; Entrez Gene]), located at 10q23-q25. To our knowledge, none of these genes are obvious candidates for alcoholism.

6 DISCUSSION

Genetic association analysis is widely used to identify genetic variants that underlie human complex disease. Haplotypes are more informative for disease mapping than are single markers, because haplotypes can provide additional information on untyped variants as well as on interactions among tightly-linked, typed variants. However, we generally have only partial haplotype information on each individual. Dependence among observations in an association study, due to relatedness of sampled individuals, commonly arises when families sampled for a linkage study are included in an association study. We introduce the IQLS method for assessment of disease association with haplotypes in case-control studies in which some individuals may be related. The method is applicable to completely general samples of related and unrelated individuals, including large complex pedigrees as well as samples that combine small families with unrelated individuals. The IQLS method effectively addresses the problems of dependent and partially informative observations in a way that is computationally practical and that does not require specification of a full joint distribution.

In simulation studies, the MIQLS tests perform well for a range of complex disease models. We consider the impact of each of the following features of our MIQLS haplo-type association test: (1) incorporation of the “enrichment effect” that is characteristic of complex genetic models; (2) use of optimal weights (as opposed to complete-data or equal weights); and (3) use of parental genotype data to obtain more information about an individual’s haplotype. All three of these features seem to have a noticeable impact on power. In an analysis of the GAW 14 COGA data set, we detect significant association of haplotypes with alcoholism in a case-control sample consisting of families, where these associations are not apparent from single-SNP analysis.

The IQLS method for parameter estimation and hypothesis testing is presented in a somewhat more general setting in which the complete data are dependent with marginal distributions following a generalized linear model. When the first and second moment structure can be correctly specified, the quasi-likelihood score method provides a way to draw inference without specifying the full joint distribution. This is advantageous when the MLE is very difficult to calculate or when there is in-sufficient information to construct a likelihood function. With incomplete data, we form an IQLS function, which is a quasi-likelihood score function based on a vector Z whose elements are conditional expectations of the elements of the complete-data vector, given selected functions of the incomplete data. We use an iterative algorithm to obtain the IQL parameter estimates. The computational complexity of the method will depend on the choice of conditioning events. There is wide latitude in this choice, provided that the covariance matrix and expected derivative of the conditional expectations are correctly specified. (For the testing problem, these need only be evaluated under the null hypothesis.) An estimated variance-covariance matrix for the IQL estimator, appropriately adjusted for incomplete data, is obtained as part of the iterative algorithm. Under regularity conditions, the IQL estimator is consistent and asymptotically normal with minimum asymptotic variance among estimators that are solutions to linear unbiased estimating equations based on Z.

Acknowledgments

This study was supported by National Institutes of Health grants R01HG001645 and U01HL084715. Data were provided by the Collaborative Study on the Genetics of Alcoholism (U10AA008401) through the Genetics Analysis Workshop (R01GM031575). The authors thank the editor, an associate editor and two anonymous referees for helpful comments.

APPENDIX

For haplotype analysis, we give sufficient conditions for consistency and asymptotic normality of the IQL estimators and asymptotic χ2 null distribution of the IQLS score tests. Suppose that the sampled individuals form b independent families, where we let b → ∞. (Note that a single individual unrelated to anyone else in the sample is considered a special case of a family.) Define a family configuration to be a particular constellation of typed relatives with a particular pattern of missing genotype information at the given tightly-linked SNPs. Assume that there are only a finite number of possible configurations allowed, indexed by c = 1, …, C. Each configuration has an associated information Λc for a family having that configuration. Assume that there exist pc ≥ 0, for 1 ≤ cC with Σc pc = 1 such that P{b1i=1b1{familyiof typec}pc}=1. We require the existence of at least one c such that pc > 0 and Λc is positive definite. Under these assumptions, it is straightforward to verify conditions 4 and 5 of Yuan and Jennrich (1998) and the Theorem of Yuan (1997) which implies condition 2 of Yuan and Jennrich (1998).

Contributor Information

Zuoheng Wang, Department of Statistics, University of Chicago, Chicago, IL 60637 (E-mail: ude.ogacihcu.notlag@gnawz).

Mary Sara McPeek, Departments of Statistics and Human Genetics, University of Chicago, Chicago, IL 60637 (E-mail: ude.ogacihcu.notlag@keepcm)..

REFERENCES

  • Bourgain C, Hoffjan S, Nicolae R, Newman D, Steiner L, Walker K, Reynolds R, Ober C, McPeek MS. Novel Case-Control Test in a Founder Population Identifies P-Selectin as an Atopy-Susceptibility Locus. American Journal of Human Genetics. 2003;73:612–626. [PubMed]
  • Browning SR, Briley JD, Briley LP, Chandra G, Charnecki JH, Ehm MG, Johansson KA, Jones BJ, Karter AJ, Yarnall DP, Wagner MJ. Case-Control Single-Marker and Haplotypic Association Analysis of Pedigree Data. Genetic Epidemiology. 2005;28:110–122. [PubMed]
  • Edenberg HJ, Bierut LJ, Boyce P, Cao M, Cawley S, Chiles R, Doheny KF, Hansen M, Hinrichs T, Jones K, Kelleher M, Kennedy GC, Liu G, Marcus G, McBride C, et al. Description of the Data from the Collaborative Study on the Genetics of Alcoholism (COGA) and Single-Nucleotide Polymorphism Genotyping for Genetic Analysis Workshop 14. BMC Genetics. 2005;6 Suppl. 1:S2. [PMC free article] [PubMed]
  • Elashoff M, Ryan L. An EM Algorithm for Estimating Equations. Journal of Computational and Graphical Statistics. 2004;13:48–65.
  • Godambe VP. An Optimum Property of Regular Maximum Likelihood Estimation. The Annals of Mathematical Statistics. 1960;31:1208–1211.
  • Heyde CC. Quasi-likelihood and its Application: a General Approach to Optimal parameter Estimation. New York: Springer; 1997.
  • Heyde CC, Morton R. Quasi-Likelihood and Generalizing the EM Algorithm. Journal of the Royal Statistical Society, Ser. B. 1996;58:317–327.
  • Horvath S, Xu X, Lake SL, Silverman EK, Weiss ST, Laird NM. Family-Based Tests for Associating Haplotypes with General Phenotype Data: Application to Asthma Genetics. Genetic Epidemiology. 2004;26:61–69. [PubMed]
  • Lehmann EL, Casella G. Theory of Point Estimation. 2nd edition. New York: Springer; 1998.
  • Liang KY, Zeger SL. Longitudinal Data Analysis Using Generalized Linear Models. Biometrika. 1986;73:13–22.
  • Lin DY, Zeng D. Likelihood-Based Inference on Haplotype Effects in Genetic Association Studies (with discussion) Journal of the American Statistical Association. 2006;101:89–118.
  • McCullagh P, Nelder JA. Generalized Linear Models. 2nd edition. New York: Chapman and Hall; 1989.
  • McPeek MS, Wu X, Ober C. Best Linear Unbiased Allele-Frequency Estimation in Complex Pedigrees. Biometrics. 2004;60:359–367. [PubMed]
  • Nicolae DL. Quantifying the Amount of Missing Information in Genetic Association Studies. Genetic Epidemiology. 2006;30:703–717. [PubMed]
  • Risch N, Teng J. The Relative Power of Family-Based and Case-Control Designs for Linkage Disequilibrium Studies of Complex Human Diseases I: DNA Pooling. Genome Research. 1998;8:1273–1288. [PubMed]
  • Schaid DJ. Evaluating Associations of Haplotypes with Traits. Genetic Epidemiology. 2004;27:348–364. [PubMed]
  • Slager SL, Schaid DJ. Evaluation of Candidate Genes in Case-Control Studies: a Statistical Method to Account for Related Subjects. American Journal of Human Genetics. 2001;68:1457–1462. [PubMed]
  • Thornton T, McPeek MS. Case-Control Association Testing with Related Individuals: A More Powerful Quasi-Likelihood Score Test. American Journal of Human Genetics. 2007;81:321–337. [PubMed]
  • Tzeng J-Y, Devlin B, Wasserman L, Roeder K. On the Identification of Disease Mutations by the Analysis of Haplotype Similarity and Goodness of Fit. American Journal of Human Genetics. 2003;72:891–902. [PubMed]
  • Wedderburn RWM. Quasi-likelihood Functions, Generalized Linear Models, and the Gauss-Newton Method. Biometrika. 1974;61:439–447.
  • Xie M, Yang Y. Asymptotics for Generalized Estimating Equations with Large Cluster Sizes. The Annals of Statistics. 2003;31:310–347.
  • Yuan K-H. A Theorem of Uniform Convergence of Stochastic Functions with Applications. Journal of Multivariate Analysis. 1997;62:100–109.
  • Yuan K-H, Jennrich RI. Asymptotics of Estimating Equations under Natural Conditions. Journal of Multivariate Analysis. 1998;65:245–260.
  • Zeger SL, Liang KY. Longitudinal Data Analysis for Discrete and Continuous Outcomes. Biometrics. 1986;42:121–130. [PubMed]
  • Zhang J, Schneider D, Ober C, McPeek MS. Multilocus linkage disequilibrium mapping by the decay of haplotype sharing with samples of related individuals. Genetic Epidemiology. 2005;29:128–140. [PubMed]