Extended families pose a computational problem for genome-wide association analysis of family data, particularly in studies with extensive phenotyping. Instead of using a computationally intensive variance-components method to test for genome-wide association signals, we proposed using generalized linear models with generalized estimating equations (GEE), which require much less computing time while still accounting for correlation within families. The two major concerns about this approach have been inflated type I error rates and reduced power.

Here, we show that the power of GEE was very similar to that of variance-components analysis. Variance-components analysis showed slightly more power than GEE when the locus-specific heritability was greater than 0.5%; however, GEE showed a slight advantage over variance-components analysis when locus-specific heritability was less than 0.5%.

We confirmed the speculation that GEE applied to extended families would yield slightly inflated type I error. Splitting extended families into smaller nuclear families, so that the correlation structure within a family more closely resembles an exchangeable correlation structure, was proposed to accommodate this type I error inflation. However, applying GEE to splits of extended families actually increased the type I error rate more than simply using GEE on the original extended family structure. This is most likely due to correlation across clusters (nuclear families that are related because they came from the same extended family). When type I error rates were calculated based on similar data sets with simulated independent nuclear families, this inflation disappeared. Although GEE can account for correlation among observations within each cluster, it still assumes independence between clusters. As an aside, we compared our application of GEE-EXT (GEE applied to extended family data *with* a robust variance estimator) to the same approach without a robust variance estimator (GEE-NR), and saw an increase in type I error for GEE-NR, highlighting the need for employing the robust variance estimator when applying GEE to extended pedigrees (Supplementary Table 1).
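The role of the robust variance estimator can be illustrated with a minimal sketch. It is not the analysis pipeline used in this study: it assumes an independence working correlation (rather than the exchangeable structure discussed here), a single covariate, and toy null data in which both the covariate and the trait are correlated within families. The function names and simulation settings are hypothetical. The robust (sandwich) variance sums the score contributions within each family before squaring them, which is what allows it to absorb within-family correlation that the model-based variance ignores.

```python
import math
import random


def slope_with_naive_and_robust_se(x, y, family):
    """OLS slope of y on x (with intercept), plus naive and cluster-robust SEs."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    xc = [xi - x_mean for xi in x]  # centred covariate
    yc = [yi - y_mean for yi in y]  # centred trait
    sxx = sum(v * v for v in xc)
    beta = sum(v * w for v, w in zip(xc, yc)) / sxx
    resid = [w - beta * v for v, w in zip(xc, yc)]

    # Model-based (naive) variance: treats all observations as independent.
    naive_var = sum(r * r for r in resid) / (n - 2) / sxx

    # Robust (sandwich) variance: sum score contributions within each family,
    # then square the family totals. No finite-sample correction is applied.
    score_by_family = {}
    for v, r, f in zip(xc, resid, family):
        score_by_family[f] = score_by_family.get(f, 0.0) + v * r
    robust_var = sum(s * s for s in score_by_family.values()) / sxx ** 2

    return beta, math.sqrt(naive_var), math.sqrt(robust_var)


def simulate_null_families(n_families=300, family_size=4, seed=1):
    """Toy null data: covariate and trait both share a family-level component."""
    rng = random.Random(seed)
    x, y, family = [], [], []
    for f in range(n_families):
        shared_x = rng.randint(0, 1)      # family-level part of the covariate
        shared_y = rng.gauss(0.0, 1.5)    # family-level random effect on the trait
        for _ in range(family_size):
            x.append(shared_x + rng.randint(0, 1))
            y.append(shared_y + rng.gauss(0.0, 1.0))  # no effect of x on y
            family.append(f)
    return x, y, family


x, y, family = simulate_null_families()
beta, naive_se, robust_se = slope_with_naive_and_robust_se(x, y, family)
print(f"slope={beta:.4f}  naive SE={naive_se:.4f}  robust SE={robust_se:.4f}")
```

In this toy setting the naive standard error is too small because it ignores the shared family component, so a test based on it would reject too often under the null; the robust standard error is larger and reflects the clustering.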

For quantitative traits, a mixed effect model (MEM), i.e. a random intercept model, is the other alternative to GEE. However, to achieve the optimal type I error rate, it is necessary to specify the correct variance-covariance matrix for the mixed model. The choices of variance-covariance matrix for MEM include a simple exchangeable correlation structure, a matrix of kinship coefficients, or an estimated IBD sharing matrix. When an IBD sharing matrix is used, MEM is equivalent to variance-components analysis (VCA) methods. However, as previously noted, calculating the IBD sharing matrix for every marker in the GWAS data is computationally intensive and is the major drawback of this option. To avoid the complexity of using measured genetic data, the kinship coefficient, which is the expected average correlation between two relatives across all markers in the genome, can be used. However, this overall correlation often differs from the locus-specific IBD sharing, and thus using the kinship coefficient alone also does not always yield the correct standard error estimates (data not shown). In contrast to mixed models, GEE offers the convenient use of a simple exchangeable correlation structure. This simple correlation matrix, together with the use of a robust variance estimator, yields proper standard error estimates and proper control of the type I error rate without the computational burden of VCA or MEM.
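To make the contrast between these correlation structures concrete, the following sketch builds both for a single nuclear family of two unrelated, non-inbred parents and two children. It assumes the standard additive-genetic convention that the expected correlation between relatives is twice their kinship coefficient; the averaged `rho` is purely illustrative (in practice GEE estimates the exchangeable parameter from the residuals), and all names are hypothetical.

```python
# Kinship coefficients for father, mother, child1, child2
# (assumes unrelated, non-inbred parents).
members = ["father", "mother", "child1", "child2"]
kinship = [
    [0.50, 0.00, 0.25, 0.25],
    [0.00, 0.50, 0.25, 0.25],
    [0.25, 0.25, 0.50, 0.25],
    [0.25, 0.25, 0.25, 0.50],
]

# Expected additive genetic correlation is twice the kinship coefficient.
expected_corr = [[2 * phi for phi in row] for row in kinship]


def exchangeable(n, rho):
    """Exchangeable working correlation: 1 on the diagonal, rho elsewhere."""
    return [[1.0 if i == j else rho for j in range(n)] for i in range(n)]


# Illustrative choice of rho: the mean off-diagonal expected correlation.
n = len(members)
off_diag = [expected_corr[i][j] for i in range(n) for j in range(n) if i != j]
rho = sum(off_diag) / len(off_diag)
working = exchangeable(n, rho)

for name, row_expected, row_working in zip(members, expected_corr, working):
    print(name, row_expected, [round(v, 3) for v in row_working])
```

The exchangeable matrix assigns the same correlation to every pair, including the two parents, whose expected genetic correlation is zero. This is the sense in which the working correlation is misspecified even within a nuclear family, and the mismatch grows in extended pedigrees where more distant relatives share a cluster.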

One limitation of our study was the limited number of simulation replicates, which restricts the p values and effect sizes that can be estimated for a given sample size. It is not feasible to perform thousands of simulations for GWAS-scale datasets. Thus, we chose no more than 10,000 replicates per simulation experiment, and evaluated type I error rates at three critical levels feasible for 10,000 replicates (α = 0.001, 0.01, 0.05). We found a consistent pattern of comparison across methods regardless of the type I error rate evaluated, and thus expect that our results hold when a GWAS threshold such as 5×10^{−8} is applied.
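The constraint that the number of replicates places on the significance levels that can be evaluated follows from simple binomial arithmetic, sketched below. The function name is hypothetical, and the calculation assumes independent replicates and a perfectly calibrated test, so that the number of false rejections is binomial with success probability α.

```python
import math


def monte_carlo_precision(alpha, n_replicates):
    """Expected false rejections and standard error of the empirical type I error.

    Assumes independent replicates and a calibrated test, so the rejection
    count is Binomial(n_replicates, alpha).
    """
    expected_rejections = alpha * n_replicates
    standard_error = math.sqrt(alpha * (1.0 - alpha) / n_replicates)
    return expected_rejections, standard_error


for alpha in (0.05, 0.01, 0.001, 5e-8):
    expected, se = monte_carlo_precision(alpha, 10_000)
    print(f"alpha={alpha:g}: expected rejections={expected:g}, "
          f"SE of empirical rate={se:.2e}")
```

With 10,000 replicates, roughly 10 false rejections are expected at α = 0.001, which is about the smallest level at which an empirical rate is still informative. At 5×10^{−8} the expected count is far below one, so the rate at that threshold cannot be estimated directly and has to be inferred from the pattern at larger α.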

The effect sizes were estimated consistently, as expected, across all tested methods, although SE estimates for GEE of split families tended to be somewhat underestimated, consistent with the type I error results. Although the working correlation matrix is specified incorrectly for GEE on extended families, use of the robust variance estimator provides appropriate type I error and coverage [9, 10]. Thus, we recommend that family structure remain intact, and that GEE be applied to the original family structure to test for association using an exchangeable correlation structure. Although type I error rates for GEE in extended families were slightly higher for genetic markers with low MAF (< 0.1), the type I error rate was appropriate when the genetic variants became more common in the sample.

Because extended families often do not contain complete genotype data for all family members, we also sought to ensure that missing genotypes did not affect our results regarding the appropriateness of using GEE methods in the family association context. Although we did not directly examine the effect of missing data in extended families, we did examine the relative impact of patterns of missing genotypes across methods among nuclear pedigrees. Our results showed that GEE performs similarly to variance-components analysis in the presence of variable degrees of missingness.

Ignoring correlations within families in LM doubled the type I error rate to almost 10% at a nominal significance level of 0.05. However, we did not find that correcting for inflation of the type I error rate significantly reduced the power of association tests, as has previously been reported when relatedness within families was ignored [21].

Another common method, implemented in FBAT/PBAT, can be used for association studies in extended family data, but it does not use available phenotype information in the parents [3]. Hence, FBAT/PBAT often has limited power compared to variance-component analysis or GEE when phenotypes are available on all types of family members.