|Home | About | Journals | Submit | Contact Us | Français|
Variance-component analysis (VCA), the traditional method for handling correlations within families in genetic association studies, is computationally intensive for genome-wide analyses, and the computational burden of VCA, a likelihood-based test, increases with family size and the number of genetic markers. Alternative approaches that do not require the computation of familial correlations is preferable, provided that they do not inflate type I error or decrease power. We performed a simulation study to evaluate practical alternatives to VCA that use regression with generalized estimating equations (GEE) in extended family data. We compared the properties of linear regression with GEE applied to an entire extended family structure (GEE-EXT) and GEE applied to nuclear family structures split from these extended families (GEE-SPL) to variance-components likelihood-based methods (FastAssoc). GEE-EXT was evaluated with and without robust variance estimators to estimate the standard errors. We observed similar average type I error rates from GEE-EXT and FastAssoc compared to GEE-SPL. Type I error rates for the GEE-EXT method with a robust variance estimator were marginally higher than the nominal rate when the MAF was < 0.1, but were close to nominal rate when MAF ≥ 0.2. All methods gave consistent effect estimates and had similar power. In summary, the GEE framework with the robust variance estimator, the computationally fastest and least data management intensive, appears to work well in extended families and thus provides a reasonable alternative to full variance components approaches for extended pedigrees in the GWAS setting.
Genetic association studies in family data are complicated by the non-independence of family members. Special attention is required to handle these within-family correlations. When nuclear families are used, the tests of association are straightforward. The child’s genotype and phenotype data are analyzed conditioning on the parental genotypes (e.g. conditional logistic regression, transmission disequilibrium test (TDT)  and TDT-variant tests e.g. FBAT  and PBAT ). These methods do not use the phenotypic information on the parents, even when it is available. When such data do exist, especially in extended pedigrees, methods, such as variance-components analysis, that incorporate the phenotypic information of the parents to test for association can yield more power. However, modeling the genotype and phenotype data from all family members and allowing for the familial correlation requires likelihood-based methods such as variance-components analysis (VCA). Variance-components analysis is widely accepted as the most powerful method for association tests with genotype and phenotype information in a family-based study. However, for large families with a large number of markers, VCA can be computationally intractable. Millions of SNPs used in genome-wide association studies (GWAS) limit the analysis of extended families with full likelihood methods because of the number of genotype vectors and genotype correlations that must be handled. To facilitate analysis of extended pedigree data, family-based studies with extended families are often split into smaller nuclear families for analysis, losing some genetic information about relationships across those nuclear families.
To address the computation issue, a score version of variance-components analysis has been proposed as a computationally efficient alternative to computing the familial likelihood . However, to properly correct for correlation within families, variance-components analysis still requires the estimation of the identity by descent (IBD) allele sharing matrix, which is extremely time-consuming. Other variations of VCA have been proposed, such as the Genome-wide Rapid Association using Mixed Model and Regression (GRAMMAR) method , a compressed mixed linear model (TASSEL) , and the recent Efficient Mixed-Model Association (EMMA and EMMAX) [7,8]. These methods optimize the IBD sharing matrix calculation and offer some computational improvement to VCA. Using information based on genetic markers to adjust for correlation allows these methods to properly control for type I error rate. However, even an efficient algorithm to calculate relationship from genetic markers is still computational complex.
To avoid computational complexity from modeling relationship based on genetic markers, a simpler correlation model, which does not require the estimation of the IBD sharing matrix, that uses regression with generalized estimating equations (GEE) [9,10] can be used. GEE is a faster method that can accommodate correlation within families and has been used previously in genome-wide association studies . GEE accommodates familial clustering by assuming an empirical correlation structure rather than calculating a family-specific correlation, and a robust variance estimator is used in conjunction with GEE to handle misspecified correlation structure. However, the variance of the association parameters tends to be underestimated, and type I error rate may be increased .
One explanation for an inflated type I error rate could be the misspecification of the correlation structure (all types of member pairs are typically given the same correlation) which makes GEE less efficient . Splitting extended families into nuclear families may help to homogenize the pairwise correlations within a cluster for GEE and thus might improve the efficiency of GEE in the family-based association tests. However, in the GWAS setting, splitting the data into nuclear families requires additional laborious data manipulation, and additional storage space. Although many software packages have been developed to handle this task, it is not a trivial step before association analysis can be carried out. Since the original data were collected as “extended pedigree” format, it is easier to analyze the data as is, i.e. in “extended pedigree” format, instead of splitting the data into nuclear families. Furthermore, it is possible that splitting extended families into nuclear families could introduce additional correlation among newly generated nuclear families.
Thus, researchers with both phenotype and genotype data on extended families planning to carry out GWAS are faced with practical considerations regarding analyst and computer time and seek an approach that is fast and efficient, while accommodating family correlations and maintaining appropriate test size and power. Here, we compare relatively fast GEE methods to variance components methods to examine the properties of each and the time required under a set of scenarios modeled after a practical example in our applied research.
We used computer simulation methods to compare the type I error rates and power to detect associations for analyses of family data using 1) simple linear regression (LM) a method that does not accommodate familial correlations as a baseline, 2) GEE, with clusters based on extended family structure (GEE-EXT), 3) GEE with clusters based on nuclear families derived from splitting original extended family structures (GEE-SPL), and 4) a score test based on variance-components analysis (FastAssoc). A total of 10,000 replicates were tested for each pair of genotype-phenotype associations.
Family structure was based on families from the Johns Hopkins GeneSTAR study , with the intent of applying these approaches to GWAS applications in this data set. The study is comprised of 3,505 people in 498 families, with 242 two-generation families (22.1%) and 260 three-generation families (77.9%). On average, there were 6.98 members per extended family (ranging from 3 to 26).
We split extended families into nuclear families by duplicating 525 individuals from 260 three-generation families to form 964 nuclear families. We maintained the number of people with genotype and phenotype data by removing phenotypes of duplicated individuals. After splitting extended families, we reduced the number of relative pairs with complete data, i.e. pairs of spouses (families from participants with more than one spouse were split), parent-offspring pairs, and distantly related relatives, but maintained the number of sibling-sibling pairs within nuclear families (Table 1). The resulting nuclear families had an average of 3.93 members (ranging from 3 to 12).
The nuclear families resulting from a split of a single extended family are potentially correlated. Therefore, to compare a data set with a fixed number of nuclear families derived from splitting to one with the same number of nuclear families where all are independent, we simulated independent sets of nuclear families that were completely unrelated to each other. The simulated nuclear families had homogeneous structures with modest size consisting of two parents and four children. A total of 584 families were simulated, so that we had a total of 3,504 individuals in the full data set, similar to the number in our other simulations above.
We then compared type I error rates from the analyses based on nuclear families split from extended families to type I error rates for simulated independent nuclear families of the same size and total numbers.
For each simulation, we randomly generated 4 independent bi-allelic markers with minor allele frequencies (MAF) of 0.05, 0.10, 0.20, and 0.40 in the founders. A gene dropping algorithm was implemented in R version 2.9  to assign genotypes in the offspring recursively, based on Mendelian expectations. These markers were used as input genotype files for subsequent simulations of phenotype relationships.
We used MERLIN to simulate phenotype-genotype associations in family data [4,15]. For each genotype-phenotype association, we simulated seven phenotypes with locus-specific heritabilities of 0, 0.01, 0.05, 0.1, 0.5, 1, and 2% to mimic heritability levels of complex genetic diseases consisting of both polygenic models and major gene models . All phenotypes were simulated to have a trait heritability of 40%.
Because most studies have missing information for some family members, we also explored the effect of missing data on type I error rates for GEE compared to FastAssoc. In the simulated independent nuclear families, we randomly generated 6 levels of missing genotype data: complete data (COMP); missing one parent (M1P); missing both parents (M2P); missing one parent and one child (M1P1C); missing both parents and one child (M2P1C); and missing two parents and two children (M2P2C).
We tested for association between each genotype-phenotype pair using 1) simple linear regression (LM), 2) GEE clustered by extended family structure (GEE-EXT), 3) GEE clustered by nuclear families derived from splitting the original extended family structures (GEE-SPL), and 4) a score test based on variance-components analysis (FastAssoc). We denote phenotype, y, and genotype, g, for each observation (yij, gij), for member j, where j = 1,…,k…,ni, of family i, where i = 1, …, N.
Linear regression was chosen as a test of association to provide a reference, as we expect that ignoring correlation within a family would inflate the type I error rate. LM tests for genetic association at each SNP, gij, by modeling the mean phenotype value as, μij = E(Yij) = μ + βgij + εij, where μ is the population mean. β is the additive effect for each gij SNP, which is coded as 0, 1, or 2. Each observation was treated as if they were independent regardless of the fact that they are correlated within families, i.e. assuming. Corr(εij,εik) = 0. The information concerning their relationship within each family is discarded leading to a biased estimate of the variance of the regression parameter estimates and the test statistics [9,10].
Generalized estimating equations (GEE) were proposed as an extension of generalized linear modeling to handle clustered data based on a quasi-likelihood approach [9,10]. For our genetic association tests, we used the identity link function, and modeled the mean phenotypic value conditional on the genotype, gij as, μij = E(Yij gij) = g’ijβ, with an exchangeable working correlation matrix, Corr(Yij,Yik) = α for all i and j ≠ k to accommodate correlation among family members. Robust variance estimators were used to estimate the standard errors of the test statistics to improve robustness to model specification error [17,18].
We explored the effect of family structure in defining a correlation structure: a cluster based on extended families versus a cluster based on nuclear families resulting from splits of extended families. In the case of extended family structures, there are many relative types of among family members from closely related, e.g. parent-offspring, sibling-sibling, distantly related, e.g. cousin, avuncular, to unrelated, i.e. spousal relationships, with the expected genetic correlations of 1/2, 1/2, 1/2, 1/8, 1/8, and 0, respectively. In nuclear families there are only three types of familial correlation: parent-offspring, spouse, and sibling-sibling, with the expected genetic correlations of 1/2, 0, and 1/2. Therefore, the exchangeable correlation structure is more appropriate for nuclear families than for extended families.
To explore the effects of the robust variance estimator on type I error rate for GEE-EXT, we compared the type I error rate and power for GEE-EXT when using a non-robust standard variance estimator.
We used a score version of variance-components analysis for testing genetic associations in extended family data [4,15]. Briefly, the expected phenotype E(Yij) is modeled as , where μ is the population mean, while βg is the additive effect to be estimated for each SNP. The expected genotype score, , is imputed based on the information from observed neighbor genotype when gij is missing. The corresponding variance-covariance matrix for family i, Ωijk, which allows for correlations of phenotypes for all specific type of relative pairs in a family, is defined as
where πijk represents identical-by-descent (IBD) allele sharing, and ijk represents the kinship coefficient. Conventionally, the expected phenotype and the covariance matrix are used to maximize the multivariate normal likelihood, , with and without and βg. The likelihood ratio test is then used as a test of association for each SNP. To speed up the test for association, E(yi)base and are estimated once without and βg to form a simplified variance-components model. Then, an alternate test of association is calculated as,
which can be compared to a χ2 with 1 df.
Analyses using LM, GEE, and linear mixed effect models were done in R version 2.9 using the functions lm() in the stats package , gee() in gee package , lmekin() in the kinship package . Variance-components analysis was done in MERLIN using the fastAssoc option [4,15].
We calculated type I error rates as the proportion of replicates under the null hypothesis (locus specific heritability set to 0) with an association test p value ≤ 0.001, 0.01 and 0.05. For estimates of power at the α = 0.05 level, we first adjusted the appropriate type I error critical value for each testing method based on null simulations, and then used the value at which the type I error rate was 5% as the significance threshold for association. Power was then calculated as the proportion of replicates under a particular alternative that had p-values less than this empirically derived critical value.
Across the range of simulated allele frequencies, LM, which ignored the correlation among family members, had inflated type I error rates as expected (Table 2). In contrast, variance-components analysis had type I error rates equal to the nominal type I error rate across the range of allele frequencies. Type I error rates for the GEE-EXT method (GEE clustered based on the original extended family structure) were slightly higher than the nominal type I error rate when the MAF was < 0.10, but were close to the nominal rate when MAF ≥ 0.20. GEE-EXT showed type I error rates similar to FastAssoc in most situations. The type I error rate for the GEE-SPL method (GEE clustered based on nuclear families resulting from splits of extended families) was slightly inflated. The type I error from GEE applied to the simulated data of independent nuclear families with same number of individuals, however, yielded type I error rates equal to the nominal type I error rate across the range of allele frequencies (Table 3). Thus, it appears that correlation across nuclear families introduced by splitting extended families inflates the type I error for GEE.
To estimate the power, we assumed a fixed total heritability of 40% while varying the locus-specific heritability from 0.01% to 2%. To account for the effect of inflated type I error rates for some of the methods, we used a corrected threshold to maintain 5% type I error when assessing power (Table 4). All methods yielded very similar power, although the power from a variance-components analysis was slightly higher than the others (Figure 1, and Supplementary Table 1).
The estimated effect sizes from each method were highly correlated and nearly identical (Pearson’s correlation ≥ 0.995; Figure 2). Moreover, the 95% CI of the estimated effect size covered the simulated effect sizes 94.7, 94.0, 93.5, and 90.2 % of the times for FastAssoc, GEE-EXT, GEE-SPL, and LM, respectively (Table 5). Using FastAssoc for comparison, GEE-EXT tended to slightly overestimate the standard errors; whereas, GEE-SPL yielded standard errors that were too small (Figure 3). The standard error from LM was under-estimated as expected; hence, the test statistic was over-estimated and had a higher type I error rate.
To examine the effect of missing data in nuclear families, we varied the level of missing data from 1 to 4 persons out of 6 in a family. We observed that missing data did not affect type I error rates for GEE and FastAssoc, and these two methods had stable type I error rates across the range of missing data patterns (Figure 4). In contrast, type I error rates observed for LM were lower as the number of missing people per family was higher and the correlation among family members became less relevant (result not shown).
Extended families are a computational problem for GWAS analysis of family data, particularly in studies with extensive phenotyping. Instead of using a computationally intensive variance-components method to test for genome-wide association signals, we proposed to use generalized linear models with generalized estimating equations, which require much less computing time while still accounting for correlation within families. The two major concerns for this approach have been inflated type I error rates and reduced power.
Here, we show that the power of GEE was very similar to variance-components analysis. Variance-components analysis showed slightly more power than GEE when the locus specific heritability was greater than 0.5%; however, GEE showed a slight advantage over variance-components analysis when locus specific heritability was less than 0.5%.
We confirmed the speculation that GEE applied to extended families would yield slightly inflated type I error. Splitting extended families into smaller nuclear families to make the correlation structure within a family more closely resemble an exchangeable correlation structure was proposed to accommodate this type I error inflation. However, applying GEE on splits of extended families actually increased the type I error rate more than simply using GEE on the original extended family structure. This is most likely due to correlation across clusters (nuclear families that are related because they came from the same extended family). When type I error rates were calculated based on similar data sets with simulated independent nuclear families, this inflation disappeared. Although GEE can account for correlation among observations within each cluster, it still assumes independence between each cluster. As an aside, we compared our application of GEE-EXT (GEE applied to extended families data with a robust variance estimator) to the same approach without a robust variance estimator (GEE-NR), and saw an increase in type I error, highlighting the need for employing the robust variance estimator when applying GEE to extended pedigrees (Supplementary Table 1).
For quantitative traits, a mixed effect model (MEM), i.e. a random intercept model, is the other alternative to GEE. However, to achieve the optimal type I error rate, it is necessary to specify the correct choice of variance-covariance matrix for the mixed model. The choices of variance-covariance matrix for MEM range from a simple exchangeable correlation structure, a matrix of kinship coefficients, or an estimated IBD sharing matrix. When an IBD sharing matrix is used, MEM is equivalent to VCA methods. However, as previously noted, calculating the IBD sharing matrix for every marker in the GWAS data is computationally intensive and is the major drawback of this option. To alleviate the complexity from using measured genetic data, the kinship coefficient, which is the expected average correlation between two relatives across all markers in the genome, can be used. However, this overall correlation is often different from the locus-specific IBD sharing and thus using the kinship coefficient alone also does not always yield the correct standard error estimates (data not shown). In contrast to mixed models, GEE offers the convenient use of a simple exchangeable correlation structure. This simple correlation matrix together with the use of robust variance estimator yields proper standard error estimates and a proper control for type I error rate without computational burdens of VCA or MEM.
One limitation of our study was the limited number of simulation replicates performed, and thus the limit on p values and effect sizes that can be estimated for a given sample size. It is not feasible to perform thousands of simulations for GWAS-scale datasets. Thus, we chose no more than 10,000 replicates per simulation experiment, and evaluated type I error rates at three critical levels feasible for 10,000 replicates (α=0.001, 0 .01, 0 .05). We found a consistent pattern of comparison across methods regardless of type I error rate evaluated and thus expect that our results hold when a GWAS threshold such as 5×10−8 is applied.
The effect sizes were quite consistently estimated as expected across all tested methods, although SE estimates for GEE of split families tended to be somewhat underestimated, consistent with the type I error results. Although the working correlation matrix is specified incorrectly for GEE on extended families, use of the robust variance estimator provides appropriate type I error and coverage [9,10]. Thus, we recommend that family structure remain intact, and that the GEE method can be applied to the original family structure to test for association using an exchangeable correlation structure. Although type I error rates for GEE in extended families were slightly higher for genetic markers with low MAF (< 0.1), the type I error rate was appropriate when the genetic variants became more common in the sample.
Because extended families often do not contain complete genotype data for all family members, we also sought to ensure that missing genotypes did not affect our results regarding the appropriateness of using GEE methods in the family association context. Although we did not directly examine the effect of missing data in extended families, we did examine the relative impact of patterns of missing genotypes across methods among nuclear pedigrees. Our results showed that GEE performs similarly to variance-components in the presence of variable degrees of missingness.
Ignoring correlations within families in LM doubled the type I error rate to almost 10% at a nominal significance level of 0.05. However, we did not find that correcting for inflation of type I error rate significantly reduced the power of association tests as has previously been reported when relatedness in family was ignored .
Another common method, implemented in FBAT/PBAT, can be used for association studies in extended family data, but it does not use available phenotype information in the parents . Hence, FBAT/PBAT often has limited power compared to variance-component analysis or GEE when phenotypes are available on all types of family members.
For genome-wide association analysis (GWAS) of family data, using a method such as GEE to screen for association signals can significantly reduce the amount of time needed to identify the potential signals in the genome. We showed that GEE, a faster alternative to variance-components, works well in extended family structures without the need to split the families into nuclear families. For loci with a locus-specific heritability of at least 1%, FastAssoc has a slight power advantage over GEE; however, the difference in power is minimal (< 2 %). Thus, GWAS using GEE should not limit the ability to detect true associations, and is only slightly more prone to false positives. Control of this slightly inflated type I error can be achieved by retesting the potential signals using the gold standard test with proper type I error rate.
Supplementary Table 1: Type I error rate at P ≤ 0.05 and corrected power to detect association signals using the critical value corresponding to a 5% type I error for that method
This work was supported in part by grants from the National Institute of Health/National Heart, Lung, and Blood Institute Grant U01 HL72518 and R01 HL087698 (RAM, LRY, LCB, DMB), Division of the Intramural Research of the National Human Genome Research Institute, NIH, the Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health and Committee on Higher Education, and the Royal Thai Government scholarship (BS).