|Home | About | Journals | Submit | Contact Us | Français|
A large number of rare genetic variants have been discovered with the development in sequencing technology and the lowering of sequencing costs. Rare variant analysis may help identify novel genes associated with diseases and quantitative traits, adding to our knowledge of explaining heritability of these phenotypes. Many statistical methods for rare variant analysis have been developed in recent years, but some of them require the strong assumption that all rare variants in the analysis share the same direction of effect, and others requiring permutation to calculate the p-values are computer intensive. Among these methods, the sequence kernel association test (SKAT) is a powerful method under many different scenarios. It does not require any assumption on the directionality of effects, and statistical significance is computed analytically. In this paper, we extend SKAT to be applicable to family data. The family-based SKAT (famSKAT) has a different test statistic and null distribution compared to SKAT, but is equivalent to SKAT when there is no familial correlation. Our simulation studies show that SKAT has inflated type I error if familial correlation is inappropriately ignored, but has appropriate type I error if applied to a single individual per family to obtain an unrelated subset. In the contrast, famSKAT has the correct type I error when analyzing correlated observations, and it has higher power than competing methods in many different scenarios. We illustrate our approach to analyze the association of rare genetic variants using glycemic traits from the Framingham Heart Study.
In recent years, with the advances in whole-genome sequencing technology, assessing the association of rare genetic variants with complex diseases and quantitative traits has become of great interest. Rare genetic variants may account for some of the missing heritability unexplained by genetic loci identified by genome-wide association studies (GWAS) [Eichler et al., 2010], as single variant tests used in GWAS are underpowered for rare genetic variants [Li and Leal, 2008]. To increase power, burden tests have been proposed [Li and Leal, 2008; Morgenthaler and Thilly, 2007; Madsen and Browning, 2009; Morris and Zeggini, 2010]; these tests are based on collapsing rare genetic variants in a predefined genomic region with either a rare variant indicator or a weighted score. These methods implicitly assume that all rare genetic variants in the region have the same direction of effect and even the same effect size, which may not be true. Alternatively, the data-adaptive sum test [Han and Pan, 2010] and step-up approach [Hoffmann, Marini and Witte, 2010] do not require such an assumption and use the signs from single marker test to determine the direction of effects, but both of these approaches require permutation to evaluate statistical significance.
The sequence kernel association test (SKAT) [Wu et al., 2011] proposed recently is a flexible and computationally efficient regression-based approach for rare genetic variants analysis. No assumptions about the directions of effect or the effect sizes of rare genetic variants in the region are required for SKAT. Instead of requiring permutation for the p-value computation, Davies’ method [Davies, 1980] is used to compute the p-values analytically for SKAT. SKAT has been shown to be much more powerful than traditional burden tests in many different scenarios. SKAT can be used in the association analysis of both dichotomous and continuous phenotypes.
Family-based study designs have been widely used in linkage analysis of diseases and quantitative traits [Falk and Rubinstein, 1987; Ott, 1989; Terwilliger and Ott, 1992; Spielman, McGinnis and Ewens, 1993]. In GWAS, ordinary regression approaches are not applicable to family data, because inflated type I error is observed when familial correlation is not appropriately modeled. For quantitative traits, instead of ordinary linear regressions, linear mixed effects models that take familial correlation as a random effect with covariance proportional to the kinship matrix is commonly used for single marker tests in GWAS [Almasy and Blangero, 1998; Rabinowitz and Laird, 2000]. However, burden tests and other methods for joint analysis of rare genetic variants in family samples have not been well established.
In this paper, we use the framework of linear mixed effects models to extend SKAT for rare genetic variants association analysis with quantitative traits in family data. The family-based SKAT (famSKAT) has a different form of test statistic and distribution under the null hypothesis, but has the same rationale as SKAT. When there is no familial correlation, famSKAT is equivalent to SKAT. P-values for famSKAT are also calculated analytically without requiring permutation.
We demonstrate in our simulation studies that SKAT has inflated type I error in family samples when familial correlation is not appropriately considered. By contrast, famSKAT does not suffer from this issue and has correct type I error. We also show that famSKAT is more powerful than applying SKAT to an unrelated subset of the sample. For mixed datasets with both unrelated and related individuals, as the proportion of unrelated individuals decreases, the difference in power between SKAT and famSKAT increases, with famSKAT being always the more powerful approach of the two. Thus, by using famSKAT there is no need to reduce sample size by selecting an unrelated subset of individuals. Finally, we illustrate our approach by assessing the association between rare genetic variants using two glycemic traits in the Framingham Heart Study.
We first define notation and assumptions before we derive the SKAT statistic that accounts for familial correlation. Assuming a sample size of n, let the n × 1 vector of the quantitative trait y follow a linear mixed effects model
where X is an n × p covariate matrix, β is a p × 1 vector consisting of fixed effects parameters (an intercept and p − 1 coefficients for covariates), G is an n × q genotype matrix for q rare genetic variants of interest, γ is a q × 1 vector for the random effects of rare variants, δ is an n × 1 vector for the random effects of familial correlation, which is added to the SKAT model, and ε is an n × 1 vector for the error. The vector of error ε and the random effects γ and δ are assumed normally distributed and uncorrelated with each other:
where W is the pre-specified diagonal weight matrix for the rare variants of q × q, Φ is twice the kinship matrix of size n × n obtained from family information only, I is the identity matrix of size n × n, and τ, , are corresponding variance component parameters. In this parameter setting, we are interested in testing H0: τ = 0 versus H1: τ > 0, which is equivalent to testing H0: γ = 0 versus H1: γ ≠ 0. This is a variance component score test in the linear mixed effects model, which is a locally most powerful test [Wu et al., 2011; Lin, 1997].
Under these assumptions, the phenotypic variance can be written as
The log likelihood for the linear mixed effects model is
To derive a score test for H0:τ = 0, we first take the derivative with respect to τ to get
If we use the restricted maximum likelihood instead of the maximum likelihood method, we would get a different first term, but the same second term. In both cases, if we replace Σ by its consistent estimator, and treat genotype matrix G as fixed, then the first term in the score function is fixed and independent of phenotype data y. Following the same rationale in the derivation of the SKAT score statistic [Liu, Lin and Ghosh, 2007; Kwee et al., 2008], we take twice the second term to be derived as our test statistic.
Under the null hypothesis τ = 0, we can estimate
by fitting the null linear mixed effects model
The maximum likelihood estimators can be obtained using the function lmekin from R package kinship. We replace β, and (and hence Σ) by their maximum likelihood estimators and take
as the famSKAT test statistic. Under the null hypothesis, the variance of the residuals is
We note that even though the null model, test statistic, residual variance and null distribution of famSKAT have different forms compared to those of SKAT, they are directly connected. Actually, if we add a restriction on the model, famSKAT is equivalent to SKAT. Then
where is estimated from the null linear model
and famSKAT statistic becomes
with distribution under the null hypothesis
where λi are the eigenvalues of the matrix . They are proportional to SKAT statistic and null distribution matrix with the coefficient . We favor this form of null distribution matrix, rather than the form proposed in Wu et al. , because usually the sample size n is larger than the number of genetic variants of interest q, the non-zero eigenvalues of and are the same, but the first matrix is of size q × q, while the second matrix is of size n × n; and taking the square root of the diagonal matrix W is computationally much easier than taking the square root of P0.
We note that famSKAT can also be used when we want to provide a known heritability coefficient h2 externally, rather than estimating it from the data. By the reparametrization
when h2 is known, we can use the generalized least square method to estimate only σ2 under the null model. Then we can follow the rest of the famSKAT procedure to perform the test.
To evaluate the type I error, we performed several simulation studies under the null hypothesis of no genetic association. We compared four approaches: famSKAT, burden test accounting for familial correlation (famBT), SKAT which only takes the unrelated subset of the sample (unrSKAT) and SKAT. We used Kuonen’s saddlepoint method [Kuonen, 1999] to compute the p-values for famSKAT, unrSKAT and SKAT. For famBT, we fit the linear mixed effects model
where wj is the jth element on the diagonal of the pre-specified weight matrix W, Gj is the jth column of the genotype matrix G, the scalar γ is the fixed effect for the weighted genetic score, and y, X, β, δ, ε are defined in the same parameter setting as in famSKAT. The genotype effect in this model can be tested as fixed effect test H0: γ = 0 versus H1: γ ≠ 0.
We set the heritability of the trait
For each parameter setting, we simulated 100 genotype datasets with a total sample size of 1000 and 20 single nucleotide polymorphisms (SNP) with minor allele frequency (MAF) in the founders randomly sampled from a uniform distribution of 0.005 to 0.05, and with low (r = 0.1), moderate (r = 0.5), or high (r = 0.7) linkage disequilibrium (LD) between adjacent SNPs in the founders. The LD correlation between farther SNPs decays as an autoregressive model with order 1. We simulated haplotypes for unrelated founders with desired MAF and LD structure using the same procedure as HapSim [Montana, 2005], then we passed down the haplotypes to the next generation to simulate sib pairs, and took the remaining founders as unrelated individuals. Thus we created genotype datasets mixed with unrelated individuals and sib pairs, and let the proportion of unrelated individuals decrease from 75% to 50%, 25%, 0%. For each genotype dataset, 10,000 phenotype datasets including covariates were simulated by using the model
where age is a vector of continuous covariate generated from a normal distribution with mean 50 and standard deviation 5, sex is a vector of dichotomous covariate generated from a Bernoulli distribution with probability 0.5, ε follows a multivariate normal distribution with means 0 and covariance matrix Σ, where
We calculated the p-values of famSKAT, famBT, unrSKAT and SKAT by using the Wu weights [Wu et al., 2011], corresponding to the square of a beta density function of the observed MAF in the founders with parameters 1 and 25. We computed the empirical type I error at α levels of 0.01, 0.001 and 0.0001 by counting the proportion of p-values less than or equal to the corresponding α level in the 1 million genotype-phenotype datasets.
To evaluate the power of famSKAT, famBT and unrSKAT, we set the heritability of phenotype h2 = 0.5 and LD between adjacent SNPs in the founders r = 0.5, and performed simulations under different scenarios. For each parameter setting, we simulated 100 genotype datasets with a total sample size 1000 and 20 SNPs with MAF in the founders randomly sampled from a uniform distribution of 0.005 to 0.05. Similar to the null simulation setting, we simulated genotype datasets mixed with unrelated individuals and sib pairs, and changed the proportion of unrelated individuals from 75% to 50%, 25%, 0%. For each genotype dataset G, 10,000 phenotype datasets including covariates were simulated by using the model
where age, sex and ε are generated in the same way as in the type I error simulations, γ is a vector consisting of the effect sizes of the causal SNPs. We varied the proportion of causal SNPs from 20% to 50% and 80%, and we simulated both same and opposite directions of effects. Causal SNPs were randomly selected out of the 20 SNPs for each phenotype replicate, and in each parameter setting the effect sizes of causal SNPs were determined by
where MAFi is the MAF used to generate the genotype dataset for causal SNP i, and c is a constant for all causal SNPs in each phenotype replicate, calculated as
where R2, the total proportion of variance explained by all causal SNPs, was fixed at 1% for scenarios when all causal SNPs had effects in the same direction, and 5% for scenarios when 50% of the causal SNPs had positive effects and 50% had negative effects. D is the LD correlation matrix for the 20 SNPs, and ν is a vector indicating the directions of causal SNP effects in each replicate. We used the same weights for famSKAT, unrSKAT and famBT, which were the Wu weights calculated from the observed MAF in founders. The empirical power was evaluated at the α level of 0.001.
Table 1 shows the empirical type I errors of famSKAT, famBT, unrSKAT and SKAT at different α levels in 3 LD scenarios and 4 scenarios for the proportion of unrelated individuals. The results suggest that when SKAT is directly applied to the full sample with correlated individuals, it has inflated type I error at all α levels. The empirical type I error tends to be higher when LD decays. In the contrast, famSKAT, famBT and unrSKAT retain the correct type I errors. Thus, in subsequent power simulations we only investigated these three approaches. The distributions of the p-values from the four approaches for the scenario of LD between adjacent SNPs r = 0.5 and proportion of unrelated individuals 0% were shown in Figure 1. We found that famSKAT, famBT and unrSKAT all had uniform distribution of the p-values, while the distribution of the pvalues from SKAT was more likely to be small, explaining the inflated type I error.
Power simulation results of famSKAT, famBT and unrSKAT are shown in Figure 2. In all scenarios, 20 SNPs were analyzed. We simulated scenarios in which the proportion of causal SNPs was 20%, 50% or 80%, with effects in the same or opposite directions. As the proportion of unrelated individuals decreases from 75% to 50%, 25% and 0%, the sample size for unrSKAT also decreases from 875 to 750, 625 and 500. As a result, the power of unrSKAT also drops. In contrast, the power of famSKAT and famBT remains almost constant, regardless of the proportion of unrelated individuals. FamBT has higher power than famSKAT when the proportion of causal SNPs is greater than or equal to 50% and all causal SNPs have the same direction of effects, but it has almost no power in scenarios when causal SNPs have opposite directions of effects. Generally, famSKAT performs well in all these scenarios, suggesting that famSKAT is an omnibus method which does not have compromised power for different mixtures of related and unrelated individuals.
We used genotype data from Framingham SNP Health Association Resource (SHARe) and phenotype data from the Framingham Heart Study to analyze the association with two glycemic traits: fasting glucose and log-transformed fasting insulin. We restricted our analyses to SNPs with MAF less than 5% within 100kb of 16 gene regions selected for the prior association with fasting glucose, and 2 genes reported to be associated with log-transformed fasting insulin [Dupuis et al., 2010]. We adjusted the fasting glucose analysis for age and sex, and logtransformed insulin was additionally adjusted for body mass index. We performed famSKAT and famBT for all individuals with both genotype and phenotype available, and performed SKAT for only a subset of unrelated individuals. For comparison purpose, we calculated the MAF using a subset of unrelated individuals and applied Wu weights to all three methods.
We investigated the association between fasting glucose and rare genetic variants in 16 gene regions previously shown to be associated in large scale GWAS [Dupuis et al., 2010]: ADCY5, ADRA2A, C2CD4B, CRY2, DGKB-AGMO, FADS1, G6PC2, GCK, GCKR, GLIS3, MADD, MTNR1B, PROX1, SLC2A2, SLC30A8, and TCF7L2. The results are shown in Table 2. After adjusting for multiple testing using a Bonferroni correction, we detected no association between fasting glucose and rare genetic variants in the selected gene regions at the family-wise α level of 0.05, for all three methods. CRY2 reaches the nominal significance level with a p-value of 0.0381 using famSKAT and 0.0085 using famBT, and G6PC2 reaches the nominal significance level with a p-value of 0.0418 using famSKAT, but none of these gene regions reaches nominal statistical significance when evaluated using unrSKAT.
We also investigated the association between log-transformed fasting insulin and rare genetic variants in 2 gene regions previously shown to be associated in large scale GWAS [Dupuis et al., 2010]: GCKR and IGF1. After adjusting for multiple testing using a Bonferroni correction, IGF1 shows association with log transformed fasting insulin with a nominal p-value of 0.0232 using famSKAT and 0.0234 using famBT. Neither gene reaches even the nominal significance level using SKAT.
Table 2 also shows that the sample size for unrSKAT is much smaller than that for famSKAT and famBT, because even though the Framingham Heart Study is not a family-based cohort, there are many families in the study. Thus, by selecting unrelated individuals we greatly reduced the sample size. Because some SNPs with rare minor alleles may not be polymorphic in the subset of unrelated individuals, for some gene regions the number of SNPs for unrSKAT is smaller than the number of SNPs for famSKAT and famBT.
We also performed a genome-wide sliding window analysis on these two traits, as well as logtransformed HOMA-IR and HOMA-B, using SHARe genotype data. We only included SNPs with MAF less than 5% and ran the analysis using a sliding window of 500kb, with 250kb overlap each with previous and subsequent windows. We removed windows with 0 or 1 SNP, resulting in 10,546 windows for all autosomes with the number of SNPs ranging from 2 to 76 with median 18. No window reached the genome-wide significance using famSKAT, famBT or unrSKAT. The Q-Q plots for famSKAT are shown in Figure 3. There is minimal inflation of the p-values from this genome-wide analysis.
The computation time of famSKAT depends on both the sample size and the number of SNPs. The empirical run time of famSKAT, famBT and SKAT in analyzing sib pairs with indicated total sample sizes on a single computing node with 2.33 GHz CPU and 4 GB memory is shown in Figure 4. With a small sample size, the limiting step in famSKAT is fitting the null linear mixed effects model, so the computation time is comparable to that of famBT, which also requires fitting a linear mixed effects model. As the sample size increases, all three methods require more computation time, and the time of famSKAT and SKAT increases dramatically. Both famSKAT and SKAT require matrix calculation, and the limiting step in famSKAT becomes inverting the matrix , which takes about 90% of the computation time when the sample size is 5000. The genome-wide sliding window analysis of SHARe genotype data using a sliding window of 500kb takes about 5 hours for chromosome 1 on a single computing node with 2.33 GHz CPU and 4 GB memory.
In this paper, we propose famSKAT as an extension of SKAT which can be applied to data with familial correlation. We demonstrate that famSKAT is a general and flexible variance component score test approach, which is equivalent to SKAT when the familial variance component is set to 0. It can be applied to quantitative traits with unknown or known heritability.
Compared with famBT, famSKAT is advantageous in power when the proportion of causal SNPs in a genomic region is small, and when not all causal SNPs have the same direction of effects. As expected, famBT outperforms famSKAT when the proportion of causal SNPs is greater than or equal to 50% and all these SNPs have positive effects, but the performance of famSKAT in these scenarios is still satisfactory. In real data analysis, when we do not have sufficient a priori information about the proportion of causal SNPs or the directions of effects, famSKAT would be a better choice over famBT.
We show that when SKAT is inappropriately applied to correlated data, it has inflated type I error. Thus, the best we can do for SKAT is to select unrelated individuals from the whole sample. However, our power simulations demonstrate that this strategy is not in favor of power in many scenarios. In the contrast, we do not need to reduce our sample size if we use famSKAT. Our real data example from the Framingham Heart Study also shows that SKAT does not even have an observation which reaches the nominal significance level of 0.05.
Common genetic variants at 16 gene regions we chose for fasting glucose and 2 gene regions we chose for log-transformed fasting insulin have been shown to be associated with either trait in large GWAS [Dupuis et al., 2010]. However, we do not have solid evidence to show that there is strong association between either trait and the rare genetic variants in these regions. We noticed that the sample size in this analysis was far smaller than in Dupuis et al. , which reduced the power. In addition, the SHARe project was not specifically designed for rare variants analysis, so most SNPs in our genotype dataset are common SNPs which were excluded from the analysis. With the progress of sequencing studies, we should be able to identify much more rare variants and perform the candidate gene or even genome-wide analysis again using the new genotype dataset with dense rare genetic variants. On the other hand, some gene regions may be truly associated with the trait only through common SNPs, so we do not expect to identify the association with rare genetic variants for all these gene regions we selected.
With the development in sequencing technology and the lowering of the cost, sequencing data which contain a lot of rare genetic variants have become available, not only for case-control studies, but also for cohorts that include family members. Based on SKAT, one of the most powerful rare genetic variants analysis methods to date, we developed famSKAT in the hope of facilitating rare genetic variants analysis to identify novel genes associated with quantitative traits. With famSKAT, cohorts with family data can perform the association analysis with rare genetic variants, using as much data as possible, without having to select unrelated individuals from the pedigree. FamSKAT has been implemented in R, and source code is available at http://www.bumc.bu.edu/linga/research/publications/famskat/.
For calculating the p-values, we recommend using Kuonen’s saddlepoint method [Kuonen, 1999] instead of Davies’ method [Davies, 1980]. As a method based on numerical integration, Davies’ method requires specifying the accuracy. When the p-value is expected to be very small, Davies’ method cannot calculate it accurately. Table 3 shows this numerical issue in a power simulation context. Davies’ method suffers from negative and zero p-values (and possibly significant roundoff error) regardless of the accuracy specified. In the contrast, Kuonen’s method does not have such issues. Thus, if we perform a genome-wide rare variants analysis using sequence data, from which we expect extreme low p-values, Kuonen’s method might be a better choice over Davies’ method.
Even though famSKAT was developed for analyzing rare genetic variants, it can also be used for common variant analysis, combined common and rare variant analysis or conditional association analyses. Depending on the research hypothesis, common variants can be treated as fixed effects in the model, or random effects along with the rare genetic variants. During the review of this paper, we became aware that Schifano et al.  had recently developed a SNP set association analysis approach for common variants analysis in family data, which is essentially equivalent to our method. The use of famSKAT combined with the collapsing of some very rare genetic variants such as singletons is also possible. Similar with SKAT, external weights based on annotation information or functional prediction can be incorporated to further boost the power.
The authors thank Dr. Thomas Lumley for his insights in Kuonen’s saddlepoint method. This research was partially supported by NIH awards R01 DK078616, U01 DK85526 and K24 DK080140. A portion of this research was conducted using the Linux Clusters for Genetic Analysis (LinGA) computing resources at Boston University Medicine Campus. The Framingham Heart Study is conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with Boston University (Contract No. N01-HC-25195). This work was partially supported by a contract with Affymetrix, Inc for genotyping services (Contract No. N02-HL-6-4278).