Home | About | Journals | Submit | Contact Us | Français |

**|**HHS Author Manuscripts**|**PMC3642218

Formats

Article sections

Authors

Related links

Genet Epidemiol. Author manuscript; available in PMC 2014 February 1.

Published in final edited form as:

Published online 2012 December 26. doi: 10.1002/gepi.21703

PMCID: PMC3642218

NIHMSID: NIHMS452466

The publisher's final edited version of this article is available at Genet Epidemiol

See other articles in PMC that cite the published article.

A large number of rare genetic variants have been discovered with the development in sequencing technology and the lowering of sequencing costs. Rare variant analysis may help identify novel genes associated with diseases and quantitative traits, adding to our knowledge of explaining heritability of these phenotypes. Many statistical methods for rare variant analysis have been developed in recent years, but some of them require the strong assumption that all rare variants in the analysis share the same direction of effect, and others requiring permutation to calculate the p-values are computer intensive. Among these methods, the sequence kernel association test (SKAT) is a powerful method under many different scenarios. It does not require any assumption on the directionality of effects, and statistical significance is computed analytically. In this paper, we extend SKAT to be applicable to family data. The family-based SKAT (famSKAT) has a different test statistic and null distribution compared to SKAT, but is equivalent to SKAT when there is no familial correlation. Our simulation studies show that SKAT has inflated type I error if familial correlation is inappropriately ignored, but has appropriate type I error if applied to a single individual per family to obtain an unrelated subset. In the contrast, famSKAT has the correct type I error when analyzing correlated observations, and it has higher power than competing methods in many different scenarios. We illustrate our approach to analyze the association of rare genetic variants using glycemic traits from the Framingham Heart Study.

In recent years, with the advances in whole-genome sequencing technology, assessing the association of rare genetic variants with complex diseases and quantitative traits has become of great interest. Rare genetic variants may account for some of the missing heritability unexplained by genetic loci identified by genome-wide association studies (GWAS) [Eichler et al., 2010], as single variant tests used in GWAS are underpowered for rare genetic variants [Li and Leal, 2008]. To increase power, burden tests have been proposed [Li and Leal, 2008; Morgenthaler and Thilly, 2007; Madsen and Browning, 2009; Morris and Zeggini, 2010]; these tests are based on collapsing rare genetic variants in a predefined genomic region with either a rare variant indicator or a weighted score. These methods implicitly assume that all rare genetic variants in the region have the same direction of effect and even the same effect size, which may not be true. Alternatively, the data-adaptive sum test [Han and Pan, 2010] and step-up approach [Hoffmann, Marini and Witte, 2010] do not require such an assumption and use the signs from single marker test to determine the direction of effects, but both of these approaches require permutation to evaluate statistical significance.

The sequence kernel association test (SKAT) [Wu et al., 2011] proposed recently is a flexible and computationally efficient regression-based approach for rare genetic variants analysis. No assumptions about the directions of effect or the effect sizes of rare genetic variants in the region are required for SKAT. Instead of requiring permutation for the p-value computation, Davies’ method [Davies, 1980] is used to compute the p-values analytically for SKAT. SKAT has been shown to be much more powerful than traditional burden tests in many different scenarios. SKAT can be used in the association analysis of both dichotomous and continuous phenotypes.

Family-based study designs have been widely used in linkage analysis of diseases and quantitative traits [Falk and Rubinstein, 1987; Ott, 1989; Terwilliger and Ott, 1992; Spielman, McGinnis and Ewens, 1993]. In GWAS, ordinary regression approaches are not applicable to family data, because inflated type I error is observed when familial correlation is not appropriately modeled. For quantitative traits, instead of ordinary linear regressions, linear mixed effects models that take familial correlation as a random effect with covariance proportional to the kinship matrix is commonly used for single marker tests in GWAS [Almasy and Blangero, 1998; Rabinowitz and Laird, 2000]. However, burden tests and other methods for joint analysis of rare genetic variants in family samples have not been well established.

In this paper, we use the framework of linear mixed effects models to extend SKAT for rare genetic variants association analysis with quantitative traits in family data. The family-based SKAT (famSKAT) has a different form of test statistic and distribution under the null hypothesis, but has the same rationale as SKAT. When there is no familial correlation, famSKAT is equivalent to SKAT. P-values for famSKAT are also calculated analytically without requiring permutation.

We demonstrate in our simulation studies that SKAT has inflated type I error in family samples when familial correlation is not appropriately considered. By contrast, famSKAT does not suffer from this issue and has correct type I error. We also show that famSKAT is more powerful than applying SKAT to an unrelated subset of the sample. For mixed datasets with both unrelated and related individuals, as the proportion of unrelated individuals decreases, the difference in power between SKAT and famSKAT increases, with famSKAT being always the more powerful approach of the two. Thus, by using famSKAT there is no need to reduce sample size by selecting an unrelated subset of individuals. Finally, we illustrate our approach by assessing the association between rare genetic variants using two glycemic traits in the Framingham Heart Study.

We first define notation and assumptions before we derive the SKAT statistic that accounts for familial correlation. Assuming a sample size of *n*, let the *n* × 1 vector of the quantitative trait ** y** follow a linear mixed effects model

where ** X** is an

where ** W** is the pre-specified diagonal weight matrix for the rare variants of

Under these assumptions, the phenotypic variance can be written as

The log likelihood for the linear mixed effects model is

To derive a score test for *H*_{0}:τ = 0, we first take the derivative with respect to *τ* to get

If we use the restricted maximum likelihood instead of the maximum likelihood method, we would get a different first term, but the same second term. In both cases, if we replace ** Σ** by its consistent estimator, and treat genotype matrix

Under the null hypothesis *τ* = 0, we can estimate

by fitting the null linear mixed effects model

The maximum likelihood estimators can be obtained using the function lmekin from R package kinship. We replace ** β**, and (and hence

as the famSKAT test statistic. Under the null hypothesis, the variance of the residuals is

Thus

where *λ _{i}* are the eigenvalues of the matrix . The p-value can be computed analytically by Davies’ method [Davies, 1980] or Kuonen’s saddlepoint method [Kuonen, 1999].

We note that even though the null model, test statistic, residual variance and null distribution of famSKAT have different forms compared to those of SKAT, they are directly connected. Actually, if we add a restriction on the model, famSKAT is equivalent to SKAT. Then

where is estimated from the null linear model

and famSKAT statistic becomes

with distribution under the null hypothesis

where *λ _{i}* are the eigenvalues of the matrix . They are proportional to SKAT statistic and null distribution matrix with the coefficient . We favor this form of null distribution matrix, rather than the form proposed in Wu et al. [2011], because usually the sample size

We note that famSKAT can also be used when we want to provide a known heritability coefficient *h*^{2} externally, rather than estimating it from the data. By the reparametrization

when *h*^{2} is known, we can use the generalized least square method to estimate only *σ*^{2} under the null model. Then we can follow the rest of the famSKAT procedure to perform the test.

To evaluate the type I error, we performed several simulation studies under the null hypothesis of no genetic association. We compared four approaches: famSKAT, burden test accounting for familial correlation (famBT), SKAT which only takes the unrelated subset of the sample (unrSKAT) and SKAT. We used Kuonen’s saddlepoint method [Kuonen, 1999] to compute the p-values for famSKAT, unrSKAT and SKAT. For famBT, we fit the linear mixed effects model

where *w _{j}* is the

We set the heritability of the trait

For each parameter setting, we simulated 100 genotype datasets with a total sample size of 1000 and 20 single nucleotide polymorphisms (SNP) with minor allele frequency (MAF) in the founders randomly sampled from a uniform distribution of 0.005 to 0.05, and with low (*r* = 0.1), moderate (*r* = 0.5), or high (*r* = 0.7) linkage disequilibrium (LD) between adjacent SNPs in the founders. The LD correlation between farther SNPs decays as an autoregressive model with order 1. We simulated haplotypes for unrelated founders with desired MAF and LD structure using the same procedure as HapSim [Montana, 2005], then we passed down the haplotypes to the next generation to simulate sib pairs, and took the remaining founders as unrelated individuals. Thus we created genotype datasets mixed with unrelated individuals and sib pairs, and let the proportion of unrelated individuals decrease from 75% to 50%, 25%, 0%. For each genotype dataset, 10,000 phenotype datasets including covariates were simulated by using the model

where ** age** is a vector of continuous covariate generated from a normal distribution with mean 50 and standard deviation 5,

We calculated the p-values of famSKAT, famBT, unrSKAT and SKAT by using the Wu weights [Wu et al., 2011], corresponding to the square of a beta density function of the observed MAF in the founders with parameters 1 and 25. We computed the empirical type I error at *α* levels of 0.01, 0.001 and 0.0001 by counting the proportion of p-values less than or equal to the corresponding *α* level in the 1 million genotype-phenotype datasets.

To evaluate the power of famSKAT, famBT and unrSKAT, we set the heritability of phenotype *h*^{2} = 0.5 and LD between adjacent SNPs in the founders *r* = 0.5, and performed simulations under different scenarios. For each parameter setting, we simulated 100 genotype datasets with a total sample size 1000 and 20 SNPs with MAF in the founders randomly sampled from a uniform distribution of 0.005 to 0.05. Similar to the null simulation setting, we simulated genotype datasets mixed with unrelated individuals and sib pairs, and changed the proportion of unrelated individuals from 75% to 50%, 25%, 0%. For each genotype dataset ** G**, 10,000 phenotype datasets including covariates were simulated by using the model

where ** age, sex** and

where *MAF _{i}* is the MAF used to generate the genotype dataset for causal SNP

where *R*^{2}, the total proportion of variance explained by all causal SNPs, was fixed at 1% for scenarios when all causal SNPs had effects in the same direction, and 5% for scenarios when 50% of the causal SNPs had positive effects and 50% had negative effects. ** D** is the LD correlation matrix for the 20 SNPs, and

Table 1 shows the empirical type I errors of famSKAT, famBT, unrSKAT and SKAT at different *α* levels in 3 LD scenarios and 4 scenarios for the proportion of unrelated individuals. The results suggest that when SKAT is directly applied to the full sample with correlated individuals, it has inflated type I error at all *α* levels. The empirical type I error tends to be higher when LD decays. In the contrast, famSKAT, famBT and unrSKAT retain the correct type I errors. Thus, in subsequent power simulations we only investigated these three approaches. The distributions of the p-values from the four approaches for the scenario of LD between adjacent SNPs *r* = 0.5 and proportion of unrelated individuals 0% were shown in Figure 1. We found that famSKAT, famBT and unrSKAT all had uniform distribution of the p-values, while the distribution of the pvalues from SKAT was more likely to be small, explaining the inflated type I error.

Distribution of the p-values for famSKAT, famBT, unrSKAT and SKAT from the null simulation with LD between adjacent SNPs 0.5 and proportion of unrelated individuals 0%.

Power simulation results of famSKAT, famBT and unrSKAT are shown in Figure 2. In all scenarios, 20 SNPs were analyzed. We simulated scenarios in which the proportion of causal SNPs was 20%, 50% or 80%, with effects in the same or opposite directions. As the proportion of unrelated individuals decreases from 75% to 50%, 25% and 0%, the sample size for unrSKAT also decreases from 875 to 750, 625 and 500. As a result, the power of unrSKAT also drops. In contrast, the power of famSKAT and famBT remains almost constant, regardless of the proportion of unrelated individuals. FamBT has higher power than famSKAT when the proportion of causal SNPs is greater than or equal to 50% and all causal SNPs have the same direction of effects, but it has almost no power in scenarios when causal SNPs have opposite directions of effects. Generally, famSKAT performs well in all these scenarios, suggesting that famSKAT is an omnibus method which does not have compromised power for different mixtures of related and unrelated individuals.

We used genotype data from Framingham SNP Health Association Resource (SHARe) and phenotype data from the Framingham Heart Study to analyze the association with two glycemic traits: fasting glucose and log-transformed fasting insulin. We restricted our analyses to SNPs with MAF less than 5% within 100kb of 16 gene regions selected for the prior association with fasting glucose, and 2 genes reported to be associated with log-transformed fasting insulin [Dupuis et al., 2010]. We adjusted the fasting glucose analysis for age and sex, and logtransformed insulin was additionally adjusted for body mass index. We performed famSKAT and famBT for all individuals with both genotype and phenotype available, and performed SKAT for only a subset of unrelated individuals. For comparison purpose, we calculated the MAF using a subset of unrelated individuals and applied Wu weights to all three methods.

We investigated the association between fasting glucose and rare genetic variants in 16 gene regions previously shown to be associated in large scale GWAS [Dupuis et al., 2010]: *ADCY5, ADRA2A, C2CD4B, CRY2, DGKB-AGMO*, *FADS1, G6PC2, GCK, GCKR, GLIS3, MADD, MTNR1B*, *PROX1, SLC2A2, SLC30A8*, and *TCF7L2*. The results are shown in Table 2. After adjusting for multiple testing using a Bonferroni correction, we detected no association between fasting glucose and rare genetic variants in the selected gene regions at the family-wise *α* level of 0.05, for all three methods. *CRY2* reaches the nominal significance level with a p-value of 0.0381 using famSKAT and 0.0085 using famBT, and G6PC2 reaches the nominal significance level with a p-value of 0.0418 using famSKAT, but none of these gene regions reaches nominal statistical significance when evaluated using unrSKAT.

We also investigated the association between log-transformed fasting insulin and rare genetic variants in 2 gene regions previously shown to be associated in large scale GWAS [Dupuis et al., 2010]: *GCKR* and *IGF1*. After adjusting for multiple testing using a Bonferroni correction, *IGF1* shows association with log transformed fasting insulin with a nominal p-value of 0.0232 using famSKAT and 0.0234 using famBT. Neither gene reaches even the nominal significance level using SKAT.

Table 2 also shows that the sample size for unrSKAT is much smaller than that for famSKAT and famBT, because even though the Framingham Heart Study is not a family-based cohort, there are many families in the study. Thus, by selecting unrelated individuals we greatly reduced the sample size. Because some SNPs with rare minor alleles may not be polymorphic in the subset of unrelated individuals, for some gene regions the number of SNPs for unrSKAT is smaller than the number of SNPs for famSKAT and famBT.

We also performed a genome-wide sliding window analysis on these two traits, as well as logtransformed HOMA-IR and HOMA-B, using SHARe genotype data. We only included SNPs with MAF less than 5% and ran the analysis using a sliding window of 500kb, with 250kb overlap each with previous and subsequent windows. We removed windows with 0 or 1 SNP, resulting in 10,546 windows for all autosomes with the number of SNPs ranging from 2 to 76 with median 18. No window reached the genome-wide significance using famSKAT, famBT or unrSKAT. The Q-Q plots for famSKAT are shown in Figure 3. There is minimal inflation of the p-values from this genome-wide analysis.

The computation time of famSKAT depends on both the sample size and the number of SNPs. The empirical run time of famSKAT, famBT and SKAT in analyzing sib pairs with indicated total sample sizes on a single computing node with 2.33 GHz CPU and 4 GB memory is shown in Figure 4. With a small sample size, the limiting step in famSKAT is fitting the null linear mixed effects model, so the computation time is comparable to that of famBT, which also requires fitting a linear mixed effects model. As the sample size increases, all three methods require more computation time, and the time of famSKAT and SKAT increases dramatically. Both famSKAT and SKAT require matrix calculation, and the limiting step in famSKAT becomes inverting the matrix , which takes about 90% of the computation time when the sample size is 5000. The genome-wide sliding window analysis of SHARe genotype data using a sliding window of 500kb takes about 5 hours for chromosome 1 on a single computing node with 2.33 GHz CPU and 4 GB memory.

In this paper, we propose famSKAT as an extension of SKAT which can be applied to data with familial correlation. We demonstrate that famSKAT is a general and flexible variance component score test approach, which is equivalent to SKAT when the familial variance component is set to 0. It can be applied to quantitative traits with unknown or known heritability.

Compared with famBT, famSKAT is advantageous in power when the proportion of causal SNPs in a genomic region is small, and when not all causal SNPs have the same direction of effects. As expected, famBT outperforms famSKAT when the proportion of causal SNPs is greater than or equal to 50% and all these SNPs have positive effects, but the performance of famSKAT in these scenarios is still satisfactory. In real data analysis, when we do not have sufficient a priori information about the proportion of causal SNPs or the directions of effects, famSKAT would be a better choice over famBT.

We show that when SKAT is inappropriately applied to correlated data, it has inflated type I error. Thus, the best we can do for SKAT is to select unrelated individuals from the whole sample. However, our power simulations demonstrate that this strategy is not in favor of power in many scenarios. In the contrast, we do not need to reduce our sample size if we use famSKAT. Our real data example from the Framingham Heart Study also shows that SKAT does not even have an observation which reaches the nominal significance level of 0.05.

Common genetic variants at 16 gene regions we chose for fasting glucose and 2 gene regions we chose for log-transformed fasting insulin have been shown to be associated with either trait in large GWAS [Dupuis et al., 2010]. However, we do not have solid evidence to show that there is strong association between either trait and the rare genetic variants in these regions. We noticed that the sample size in this analysis was far smaller than in Dupuis et al. [2010], which reduced the power. In addition, the SHARe project was not specifically designed for rare variants analysis, so most SNPs in our genotype dataset are common SNPs which were excluded from the analysis. With the progress of sequencing studies, we should be able to identify much more rare variants and perform the candidate gene or even genome-wide analysis again using the new genotype dataset with dense rare genetic variants. On the other hand, some gene regions may be truly associated with the trait only through common SNPs, so we do not expect to identify the association with rare genetic variants for all these gene regions we selected.

With the development in sequencing technology and the lowering of the cost, sequencing data which contain a lot of rare genetic variants have become available, not only for case-control studies, but also for cohorts that include family members. Based on SKAT, one of the most powerful rare genetic variants analysis methods to date, we developed famSKAT in the hope of facilitating rare genetic variants analysis to identify novel genes associated with quantitative traits. With famSKAT, cohorts with family data can perform the association analysis with rare genetic variants, using as much data as possible, without having to select unrelated individuals from the pedigree. FamSKAT has been implemented in R, and source code is available at http://www.bumc.bu.edu/linga/research/publications/famskat/.

For calculating the p-values, we recommend using Kuonen’s saddlepoint method [Kuonen, 1999] instead of Davies’ method [Davies, 1980]. As a method based on numerical integration, Davies’ method requires specifying the accuracy. When the p-value is expected to be very small, Davies’ method cannot calculate it accurately. Table 3 shows this numerical issue in a power simulation context. Davies’ method suffers from negative and zero p-values (and possibly significant roundoff error) regardless of the accuracy specified. In the contrast, Kuonen’s method does not have such issues. Thus, if we perform a genome-wide rare variants analysis using sequence data, from which we expect extreme low p-values, Kuonen’s method might be a better choice over Davies’ method.

Even though famSKAT was developed for analyzing rare genetic variants, it can also be used for common variant analysis, combined common and rare variant analysis or conditional association analyses. Depending on the research hypothesis, common variants can be treated as fixed effects in the model, or random effects along with the rare genetic variants. During the review of this paper, we became aware that Schifano et al. [2012] had recently developed a SNP set association analysis approach for common variants analysis in family data, which is essentially equivalent to our method. The use of famSKAT combined with the collapsing of some very rare genetic variants such as singletons is also possible. Similar with SKAT, external weights based on annotation information or functional prediction can be incorporated to further boost the power.

The authors thank Dr. Thomas Lumley for his insights in Kuonen’s saddlepoint method. This research was partially supported by NIH awards R01 DK078616, U01 DK85526 and K24 DK080140. A portion of this research was conducted using the Linux Clusters for Genetic Analysis (LinGA) computing resources at Boston University Medicine Campus. The Framingham Heart Study is conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with Boston University (Contract No. N01-HC-25195). This work was partially supported by a contract with Affymetrix, Inc for genotyping services (Contract No. N02-HL-6-4278).

- Almasy L, Blangero J. Multipoint quantitative-trait linkage analysis in general pedigrees. Am J Hum Genet. 1998;62:1198–1211. [PubMed]
- Davies RB. The distribution of a linear combination of chi-square random variables. Journal of the Royal Statistical Society. Series C (Applied Statistics) 1980;29:323–333.
- Dupuis J, Langenberg C, Prokopenko I, et al. New genetic loci implicated in fasting glucose homeostasis and their impact on type 2 diabetes risk. Nat Genet. 2010;42:105–116. [PMC free article] [PubMed]
- Eichler EE, Flint J, Gibson G, et al. Missing heritability and strategies for finding the underlying causes of complex disease. Nat Rev Genet. 2010;11:446–450. [PMC free article] [PubMed]
- Falk CT, Rubinstein P. Haplotype relative risks: an easy reliable way to construct a proper control sample for risk calculations. Ann Hum Genet. 1987;51:227–233. [PubMed]
- Han F, Pan W. A data-adaptive sum test for disease association with multiple common or rare variants. Hum Hered. 2010;70:42–54. [PMC free article] [PubMed]
- Hoffmann TJ, Marini NJ, Witte JS. Comprehensive approach to analyzing rare genetic variants. PLoS One. 2010;5:e13584. [PMC free article] [PubMed]
- Kuonen D. Saddlepoint approximations for distributions of quadratic forms in normal variables. Biometrika. 1999;86:929–935.
- Kwee LC, Liu D, Lin X, Ghosh D, Epstein MP. A powerful and flexible multilocus association test for quantitative traits. Am J Hum Genet. 2008;82:386–397. [PubMed]
- Li B, Leal SM. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet. 2008;83:311–321. [PubMed]
- Lin X. Variance component testing in generalised linear models with random effects. Biometrika. 1997;84:309–326.
- Liu D, Lin X, Ghosh D. Semiparametric regression of multidimensional genetic pathway data: least-squares kernel machines and linear mixed models. Biometrics. 2007;63:1079–1088. [PMC free article] [PubMed]
- Madsen BE, Browning SR. A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet. 2009;5:e1000384. [PMC free article] [PubMed]
- Montana G. HapSim: a simulation tool for generating haplotype data with pre-specified allele frequencies and LD coefficients. Bioinformatics. 2005;21:4309–4311. [PubMed]
- Morgenthaler S, Thilly WG. A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: a cohort allelic sums test (CAST) Mutat Res. 2007;615:28–56. [PubMed]
- Morris AP, Zeggini E. An evaluation of statistical approaches to rare variant analysis in genetic association studies. Genet Epidemiol. 2010;34:188–193. [PMC free article] [PubMed]
- Ott J. Statistical properties of the haplotype relative risk. Genet Epidemiol. 1989;6:127–130. [PubMed]
- Rabinowitz D, Laird N. A unified approach to adjusting association tests for population admixture with arbitrary pedigree structure and arbitrary missing marker information. Hum Hered. 2000;50:211–223. [PubMed]
- Schifano ED, Epstein MP, Bielak LF, Jhun MA, Kardia SLR, Peyser PA, Lin X. SNP set association analysis for familial data. Genet Epidemiol. 2012;36:797–810.
- Spielman RS, McGinnis RE, Ewens WJ. Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM) Am J Hum Genet. 1993;52:506–516. [PubMed]
- Terwilliger JD, Ott J. A haplotype-based ‘haplotype relative risk’ approach to detecting allelic associations. Hum Hered. 1992;42:337–346. [PubMed]
- Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011;89:82–93. [PubMed]

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's Canada Institute for Scientific and Technical Information in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |