PLoS Genet. 2009 November; 5(11): e1000741.

Published online 2009 November 26. doi: 10.1371/journal.pgen.1000741

PMCID: PMC2777973

Sungho Won,^{1,2} Jemma B. Wilk,^{3} Rasika A. Mathias,^{4} Christopher J. O'Donnell,^{5,6} Edwin K. Silverman,^{7,8,9} Kathleen Barnes,^{10} George T. O'Connor,^{11} Scott T. Weiss,^{7,9,12} and Christoph Lange^{9,12,13,*}

Nicholas J. Schork, Editor

University of California San Diego and The Scripps Research Institute, United States of America

* E-mail: clange@hsph.harvard.edu

Conceived and designed the experiments: SW CL. Performed the experiments: SW. Analyzed the data: SW JBW RM CJO EKS KB GTO STW. Contributed reagents/materials/analysis tools: SW. Wrote the paper: SW CL.

Received 2009 May 18; Accepted 2009 October 26.

Copyright This is an open-access article distributed under the terms of the Creative Commons Public Domain declaration which stipulates that, once placed in the public domain, this work may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose.



For genome-wide association studies in family-based designs, we propose a new, universally applicable approach. The new test statistic exploits all available information about the association, while, by virtue of its design, it maintains the same robustness against population admixture as traditional family-based approaches that are based exclusively on the within-family information. The approach is suitable for the analysis of almost any trait type, e.g. binary, continuous, time-to-onset, multivariate, etc., and combinations of those. We use simulation studies to verify all theoretically derived properties of the approach, estimate its power, and compare it with other standard approaches. We illustrate the practical implications of the new analysis method by an application to a lung-function phenotype, forced expiratory volume in one second (FEV1) in 4 genome-wide association studies.

In genome-wide association studies, the multiple testing problem and confounding due to population stratification have been intractable issues. Family-based designs have considered only the transmission of genotypes from founders to nonfounders in order to avoid sensitivity to population stratification, which leads to a loss of information. Here we propose a novel analysis approach that combines mutually independent FBAT and screening statistics in a robust way. The proposed method is more powerful than any other, while it preserves the complete robustness of family-based association tests, which on their own achieve only much lower power levels. Furthermore, the proposed method is virtually as powerful as population-based approaches/designs, even in the absence of population stratification. By its nature, the proposed method is robust whenever FBAT is valid, and it achieves optimal efficiency if our linear model for the screening test reasonably explains the observed data in terms of covariance structure and population admixture. We illustrate the practical relevance of the approach by an application in 4 genome-wide association studies.

During the analysis phase of genome-wide association studies, one is confronted with numerous statistical challenges. One of them is the decision about the “right” balance between maximization of the statistical power and, at the same time, robustness against confounding. In family-based designs, the possible range of analysis options spans from a traditional family-based association analysis [1]–[4], e.g. TDT, PDT, FBAT, to the application of population-based analysis methods that have been adapted to family-data [1]–[3]. While, by definition, the first group of approaches is completely immune to population admixture and model misspecification of the phenotype, and can be applied to any phenotype that is permissible in the family-based association testing framework (FBAT [4]–[6]), the second category of approaches maximizes the statistical power by a population-based analysis. The phenotypes are modeled as a function of the genotype, and population-based methods such as genomic control [7],[8], STRUCTURE [9] and EIGENSTRAT [10], are applied to account for the effects of population admixture and stratification. Hybrid-approaches that combine elements of both population-based and family-based analysis methods, e.g. VanSteen algorithm [11] and Ionita weighting-schemes [12],[13] have been suggested to bridge between the 2 types of analysis strategies. Contrary to the other methods that combine family data and unrelated samples [14]–[17], such hybrid testing strategies maintain the 2 key features of the family-based association tests: The robustness against confounding due to population admixture and heterogeneity, and the analysis flexibility of the approach with respect to the choice of the target phenotype. Such 2-stage testing strategies utilize the information about the association at a population-level, the between-family component, to prioritize SNPs for the second step of the approach in which they are tested formally for association with a family-based test. 
The hybrid approaches can achieve power levels that are similar to approaches in which standard population-based methods are applied to family-data, but the optimal combination of the 2 sources of information (the between-family component and the within-family component) is not straightforward in the hybrid approaches.

In this communication, we propose a new family-based association test for genome-wide association studies that combines all sources of information about association, the between and the within-family information, into one single test statistic. The new test is robust against population-admixture even though both components, the between and the within-family components, are used to assess the evidence for association. The approach is applicable to all phenotypes or combinations of phenotypes that can be handled in the FBAT-approach, e.g. binary, continuous, time-to-onset, multivariate, etc. [4]–[6],[18]. While the correct model specification for the phenotypes will increase the power of the proposed test statistic, misspecification of the phenotypic model does not affect the validity of the approach. Using extensive simulation studies, we verify the theoretically derived properties of the test statistic, assess its power and compare it with other standard approaches. An application to the Framingham Heart Study (FHS) illustrates the value of the approach in practice. A new genetic locus for the lung-function phenotype, FEV1 (forced expiratory volume in the first second), is discovered and replicated in 3 independent, genome-wide association studies.

We assume that in a family-based association study, *n* family members have been genotyped at *m* loci with a genome-wide SNP-chip. For each marker locus, a family-based association test is constructed based on the offspring phenotype and the within-family information. The within-family information is defined as the difference between the observed, genetic marker score and the expected, genetic marker score, which is computed conditional upon both the parental genotypes/sufficient statistic [19] under the assumption of Mendelian transmissions. We denote the family-based association test for the *i*th marker locus by *FBAT _{i}*. Such an FBAT statistic can be the standard TDT, an FBAT for quantitative/qualitative traits, FBAT-GEE for multivariate traits, etc [4],[6],[18],[20],[21]. Similarly, for the

In order to construct a family-based association test that incorporates both the within and the between-family information, the Z-statistics that correspond to the p-values of *FBAT _{i}* and

where the parameters *w _{FBAT}* and

The “screening statistics” *T _{i}* are sorted based on their evidence for association so that

Furthermore, it is important to note that, instead of the Liptak-method, Fisher's method for combining p-values could have been used as well to construct an overall family-based association test which would have the same robustness properties as the overall-test based on the Liptak-method. However, simulation studies (data not shown) suggest that the highest power levels are consistently achieved with the Liptak method. We therefore omit the approach based on Fisher's method here.
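The combination step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the exact rank-to-p-value mapping and the weight parameterization are not recoverable from the text here, so the sketch assumes that a larger absolute screening statistic means stronger evidence, that the marker with rank *r* out of *m* receives the p-value (*r* − 0.5)/*m*, and that the two Z-statistics are combined with the standard weighted Liptak formula. The function names `rank_based_pvalues` and `liptak_combine` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def rank_based_pvalues(screen_stats):
    """Map screening statistics to p-values that depend only on their ranks.

    Assumption (not taken from the paper): larger |T_i| means stronger
    evidence, and rank r (1 = strongest) out of m maps to (r - 0.5) / m.
    """
    m = len(screen_stats)
    # Marker indices ordered from strongest to weakest evidence.
    order = sorted(range(m), key=lambda i: abs(screen_stats[i]), reverse=True)
    pvalues = [0.0] * m
    for rank, idx in enumerate(order, start=1):
        pvalues[idx] = (rank - 0.5) / m
    return pvalues


def liptak_combine(p_fbat, p_screen, w_fbat=1.0, w_screen=1.0):
    """Combine two independent one-sided p-values with weighted Liptak Z-scores."""
    # Convert each p-value to the Z-score with that upper-tail probability.
    z_fbat = -STD_NORMAL.inv_cdf(p_fbat)
    z_screen = -STD_NORMAL.inv_cdf(p_screen)
    # Normalizing by the root of the squared weights keeps Z standard normal
    # under the null when the two components are independent.
    z = (w_fbat * z_fbat + w_screen * z_screen) / sqrt(w_fbat**2 + w_screen**2)
    return STD_NORMAL.cdf(-z)


if __name__ == "__main__":
    # Hypothetical screening statistics and FBAT p-values for four markers.
    screening = [0.1, -3.0, 1.5, 0.7]
    fbat_p = [0.40, 0.002, 0.20, 0.65]
    for p_f, p_t in zip(fbat_p, rank_based_pvalues(screening)):
        print(round(liptak_combine(p_f, p_t), 6))
```

Because the rank-based p-values are a fixed set of values that does not depend on the scale of the screening statistics, inflation of those statistics by stratification changes only which marker receives which rank, not the distribution of the p-values themselves.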

In the first part of the simulation study, the type-1 error of the proposed family-based association test, denoted as LIP, was assessed in the absence and in the presence of population admixture, and we use the Wald test based on the conditional mean model [22] with between-family component for *pT _{i}* in all our simulations. For various scenarios, we verified that the proposed overall family-based association test maintains the

For simplicity, we assume in the simulation studies that the random samples are given, i.e. no ascertainment, and that the parental genotypes are known. Assuming Hardy-Weinberg equilibrium, the parental genotypes are generated by drawing from Bernoulli distributions defined by the allele frequencies. The offspring genotypes are obtained by simulated Mendelian transmissions from the parents to the offspring. For the *j*th trio, the offspring phenotype *Y _{j}* is simulated from a Normal distribution with mean

For scenarios in which population admixture is present, we assume that the admixture is created by the presence of 2 subpopulations whose phenotypic means differ by 0.2. The allele frequencies for each marker in the two subpopulations are generated by the Balding-Nichols model [25]. That is, for each marker, the allele frequency in an ancestral population is generated from a uniform distribution between 0.1 and 0.9, *U*(0.1, 0.9). Then, the marker allele frequencies for the two subpopulations are independently sampled from the beta distributions (*p*(1−*F _{ST}*)/
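As a rough illustration of this data-generating scheme, the following Python sketch draws subpopulation allele frequencies under the usual form of the Balding-Nichols model, Beta(*p*(1−*F _{ST}*)/*F _{ST}*, (1−*p*)(1−*F _{ST}*)/*F _{ST}*), and then simulates one trio by Mendelian transmission. It is a sketch under stated assumptions rather than the simulation code used in the study: the function names, the genotype coding as pairs of 0/1 alleles, and the seed are all hypothetical, and phenotype generation is left out.

```python
import random


def balding_nichols_frequency(ancestral_freq, fst, rng):
    """Draw one subpopulation allele frequency from the Balding-Nichols beta."""
    scale = (1.0 - fst) / fst
    return rng.betavariate(ancestral_freq * scale, (1.0 - ancestral_freq) * scale)


def parent_genotype(allele_freq, rng):
    """Two independent Bernoulli draws, i.e. Hardy-Weinberg equilibrium."""
    return (int(rng.random() < allele_freq), int(rng.random() < allele_freq))


def transmit(parent, rng):
    """Mendelian transmission: each parental allele is passed on with prob. 1/2."""
    return parent[rng.randrange(2)]


def simulate_trio(allele_freq, rng):
    """Return (father, mother, child) genotypes as pairs of 0/1 alleles."""
    father = parent_genotype(allele_freq, rng)
    mother = parent_genotype(allele_freq, rng)
    child = (transmit(father, rng), transmit(mother, rng))
    return father, mother, child


if __name__ == "__main__":
    rng = random.Random(2009)  # arbitrary seed for reproducibility
    ancestral = rng.uniform(0.1, 0.9)  # ancestral frequency from U(0.1, 0.9)
    # Frequencies for the two subpopulations, sampled independently.
    freqs = [balding_nichols_frequency(ancestral, 0.05, rng) for _ in range(2)]
    for f in freqs:
        print(round(f, 3), simulate_trio(f, rng))
```

Larger values of *F _{ST}* shrink both beta parameters, so the two subpopulation frequencies spread further from the ancestral frequency and the resulting admixture is stronger.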

In the absence and presence of the population stratification (*F _{ST}*=0.05, 0.1, 0.2, and 0.3), Table 1 shows the empirical type-1 error rates of the overall association test statistic

In the next set of simulation studies, we assess the effects of the local population stratification on the overall family-based association test. We generate local population stratification under the following assumptions: there are two subpopulations, *G*_{1} and *G*_{2} which distinguish themselves from each other in 2 marker regions. We assume that a subject can be from all possible 4 combinations at the 2 particular regions, e.g. (*G*_{1}, *G*_{1}), (*G*_{1}, *G*_{2}), (*G*_{2}, *G*_{1}) and (*G*_{2}, *G*_{2}). Both regions consist of 10K SNPs and 90K SNPs respectively and if subjects are from the same subpopulation in each genetic region, their assumed allele frequencies of the markers in the corresponding region are equal. For example, the allele frequencies of each marker in the marker region 1 are the same for samples in (*G*_{1}, *G*_{1}) and (*G*_{1}, *G*_{2}), but they are different for (*G*_{1}, *G*_{1}) and (*G*_{2}, *G*_{2}). In the simulation study, we generate the parental genotypes based on these allele frequency assumptions and obtain the offspring genotypes based on simulated Mendelian transmissions. Using the Balding-Nichols model we considered *F _{ST}*'s of 0.001, 0.005, 0.01 and 0.05 in the simulation studies. The offspring's phenotype was generated under the null hypothesis, but we assumed that each sub-population strata had a different phenotypic mean: 0 for (

For the analysis of quantitative traits, Table 4 provides the empirical power for 500K GWAS from 2000 replicates when there is no population stratification. Under the assumption of an additive disease model for a quantitative trait, the genetic effect, *a*, is given as a function of the heritability, *h*^{2}, the minor allele frequency *p _{D}*

For the assessment of the severity of pulmonary diseases, the volume of air that a subject can blow out within one second after taking a deep breath is an important endo-phenotype. It is referred to as the forced expiratory volume in one second (FEV1). FEV1 is an important measure of lung function and we apply the proposed method to a family-based GWAS of FEV1. The proposed method is applied to the 550K GWAS Framingham Heart Study (FHS) data set for FEV1, and then we check whether the selected SNPs are replicated in the British 1958 Birth Cohort (BBC), another population sample, as well as in two samples of asthmatics: the Childhood Asthma Management Program (CAMP) [30] and an Afro-Caribbean group of families from Barbados (ACG) [31]. In FHS, 9,274 subjects were genotyped and 10,816 subjects had at least one FEV1 measurement. Of the 8,637 participants with genotyping and FEV1 measures, only those with a call rate of 97% or higher were included. We adjusted for the covariates age, sex and the quadratic term of height, which are known to be associated with FEV1. For the within-family component, the FBAT statistic for quantitative traits was applied. Markers were excluded from the analysis if the number of informative families was less than 20, or the minor allele frequency was less than 0.05. In total, 306,264 SNPs were used for analysis; based on this number of SNPs, the rank-based empirical p-values, *pT _{i}*, were computed and the genome-wide significance level was obtained with Bonferroni correction. When we let

Table 6 shows the p-values for the top 10 SNPs from the proposed method. In our analysis, the genome-wide significance level at 0.05 is 1.636×10^{−7} and our results show that only the first ranked SNP, rs805294, is significant at the genome-wide level 0.2 with Bonferroni correction. For rs805294, we also checked the significance in other data sets, BBC, CAMP [30] and ACG [31]. In CAMP, 1215 subjects in 422 families were genotyped and there are 488 informative trios for rs805294, while in ACG there were only 33 informative trios (Table 7). In the BBC, 1372 unrelated subjects were genotyped with the Affymetrix chip and 1323 unrelated subjects were genotyped with the Illumina chip. In CAMP and ACG, we adjusted for age, sex and the quadratic term of height, and in the BBC we adjusted for age, sex, height, recent chest infection and nurse. Table 7 also shows that rs805294 is significant and that the effect directions are the same in the considered studies except for the ACG sample. In particular, in the ACG study, the MAF of the SNP is different from other studies, which indicates a different local LD structure; the ACG sample is from an Afro-Caribbean population, contrary to the other studies which only include Caucasian study subjects. In addition, the ACG sample lacks statistical power for this particular SNP, i.e. there are only 33 informative trios in this sample. Thus, the inconsistent finding in the ACG study could be attributable to genetic heterogeneity, i.e. different local LD structure/flip-flop phenomena [32], or insufficient statistical power. For meta analysis, the sample sizes are used as weights for Liptak's method and we use 131351=FHSBBCCAMPACG as weights because the between-family information is used only for FHS. 
If the p-value from the Illumina gene chip in BBC and the p-values from FHS, CAMP and ACG are combined, then the p-values by Liptak's method using the proposed weights and by Fisher's method are 1.534×10^{−8} and 1.081×10^{−7} respectively, and they become 4.625×10^{−9} and 3.554×10^{−8} if the p-values from one-tailed tests in the same direction as FHS are used for BBC, CAMP and ACG. If the p-value from the Affymetrix gene chip in BBC is combined with the other studies, then they are 3.787×10^{−8} (Liptak's method) and 1.890×10^{−7} (Fisher's method) for two-tailed tests, and 1.098×10^{−8} (Liptak's method) and 6.236×10^{−8} (Fisher's method) for one-tailed tests. As a result we can conclude that rs805294 is significantly associated with FEV1 at a genome-wide scale, and the gene associated with rs805294, LY6G6C, will be investigated in further studies.
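The two meta-analysis rules compared above can be written down in a few lines of Python. This is only a sketch with made-up inputs: the study p-values, the weights, and the function names are hypothetical, and it does not attempt to reproduce the figures reported in the text. It uses the standard definitions, namely a weighted sum of Z-scores for Liptak's method and −2Σln(*p*) referred to a chi-square distribution with 2*k* degrees of freedom for Fisher's method, plus the usual conversion of a two-tailed p-value to a one-tailed one given the effect direction.

```python
from math import exp, log, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def one_sided_from_two_sided(p_two_sided, same_direction):
    """One-tailed p-value for a pre-specified direction of effect."""
    half = p_two_sided / 2.0
    return half if same_direction else 1.0 - half


def liptak_meta(pvalues, weights):
    """Weighted Liptak combination of independent one-sided p-values."""
    z = sum(w * -STD_NORMAL.inv_cdf(p) for p, w in zip(pvalues, weights))
    z /= sqrt(sum(w * w for w in weights))
    return STD_NORMAL.cdf(-z)


def fisher_meta(pvalues):
    """Fisher combination: X = -2 * sum(ln p) is chi-square with 2k df."""
    k = len(pvalues)
    half_x = -sum(log(p) for p in pvalues)  # equals X / 2
    # For even degrees of freedom 2k the chi-square upper tail is
    # exp(-X/2) * sum_{j=0}^{k-1} (X/2)^j / j!
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= half_x / j
        total += term
    return exp(-half_x) * total


if __name__ == "__main__":
    # Hypothetical two-tailed p-values for one discovery and three replication
    # studies, with flags for agreement with the discovery direction.
    discovery_p = 1e-6
    replication = [(0.01, True), (0.03, True), (0.40, False)]
    pvals = [discovery_p] + [one_sided_from_two_sided(p, d) for p, d in replication]
    weights = [4.0, 3.0, 2.0, 1.0]  # hypothetical sample-size based weights
    print(liptak_meta(pvals, weights), fisher_meta(pvals))
```

One design difference is visible in the signatures: Liptak's method accepts study weights, so larger studies can count for more, whereas Fisher's method treats all studies equally.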

Genome-wide association studies have become one of the most important tools for the identification of new disease loci in the human genome. However, even though advances in genotyping technology have enabled a new generation of genetic association studies that provide robust and replicable findings, population stratification/genetic heterogeneity and the multiple testing problems continue to be the major issues in the statistical analysis that have to be resolved in each study. While family-based association tests provide analysis results that are completely robust against confounding due to population-substructures, the analysis approach is not optimal in terms of statistical power. Numerous approaches have been suggested to minimize this disadvantage of family-based association tests but the previous approaches had to compromise either in terms of robustness or in terms of efficiency.

In this communication, we develop an approach that efficiently utilizes all available data, while maintaining complete robustness against confounding due to population substructure. The proposed method combines the p-values of the family-based tests (the within-component) with the rank-based p-values for population-based analysis (the between component) to achieve optimal power levels. The use of rank-based p-values for the population-based component is similar in spirit to the genomic control approach. In principle, genomic control rescales the variance that is inflated by population stratification, under the assumption of a constant *F _{ST}*. The rank-based p-value directly rescales the statistics based on their ranks, which always generates uniformly distributed p-values and provides validity even for varying

Although our simulations are limited to independent unascertained samples and quantitative traits, the proposed work can be easily extended to ascertained samples, large pedigrees, or different trait types, etc. By replacing the parental genotypes with the sufficient statistics of Rabinowitz & Laird [19], the FBAT-statistic and the screening-statistic can be adapted straightforwardly to designs with extended pedigrees [23]. Similarly, parental phenotypes can be incorporated into the conditional mean model [23] or its non-parametric extensions [33] as additional outcome variables. The optimal weights can vary between the different scenarios and further theoretical investigation is currently ongoing, but limited initial simulation studies suggest that equal weights, while not always the most powerful choice in such situations, will always result in a more powerful analysis than currently used methods.

The validity of the proposed method.

(0.04 MB DOC)


Framingham Heart Study genotype and phenotype data are publicly available through the NHLBI's SNP Health Association Resource (SHARe) initiative (http://public.nhlbi.nih.gov/GeneticsGenomics/home/share.aspx). We acknowledge the CAMP investigators and research team for collection of CAMP Genetic Ancillary Study data and use of genotype data from the British 1958 Birth Cohort DNA collection. We further acknowledge the families in Barbados for their generous participation in this study. We are grateful to Drs. Raana Naidu, Paul Levett, Malcolm Howitt and Pissamai Maul, Trevor Maul, and Bernadette Gray for their contributions in the field; Dr. Malcolm Howitt and the Polyclinic and A&E Department physicians in Barbados for their efforts and their continued support; as well as Drs. Henry Fraser and Anselm Hennis at the Chronic Disease Research Centre.

The authors have declared that no competing interests exist.

The CAMP Genetics Ancillary Study is supported by U01 HL075419, U01 HL65899, P01 HL083069, R01 HL086601, and T32 HL07427 from the National Heart, Lung, and Blood Institute, National Institutes of Health. CL is supported by the National Institutes of Health grant R01MH081862. Framingham Heart Study genotype and phenotype data are publicly available through the NHLBI's SNP Health Association Resource (SHARe) initiative (http://public.nhlbi.nih.gov/GeneticsGenomics/home/share.aspx). The British 1958 Birth Cohort DNA collection is funded by the Medical Research Council grant G00000934 and the Wellcome Trust grant 068545/Z/02. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

1. Aulchenko YS, de Koning DJ, Haley C. Genomewide rapid association using mixed model and regression: a fast and simple method for genomewide pedigree-based quantitative trait loci association analysis. Genetics. 2007;177:577–585. [PubMed]

2. Chen WM, Abecasis GR. Family-based association tests for genomewide association scans. Am J Hum Genet. 2007;81:913–926. [PubMed]

3. Elston RC, Gray-McGuire C. A review of the ‘Statistical Analysis for Genetic Epidemiology’ (S.A.G.E.) software package. Hum Genomics. 2004;1:456–459. [PMC free article] [PubMed]

4. Lange C, Blacker D, Laird NM. Family-based association tests for survival and times-to-onset analysis. Stat Med. 2004;23:179–189. [PubMed]

5. Laird NM, Horvath S, Xu X. Implementing a unified approach to family-based tests of association. Genet Epidemiol. 2000;19(Suppl 1):S36–42. [PubMed]

6. Lange C, Silverman EK, Xu X, Weiss ST, Laird NM. A multivariate family-based association test using generalized estimating equations: FBAT-GEE. Biostatistics. 2003;4:195–206. [PubMed]

7. Devlin B, Roeder K. Genomic control for association studies. Biometrics. 1999;55:997–1004. [PubMed]

8. Devlin B, Roeder K, Wasserman L. Genomic control, a new approach to genetic-based association studies. Theor Popul Biol. 2001;60:155–166. [PubMed]

9. Pritchard JK, Stephens M, Donnelly P. Inference of population structure using multilocus genotype data. Genetics. 2000;155:945–959. [PubMed]

10. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006;38:904–909. [PubMed]

11. Van Steen K, McQueen MB, Herbert A, Raby B, Lyon H, et al. Genomic screening and replication using the same data set in family-based association testing. Nat Genet. 2005;37:683–691. [PubMed]

12. Ionita-Laza I, McQueen MB, Laird NM, Lange C. Genomewide weighted hypothesis testing in family-based association studies, with an application to a 100K scan. Am J Hum Genet. 2007;81:607–614. [PubMed]

13. Murphy A, Weiss ST, Lange C. Screening and replication using the same data set: testing strategies for family-based studies in which all probands are affected. PLoS Genet. 2008;4:e1000197. doi: 10.1371/journal.pgen.1000197. [PMC free article] [PubMed]

14. Nagelkerke NJ, Hoebee B, Teunis P, Kimman TG. Combining the transmission disequilibrium test and case-control methodology using generalized logistic regression. Eur J Hum Genet. 2004;12:964–970. [PubMed]

15. Epstein MP, Veal CD, Trembath RC, Barker JN, Li C, et al. Genetic association analysis using data from triads and unrelated subjects. Am J Hum Genet. 2005;76:592–608. [PubMed]

16. Chen YH, Lin HW. Simple association analysis combining data from trios/sibships and unrelated controls. Genet Epidemiol. 2008;32:520–527. [PubMed]

17. Zhu X, Li S, Cooper RS, Elston RC. A unified association analysis approach for family and unrelated samples correcting for stratification. Am J Hum Genet. 2008;82:352–365. [PubMed]

18. Lange C, DeMeo DL, Laird NM. Power and design considerations for a general class of family-based association tests: quantitative traits. Am J Hum Genet. 2002;71:1330–1341. [PubMed]

19. Rabinowitz D, Laird N. A unified approach to adjusting association tests for population admixture with arbitrary pedigree structure and arbitrary missing marker information. Hum Hered. 2000;50:211–223. [PubMed]

20. Spielman RS, Ewens WJ. A sibship test for linkage in the presence of association: the sib transmission/disequilibrium test. Am J Hum Genet. 1998;62:450–458. [PubMed]

21. Lange C, Laird NM. On a general class of conditional tests for family-based association studies in genetics: the asymptotic distribution, the conditional power, and optimality considerations. Genet Epidemiol. 2002;23:165–180. [PubMed]

22. Lange C, Lyon H, DeMeo D, Raby B, Silverman EK, et al. A new powerful non-parametric two-stage approach for testing multiple phenotypes in family-based association studies. Hum Hered. 2003;56:10–17. [PubMed]

23. Lange C, DeMeo D, Silverman EK, Weiss ST, Laird NM. Using the noninformative families in family-based association tests: a powerful new testing strategy. Am J Hum Genet. 2003;73:801–811. [PubMed]

24. Liptak T. On the combination of independent tests. Magyar Tud Akad Mat Kutató Int Közl. 1958;3:171.

25. Balding DJ, Nichols RA. A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity. Genetica. 1995;96:10. [PubMed]

26. Cavalli-Sforza LL, Piazza A. Human genomic diversity in Europe: a summary of recent research and prospects for the future. Eur J Hum Genet. 1993;1:16. [PubMed]

27. Armitage P. Tests for linear trends in proportions and frequencies. Biometrics. 1955;11:375–386.

28. Abecasis GR, Cardon LR, Cookson WO. A general test of association for quantitative traits in nuclear families. Am J Hum Genet. 2000;66:279–292. [PubMed]

29. Diao G, Lin DY. Improving the power of association tests for quantitative traits in family studies. Genet Epidemiol. 2006;30:301–313. [PubMed]

30. The Childhood Asthma Management Program (CAMP): design, rationale, and methods. Childhood Asthma Management Program Research Group. Control Clin Trials. 1999;20:91–120. [PubMed]

31. Barnes KC, Neely JD, Duffy DL, Freidhoff LR, Breazeale DR, et al. Linkage of asthma and total serum IgE concentration to markers on chromosome 12q: evidence from Afro-Caribbean and Caucasian populations. Genomics. 1996;37:41–50. [PubMed]

32. Lin PI, Vance JM, Pericak-Vance MA, Martin ER. No gene is an island: the flip-flop phenomenon. Am J Hum Genet. 2007;80:531–538. [PubMed]

33. Jiang H, Harrington D, Raby BA, Bertram L, Blacker D, et al. Family-based association test for time-to-onset data with time-dependent differences between the hazard functions. Genet Epidemiol. 2006;30:124–132. [PubMed]
