For genome-wide association studies in family-based designs, we propose a powerful two-stage testing strategy that can be applied in situations in which parent-offspring trio data are available and all offspring are affected with the trait or disease under study. In the first step of the testing strategy, we construct estimators of genetic effect size in the completely ascertained sample of affected offspring and their parents that are statistically independent of the family-based association/transmission disequilibrium tests (FBATs/TDTs) that are calculated in the second step of the testing strategy. For each marker, the genetic effect is estimated (without requiring an estimate of the SNP allele frequency) and the conditional power of the corresponding FBAT/TDT is computed. Based on the power estimates, a weighted Bonferroni procedure assigns an individually adjusted significance level to each SNP. In the second stage, the SNPs are tested with the FBAT/TDT statistic at the individually adjusted significance levels. Using simulation studies for scenarios with up to 1,000,000 SNPs, varying allele frequencies and genetic effect sizes, the power of the strategy is compared with standard methodology (e.g., FBATs/TDTs with Bonferroni correction). In all considered situations, the proposed testing strategy demonstrates substantial power increases over the standard approach, even when the true genetic model is unknown and must be selected based on the conditional power estimates. The practical relevance of our methodology is illustrated by an application to a genome-wide association study for childhood asthma, in which we detect two markers meeting genome-wide significance that would not have been detected using standard methodology.
The current state of genotyping technology has enabled researchers to conduct genome-wide association studies of up to 1,000,000 SNPs, allowing for systematic scanning of the genome for variants that might influence the development and progression of complex diseases. One of the largest obstacles to the successful detection of such variants is the multiple comparisons/testing problem in the genetic association analysis. For family-based designs in which all offspring are affected with the disease/trait under study, we developed a methodology that addresses this problem by partitioning the family-based data into two statistically independent components. The first component is used to screen the data and determine the most promising SNPs. The second component is used to test the SNPs for association, where information from the screening is used to weight the SNPs during testing. This methodology is more powerful than standard procedures for multiple comparisons adjustment (i.e., Bonferroni correction). Additionally, as only one data set is required for screening and testing, our testing strategy is less susceptible to study heterogeneity. Finally, as many family-based studies collect data only from affected offspring, this method addresses a major limitation of previous methodologies for multiple comparisons in family-based designs, which require variation in the disease/trait among offspring.
When testing for genetic effects, failure to account for a gene-environment interaction can mask the true association effects of a genetic marker with disease. Family-based association tests are popular because they are completely robust to population substructure and model misspecification. However, when testing for an interaction, failure to model the main genetic effect correctly can lead to spurious results. Here we propose a family-based test for interaction that is robust to model misspecification, but still sensitive to an interaction effect, and can handle continuous covariates and missing parents. We extend the FBAT-I gene-environment interaction test for dichotomous traits to using both trios and sibships. We then compare this extension to joint tests of gene and gene-environment interaction, and compare the joint test additionally to the main effects test of the gene. Lastly we apply these three tests to a group of nuclear families ascertained according to affection with Bipolar Disorder.
genetic association; genetic interaction; family-based test; FBAT-I
For genome-wide association studies in family-based designs, we propose a new, universally applicable approach. The new test statistic exploits all available information about the association, while, by virtue of its design, it maintains the same robustness against population admixture as traditional family-based approaches that are based exclusively on the within-family information. The approach is suitable for the analysis of almost any trait type, e.g. binary, continuous, time-to-onset, multivariate, etc., and combinations of those. We use simulation studies to verify all theoretically derived properties of the approach, estimate its power, and compare it with other standard approaches. We illustrate the practical implications of the new analysis method by an application to a lung-function phenotype, forced expiratory volume in one second (FEV1) in 4 genome-wide association studies.
In genome-wide association studies, the multiple testing problem and confounding due to population stratification have been intractable issues. Family-based designs have considered only the transmission of genotypes from founder to nonfounder to prevent sensitivity to the population stratification, which leads to the loss of information. Here we propose a novel analysis approach that combines mutually independent FBAT and screening statistics in a robust way. The proposed method is more powerful than any other, while it preserves the complete robustness of family-based association tests, which only achieves much smaller power level. Furthermore, the proposed method is virtually as powerful as population-based approaches/designs, even in the absence of population stratification. By nature of the proposed method, it is always robust as long as FBAT is valid, and the proposed method achieves the optimal efficiency if our linear model for screening test reasonably explains the observed data in terms of covariance structure and population admixture. We illustrate the practical relevance of the approach by an application in 4 genome-wide association studies.
For genome-wide association studies in family-based designs, a new, universally applicable approach is proposed. Using a modified Liptak’s method, we combine the p-value of the family-based association test (FBAT) statistic with the p-value for the Van Steen-statistic. The Van Steen-statistic is independent of the FBAT-statistic and utilizes information that is ignored by traditional FBAT-approaches. The new test statistic takes advantages of all available information about the genetic association, while, by virtue of its design, it achieves complete robustness against confounding due to population stratification. The approach is suitable for the analysis of almost any trait type for which FBATs are available, e.g. binary, continuous, time to-onset, multivariate, etc. The efficiency and the validity of the new approach depend on the specification of a nuisance/tuning parameter and the weight parameters in the modified Liptak’s method. For different trait types and ascertainment conditions, we discuss general guidelines for the optimal specification of the tuning parameter and the weight parameters. Our simulation experiments and an application to an Alzheimer study show the validity and the efficiency of the new method, which achieves power levels that are comparable to those of population-based approaches.
FBAT; Liptak’s method; Tuning parameter
Several family-based approaches for testing genetic association with traits obtained from longitudinal or repeated measurement studies have been previously proposed. These approaches utilize the multivariate data more efficiently by using estimated optimal weights to combine univariate tests. We show that these FBAT approaches are still robust against hidden population stratification, but their power can be heavily affected since the estimated weights might provide poor approximation of the true theoretical optimal weights with the presence of population stratification. We introduce a permutation-based approach FBAT-MinP and an equal combination approach FBAT-EW, both of which do not involve the use of estimated weights. Through simulation studies, FBAT-MinP and FBAT-EW are shown to be powerful even in the presence of population stratification, when other approaches may substantially lose their power. An application of these approaches to the Childhood Asthma Management Program (CAMP) study data for testing an association between body mass index and a previously reported candidate SNP is given as an example.
The case/pseudocontrol method provides a convenient framework for family-based association analysis of case-parent trios, incorporating several previously proposed methods such as the transmission/disequilibrium test and log-linear modelling of parent-of-origin effects. The method allows genotype and haplotype analysis at an arbitrary number of linked and unlinked multiallelic loci, as well as modelling of more complex effects such as epistasis, parent-of-origin effects, maternal genotype and mother-child interaction effects, and gene-environment interactions. Here we extend the method for analysis of quantitative as opposed to dichotomous (e.g. disease) traits. The resulting method can be thought of as a retrospective approach, modelling genotype given trait value, in contrast to prospective approaches that model trait given genotype. Through simulations and analytical derivations, we examine the power and properties of our proposed approach, and compare it to several previously proposed single-locus methods for quantitative trait association analysis. We investigate the performance of the different methods when extended to allow analysis of haplotype, maternal genotype and parent-of-origin effects. With randomly ascertained families, with or without population stratification, the prospective approach (modeling trait value given genotype) is found to be generally most effective, although the retrospective approach has some advantages with regard to estimation and interpretability of parameter estimates when applied to selected samples. Genet. Epidemiol. 31:833, 2007. © 2007 Wiley-Liss, Inc.
family-based; regression; imprinting; TDT
For many complex diseases, quantitative traits contain more information than dichotomous traits. One of the approaches used to analyse these traits in family-based association studies is the quantitative transmission disequilibrium test (QTDT). The QTDT is a regression-based approach that models simultaneously linkage and association. It splits up the association effect in a between- and a within-family genetic component to adjust and test for population stratification and includes a variance components method to model linkage. We extend this approach to detect gene–gene interactions between two unlinked QTLs by adjusting the definition of the between- and within-family component and the variance components included in the model. We simulate data to investigate the influence of the epistasis model, linkage disequilibrium patterns between the markers and the QTLs, and allele frequencies on the power and type I error rates of the approach. Results show that for some of the investigated settings, power gains are obtained in comparison with FAM-MDR. We conclude that our approach shows promising results for candidate-gene studies where too few markers are available to correct for population stratification using standard methods (for example EIGENSTRAT). The proposed method is applied to real-life data on hypertension from the FLEMENGHO study.
QTDT; epistasis; association; linkage
We propose a new approach for the analysis of copy number variants (CNVs)for genome-wide association studies in family-based designs. Our new overall association test combines the between-family component and the within-family component of the data so that the new test statistic is fully efficient and, at the same time, achieves the complete robustness against population-admixture and stratification, as classical family-based association tests that are based only on the between-family component. Although all data are incorporated into the test statistic, an adjustment for genetic confounding is not needed, not even for the between-family component. The new test statistic is valid for testing either quantitative or dichotomous phenotypes. If external CNV data are available, the approach can also be used in completely ascertained samples. Similar to the approach by Ionita-Laza et al.(1), the proposed test statistic does not required a CNV-calling algorithm and is based directly on the CNV probe intensity data. We show, via simulation studies, that our methodology increases the power of the FBAT statistic to levels comparable to those of population-based designs. The advantages of the approach in practice are demonstrated by an application to a genome-wide association study for body mass index (BMI).
The availability of a large number of dense SNPs, high-throughput genotyping and computation methods promotes the application of family-based association tests. While most of the current family-based analyses focus only on individual traits, joint analyses of correlated traits can extract more information and potentially improve the statistical power. However, current TDT-based methods are low-powered. Here, we develop a method for tests of association for bivariate quantitative traits in families. In particular, we correct for population stratification by the use of an integration of principal component analysis and TDT. A score test statistic in the variance-components model is proposed. Extensive simulation studies indicate that the proposed method not only outperforms approaches limited to individual traits when pleiotropic effect is present, but also surpasses the power of two popular bivariate association tests termed FBAT-GEE and FBAT-PC, respectively, while correcting for population stratification. When applied to the GAW16 datasets, the proposed method successfully identifies at the genome-wide level the two SNPs that present pleiotropic effects to HDL and TG traits.
Family studies and heritability estimates provide evidence for a genetic contribution to variation in the human life span.
We conducted a genome wide association study (Affymetrix 100K SNP GeneChip) for longevity-related traits in a community-based sample. We report on 5 longevity and aging traits in up to 1345 Framingham Study participants from 330 families. Multivariable-adjusted residuals were computed using appropriate models (Cox proportional hazards, logistic, or linear regression) and the residuals from these models were used to test for association with qualifying SNPs (70, 987 autosomal SNPs with genotypic call rate ≥80%, minor allele frequency ≥10%, Hardy-Weinberg test p ≥ 0.001).
In family-based association test (FBAT) models, 8 SNPs in two regions approximately 500 kb apart on chromosome 1 (physical positions 73,091,610 and 73, 527,652) were associated with age at death (p-value < 10-5). The two sets of SNPs were in high linkage disequilibrium (minimum r2 = 0.58). The top 30 SNPs for generalized estimating equation (GEE) tests of association with age at death included rs10507486 (p = 0.0001) and rs4943794 (p = 0.0002), SNPs intronic to FOXO1A, a gene implicated in lifespan extension in animal models. FBAT models identified 7 SNPs and GEE models identified 9 SNPs associated with both age at death and morbidity-free survival at age 65 including rs2374983 near PON1. In the analysis of selected candidate genes, SNP associations (FBAT or GEE p-value < 0.01) were identified for age at death in or near the following genes: FOXO1A, GAPDH, KL, LEPR, PON1, PSEN1, SOD2, and WRN. Top ranked SNP associations in the GEE model for age at natural menopause included rs6910534 (p = 0.00003) near FOXO3a and rs3751591 (p = 0.00006) in CYP19A1. Results of all longevity phenotype-genotype associations for all autosomal SNPs are web posted at .
Longevity and aging traits are associated with SNPs on the Affymetrix 100K GeneChip. None of the associations achieved genome-wide significance. These data generate hypotheses and serve as a resource for replication as more genes and biologic pathways are proposed as contributing to longevity and healthy aging.
Missing data occur in genetic association studies for several reasons including missing family members and uncertain haplotype phase. Maximum likelihood is a commonly used approach to accommodate missing data, but it can be difficult to apply to family-based association studies, because of possible loss of robustness to confounding by population stratification. Here a novel likelihood for nuclear families is proposed, in which distinct sets of association parameters are used to model the parental genotypes and the offspring genotypes. This approach is robust to population structure when the data are complete, and has only minor loss of robustness when there are missing data. It also allows a novel conditioning step that gives valid analysis for multiple offspring in the presence of linkage. Unrelated subjects are included by regarding them as the children of two missing parents. Simulations and theory indicate similar operating characteristics to TRANSMIT, but with no bias with missing data in the presence of linkage. In comparison with FBAT and PCPH, the proposed model is slightly less robust to population structure but has greater power to detect strong effects. In comparison to APL and MITDT, the model is more robust to stratification and can accommodate sibships of any size. The methods are implemented for binary and continuous traits in software, UNPHASED, available from the author.
Conditional likelihood; Family-based association tests; Missing data; Population stratification; Transmission/disequilibrium test; Unphased genotype data
Family-based genetic association studies of related individuals provide opportunities to detect genetic variants that complement studies of unrelated individuals. Most statistical methods for family association studies for common variants are single-marker-based, which test one SNP a time. In this paper, we consider testing the effect of a SNP set, e.g., SNPs in a gene, in family studies, for both continuous and discrete traits. Specifically, we propose a Generalized Estimating Equations (GEE)-based kernel association test, a variance component-based testing method, to test for the association between a phenotype and multiple variants in a SNP set jointly using family samples. The proposed approach allows for both continuous and discrete traits, where the correlation among family members is taken into account through the use of an empirical covariance estimator. We derive the theoretical distribution of the proposed statistic under the null and develop analytical methods to calculate the p-values. We also propose an efficient resampling method for correcting for small sample size bias in family studies. The proposed method allows for easily incorporating covariates and SNP-SNP interactions. Simulation studies show that the proposed method properly controls for type-I error rates under both random and ascertained sampling schemes in family studies. We demonstrate through simulation studies that our approach has superior performance for association mapping compared to the single marker based minimum p-value GEE test for a SNP set effect over a range of scenarios. We illustrate the application of the proposed method using data from the Cleveland Family GWAS Study.
Family-based association; Generalized estimation equations; Kernel machine regression; Marginal models; Score test; Variance component
Osteoporosis is characterized by low bone mass and compromised bone structure, heritable traits that contribute to fracture risk. There have been no genome-wide association and linkage studies for these traits using high-density genotyping platforms.
We used the Affymetrix 100K SNP GeneChip marker set in the Framingham Heart Study (FHS) to examine genetic associations with ten primary quantitative traits: bone mineral density (BMD), calcaneal ultrasound, and geometric indices of the hip. To test associations with multivariable-adjusted residual trait values, we used additive generalized estimating equation (GEE) and family-based association tests (FBAT) models within each sex as well as sexes combined. We evaluated 70,987 autosomal SNPs with genotypic call rates ≥80%, HWE p ≥ 0.001, and MAF ≥10% in up to 1141 phenotyped individuals (495 men and 646 women, mean age 62.5 yrs). Variance component linkage analysis was performed using 11,200 markers.
Heritability estimates for all bone phenotypes were 30–66%. LOD scores ≥3.0 were found on chromosomes 15 (1.5 LOD confidence interval: 51,336,679–58,934,236 bp) and 22 (35,890,398–48,603,847 bp) for femoral shaft section modulus. The ten primary phenotypes had 12 associations with 100K SNPs in GEE models at p < 0.000001 and 2 associations in FBAT models at p < 0.000001. The 25 most significant p-values for GEE and FBAT were all less than 3.5 × 10-6 and 2.5 × 10-5, respectively. Of the 40 top SNPs with the greatest numbers of significantly associated BMD traits (including femoral neck, trochanter, and lumbar spine), one half to two-thirds were in or near genes that have not previously been studied for osteoporosis. Notably, pleiotropic associations between BMD and bone geometric traits were uncommon. Evidence for association (FBAT or GEE p < 0.05) was observed for several SNPs in candidate genes for osteoporosis, such as rs1801133 in MTHFR; rs1884052 and rs3778099 in ESR1; rs4988300 in LRP5; rs2189480 in VDR; rs2075555 in COLIA1; rs10519297 and rs2008691 in CYP19, as well as SNPs in PPARG (rs10510418 and rs2938392) and ANKH (rs2454873 and rs379016). All GEE, FBAT and linkage results are provided as an open-access results resource at .
The FHS 100K SNP project offers an unbiased genome-wide strategy to identify new candidate loci and to replicate previously suggested candidate genes for osteoporosis.
Recent advances in high-throughput sequencing technologies make it increasingly more efficient to sequence large cohorts for many complex traits. We discuss here a class of sequence-based association tests for family-based designs that corresponds naturally to previously proposed population-based tests, including the classical Burden and variance-component tests. This framework allows for a direct comparison between the powers of sequence-based association tests with family- vs population-based designs. We show that for dichotomous traits using family-based controls results in similar power levels as the population-based design (although at an increased sequencing cost for the family-based design), while for continuous traits (in random samples, no ascertainment) the population-based design can be substantially more powerful. A possible disadvantage of population-based designs is that they can lead to increased false-positive rates in the presence of population stratification, while the family-based designs are robust to population stratification. We show also an application to a small exome-sequencing family-based study on autism spectrum disorders. The tests are implemented in publicly available software.
family- and population-based association tests; sequence data; burden and variance-component tests
Many complex disease syndromes, such as asthma, consist of a large number of highly related, rather than independent, clinical or molecular phenotypes. This raises a new technical challenge in identifying genetic variations associated simultaneously with correlated traits. In this study, we propose a new statistical framework called graph-guided fused lasso (GFlasso) to directly and effectively incorporate the correlation structure of multiple quantitative traits such as clinical metrics and gene expressions in association analysis. Our approach represents correlation information explicitly among the quantitative traits as a quantitative trait network (QTN) and then leverages this network to encode structured regularization functions in a multivariate regression model over the genotypes and traits. The result is that the genetic markers that jointly influence subgroups of highly correlated traits can be detected jointly with high sensitivity and specificity. While most of the traditional methods examined each phenotype independently and combined the results afterwards, our approach analyzes all of the traits jointly in a single statistical framework. This allows our method to borrow information across correlated phenotypes to discover the genetic markers that perturb a subset of the correlated traits synergistically. Using simulated datasets based on the HapMap consortium and an asthma dataset, we compared the performance of our method with other methods based on single-marker analysis and regression-based methods that do not use any of the relational information in the traits. We found that our method showed an increased power in detecting causal variants affecting correlated traits. Our results showed that, when correlation patterns among traits in a QTN are considered explicitly and directly during a structured multivariate genome association analysis using our proposed methods, the power of detecting true causal SNPs with possibly pleiotropic effects increased significantly without compromising performance on non-pleiotropic SNPs.
An association study examines a phenotype against genotypic variations over a large set of individuals in order to find the genetic variant that gives rise to the variation in the phenotype. Many complex disease syndromes consist of a large number of highly related clinical phenotypes, and the patient cohorts are routinely surveyed with a large number of traits, such as hundreds of clinical phenotypes and genome-wide profiling of thousands of gene expressions, many of which are correlated. However, most of the conventional approaches for association mapping or eQTL analysis consider a single phenotype at a time instead of taking advantage of the relatedness of traits by analyzing them jointly. Assuming that a group of tightly correlated traits may share a common genetic basis, in this paper, we present a new framework for association analysis that searches for genetic variations influencing a group of correlated traits. We explicitly represent the correlation information in multiple quantitative traits as a quantitative trait network and directly incorporate this network information to scan the genome for association. Our results on simulated and asthma data show that our approach has a significant advantage in detecting associations when a genetic marker perturbs synergistically a group of traits.
Susceptibility to type 2 diabetes may be conferred by genetic variants having modest effects on risk. Genome-wide fixed marker arrays offer a novel approach to detect these variants.
We used the Affymetrix 100K SNP array in 1,087 Framingham Offspring Study family members to examine genetic associations with three diabetes-related quantitative glucose traits (fasting plasma glucose (FPG), hemoglobin A1c, 28-yr time-averaged FPG (tFPG)), three insulin traits (fasting insulin, HOMA-insulin resistance, and 0–120 min insulin sensitivity index); and with risk for diabetes. We used additive generalized estimating equations (GEE) and family-based association test (FBAT) models to test associations of SNP genotypes with sex-age-age2-adjusted residual trait values, and Cox survival models to test incident diabetes.
We found 415 SNPs associated (at p < 0.001) with at least one of the six quantitative traits in GEE, 242 in FBAT (18 overlapped with GEE for 639 non-overlapping SNPs), and 128 associated with incident diabetes (31 overlapped with the 639) giving 736 non-overlapping SNPs. Of these 736 SNPs, 439 were within 60 kb of a known gene. Additionally, 53 SNPs (of which 42 had r2 < 0.80 with each other) had p < 0.01 for incident diabetes AND (all 3 glucose traits OR all 3 insulin traits, OR 2 glucose traits and 2 insulin traits); of these, 36 overlapped with the 736 other SNPs. Of 100K SNPs, one (rs7100927) was in moderate LD (r2 = 0.50) with TCF7L2 (rs7903146), and was associated with risk of diabetes (Cox p-value 0.007, additive hazard ratio for diabetes = 1.56) and with tFPG (GEE p-value 0.03). There were no common (MAF > 1%) 100K SNPs in LD (r2 > 0.05) with ABCC8 A1369S (rs757110), KCNJ11 E23K (rs5219), or SNPs in CAPN10 or HNFa. PPARG P12A (rs1801282) was not significantly associated with diabetes or related traits.
Framingham 100K SNP data is a resource for association tests of known and novel genes with diabetes and related traits posted at . Framingham 100K data replicate the TCF7L2 association with diabetes.
Biological insights into group differences, such as disease status, have been achieved through differential co-expression analysis of microarray data. Additional understanding of group differences may be achieved by integrating the connectivity structure of the differential co-expression network and per-gene differential expression between phenotypic groups. Such a global differential co-expression network strategy may increase sensitivity to detect gene-gene interactions (or expression epistasis) that may act as candidates for rewiring susceptibility co-expression networks.
We test two methods for inferring Genetic Association Interaction Networks (GAIN) incorporating both differential co-expression effects and differential expression effects: a generalized linear model (GLM) regression method with interaction effects (reGAIN) and a Fisher test method for correlation differences (dcGAIN). We rank the importance of each gene with complete interaction network centrality (CINC), which integrates each gene’s differential co-expression effects in the GAIN model along with each gene’s individual differential expression measure. We compare these methods with statistical learning methods Relief-F, Random Forests and Lasso. We also develop a mixture model and permutation approach for determining significant importance score thresholds for network centralities, Relief-F and Random Forest. We introduce a novel simulation strategy that generates microarray case–control data with embedded differential co-expression networks and underlying correlation structure based on scale-free or Erdos-Renyi (ER) random networks.
Using the network simulation strategy, we find that Relief-F and reGAIN provide the best balance between detecting interactions and main effects, plus reGAIN has the ability to adjust for covariates and model quantitative traits. The dcGAIN approach performs best at finding differential co-expression effects by design but worst for main effects, and it does not adjust for covariates and is limited to dichotomous outcomes. When the underlying network is scale free instead of ER, all interaction network methods have greater power to find differential co-expression effects. We apply these methods to a public microarray study of the differential immune response to influenza vaccine, and we identify effects that suggest a role in influenza vaccine immune response for genes from the PI3K family, which includes genes with known immunodeficiency function, and KLRG1, which is a known marker of senescence.
Electronic supplementary material
The online version of this article (doi:10.1186/s13040-015-0040-x) contains supplementary material, which is available to authorized users.
Recent advances in high-throughput genotyping and transcript profiling technologies have enabled the inexpensive production of genome-wide dense marker maps in tandem with huge amounts of expression profiles. These large-scale data encompass valuable information about the genetic architecture of important phenotypic traits. Comprehensive models that combine molecular markers and gene transcript levels are increasingly advocated as an effective approach to dissecting the genetic architecture of complex phenotypic traits. The simultaneous utilization of marker and gene expression data to explain the variation in clinical quantitative trait, known as clinical quantitative trait locus (cQTL) mapping, poses challenges that are both conceptual and computational. Nonetheless, the hierarchical Bayesian (HB) modeling approach, in combination with modern computational tools such as Markov chain Monte Carlo (MCMC) simulation techniques, provides much versatility for cQTL analysis. Sillanpää and Noykova (2008) developed a HB model for single-trait cQTL analysis in inbred line cross-data using molecular markers, gene expressions, and marker-gene expression pairs. However, clinical traits generally relate to one another through environmental correlations and/or pleiotropy. A multi-trait approach can improve on the power to detect genetic effects and on their estimation precision. A multi-trait model also provides a framework for examining a number of biologically interesting hypotheses. In this paper we extend the HB cQTL model for inbred line crosses proposed by Sillanpää and Noykova to a multi-trait setting. We illustrate the implementation of our new model with simulated data, and evaluate the multi-trait model performance with regard to its single-trait counterpart. The data simulation process was based on the multi-trait cQTL model, assuming three traits with uncorrelated and correlated cQTL residuals, with the simulated data under uncorrelated cQTL residuals serving as our test set for comparing the performances of the multi-trait and single-trait models. The simulated data under correlated cQTL residuals were essentially used to assess how well our new model can estimate the cQTL residual covariance structure. The model fitting to the data was carried out by MCMC simulation through OpenBUGS. The multi-trait model outperformed its single-trait counterpart in identifying cQTLs, with a consistently lower false discovery rate. Moreover, the covariance matrix of cQTL residuals was typically estimated to an appreciable degree of precision under the multi-trait cQTL model, making our new model a promising approach to addressing a wide range of issues facing the analysis of correlated clinical traits.
Bayesian multilevel modeling; genetic architecture; linked marker-expression pairs; pleiotropy
Cohort studies typically sample unrelated individuals from a population, although family members of index cases may be also be recruited to investigate shared familial risk factors. Recruitment of family members may be incomplete or ancillary to the main cohort, resulting in a mixed sample of independent family units, including unrelated singletons and multiplex families. Multiple methods are available to perform genome wide association (GWA) analysis of binary or continuous traits in families, but it is unclear whether methods known to perform well on ascertained pedigrees, sib-ships, or trios are appropriate in analysis of a mixed unrelated cohort and family sample.
We present simulation studies based on Multi-Ethnic Study of Atherosclerosis (MESA) pedigree structures to compare the performance of several popular methods of GWA analysis for both quantitative and dichotomous traits in cohort studies. We evaluate approaches suitable for analysis of families, and combined the best performing methods with population-based samples either by meta-analysis, or by pooled analysis of family- and population-based samples (mega-analysis), comparing type 1 error and power. We further assess practical considerations, such as availability of software and ability to incorporate covariates in statistical modeling, and demonstrate our recommended approaches through quantitative and binary trait analysis of HDL cholesterol (HDL-C) in 2,553 MESA family- and population-based African-American samples. Our results suggest linear modeling approaches that accommodate family-induced phenotypic correlation (e.g., variance component model for quantitative traits or generalized estimating equations for dichotomous traits) perform best in the context of combined family- and population-based cohort GWAS.
genome-wide association study (GWAS); cohort study; simulation study; generalized estimating equations (GEE); variance component model; family-based association
We used the FBAT (family-based association test) software to test for association between 300 individual single-nucleotide polymorphisms and P1 (a latent trait of Kofendred Personality Disorder) in 100 simulated replicates of the Aipotu population. Using the Genetic Analysis Workshop 14 dataset, we calculated the power of FBAT to detect linkage disequilibrium on chromosome 3 (D2). Also, we calculated the false-positive rate on chromosome 1, which contains a true locus (D1) but no linkage disequilibrium was simulated between the trait and all the surrounding single-nucleotide polymorphisms.
We were able to detect the associations between phenotype P1 and three adjacent markers B03T3056 (average p-value = 0.0002), B03T3057 (average p-value = 0.00072), and B03T3058 (average p-value = 0.0038) with power of 98%, 87%, 71% on chromosome 3, respectively. The overall false positve rate to detect association was 0.06 on chromosome 1.
The power to detect a significant association in 100 nuclear families affected with the latent trait of Kofendred Personality Disorder by using FBAT was reasonable (based on 100 replicates). In the future, we will compare the performance of FBAT with alternative approaches, such as using FBAT-generalized estimating equations methods to test for association in families affected with complex traits.
Genome-wide association studies raise study-design and analytical issues that are still being debated. Among them, stands the issue of reducing the number of markers to be genotyped without loss of efficiency in identifying trait loci, which can reduce the cost of studies and minimize the multiple testing problem. With this aim, we proposed a two-step strategy based on two analytical methods suited to examine sets of markers rather than single markers: the local score, which screens the genome to select candidate regions in Step 1, and FBAT-LC, a multiple-marker family-based association test used to obtain significance levels of regions at step 2. The performance of this strategy was evaluated on all replicates of Genetic Analysis Workshop 15 Problem 3 simulated data, using the answers to that problem. Overall, seven of the nine generated trait loci were detected in at least 87% of the replicates using a framework designed to handle either association with the disease or association with the severity of disease. This multiple-marker strategy was compared to the single-marker approach. By considering regions instead of single markers, this strategy minimizes the multiple testing problem and the number of false-positive results.
Glomerular filtration rate (GFR) and urinary albumin excretion (UAE) are markers of kidney function that are known to be heritable. Many endocrine conditions have strong familial components. We tested for association between the Affymetrix GeneChip Human Mapping 100K single nucleotide polymorphism (SNP) set and measures of kidney function and endocrine traits.
Genotype information on the Affymetrix GeneChip Human Mapping 100K SNP set was available on 1345 participants. Serum creatinine and cystatin-C (cysC; n = 981) were measured at the seventh examination cycle (1998–2001); GFR (n = 1010) was estimated via the Modification of Diet in Renal Disease (MDRD) equation; UAE was measured on spot urine samples during the sixth examination cycle (1995–1998) and was indexed to urinary creatinine (n = 822). Thyroid stimulating hormone (TSH) was measured at the third and fourth examination cycles (1981–1984; 1984–1987) and mean value of the measurements were used (n = 810). Age-sex-adjusted and multivariable-adjusted residuals for these measurements were used in association with genotype data using generalized estimating equations (GEE) and family-based association tests (FBAT) models. We presented the results for association tests using additive allele model. We evaluated associations with 70,987 SNPs on autosomes with minor allele frequencies of at least 0.10, Hardy-Weinberg Equilibrium p-value ≥ 0.001, and call rates of at least 80%.
The top SNPs associated with these traits using the GEE method were rs2839235 with GFR (p-value 1.6*10-05), rs1158167 with cysC (p-value 8.5*10-09), rs1712790 with UAE (p-value 1.9*10-06), and rs6977660 with TSH (p-value 3.7*10-06), respectively. The top SNPs associated with these traits using the FBAT method were rs6434804 with GFR(p-value 2.4*10-5), rs563754 with cysC (p-value 4.7*10-5), rs1243400 with UAE (p-value 4.8*10-6), and rs4128956 with TSH (p-value 3.6*10-5), respectively. Detailed association test results can be found at . Four SNPs in or near the CST3 gene were highly associated with cysC levels (p-value 8.5*10-09 to 0.007).
Kidney function traits and TSH are associated with SNPs on the Affymetrix GeneChip Human Mapping 100K SNP set. These data will serve as a valuable resource for replication as more SNPs associated with kidney function and endocrine traits are identified.
Several family-based approaches have been previously proposed to enhance the power for testing genetic association when the traits are measured longitudinally or repeatedly. In this paper, we show that some of these FBAT approaches can be easily extended to accommodate incomplete data and remain unbiased tests. We also show that because of the nature of FBAT approaches, we can impute the missing phenotypes without biasing our tests and achieve higher power. We propose two imputation techniques based on E-M algorithm and the conditional mean model, respectively. Through simulation studies, these two imputation techniques are shown to have correct false positive rate and generally achieve higher power than complete case analysis or simple mean-imputation. Application of these approaches for testing an association between Body Mass Index and a previously reported candidate SNP confirms our results.
FBAT; Longitudinal Phenotype; Missing Data
The revolution in next-generation sequencing has made obtaining both common and rare high-quality sequence variants across the entire genome feasible. Because researchers are now faced with the analytical challenges of handling a massive amount of genetic variant information from sequencing studies, numerous methods have been developed to assess the impact of both common and rare variants on disease traits. In this report, whole genome sequencing data from Genetic Analysis Workshop 18 was used to compare the power of several methods, considering both family-based and population-based designs, to detect association with variants in the MAP4 gene region and on chromosome 3 with blood pressure. To prioritize variants across the genome for testing, variants were first functionally assessed using prediction algorithms and expression quantitative trait loci (eQTLs) data. Four set-based tests in the family-based association tests (FBAT) framework--FBAT-v, FBAT-lmm, FBAT-m, and FBAT-l--were used to analyze 20 pedigrees, and 2 variance component tests, sequence kernel association test (SKAT) and genome-wide complex trait analysis (GCTA), were used with 142 unrelated individuals in the sample. Both set-based and variance-component-based tests had high power and an adequate type I error rate. Of the various FBATs, FBAT-l demonstrated superior performance, indicating the potential for it to be used in rare-variant analysis. The updated FBAT package is available at: http://www.hsph.harvard.edu/fbat/.
Genome-wide association (GWA) studies that use population-based association approaches may identify spurious associations in the presence of population admixture. In this paper, we propose a novel three-stage approach that is computationally efficient and robust to population admixture and more powerful than the family-based association test (FBAT) for GWA studies with family data.
We propose a three-stage approach for GWA studies with family data. The first stage is to perform linear regression ignoring phenotypic correlations among family members. SNPs with a first stage p-value below a liberal cut-off (e.g. 0.1) are then analyzed in the second stage that employs a linear mixed effects (LME) model that accounts for within family correlations. Next, SNPs that reach genome-wide significance (e.g. 10-6 for 34,625 genotyped SNPs in this paper) are analyzed in the third stage using FBAT, with correction of multiple testing only for SNPs that enter the third stage. Simulations are performed to evaluate type I error and power of the proposed method compared to LME adjusting for 10 principal components (PC) of the genotype data. We also apply the three-stage approach to the GWA analyses of uric acid in Framingham Heart Study's SNP Health Association Resource (SHARe) project.
Our simulations show that whether or not population admixture is present, the three-stage approach has no inflated type I error. In terms of power, using LME adjusting PC is only slightly more powerful than the three-stage approach. When applied to the GWA analyses of uric acid in the SHARe project of FHS, the three-stage approach successfully identified and confirmed three SNPs previously reported as genome-wide significant signals.
For GWA analyses of quantitative traits with family data, our three-stage approach provides another appealing solution to population admixture, in addition to LME adjusting for genetic PC.