Four custom Axiom genotyping arrays were designed for a genome-wide association (GWA) study of 100,000 participants from the Kaiser Permanente Research Program on Genes, Environment and Health. The array optimized for individuals of European race/ethnicity was previously described. Here we detail the development of three additional microarrays optimized for individuals of East Asian, African American, and Latino race/ethnicity. For these arrays, we decreased redundancy of high-performing SNPs to increase SNP capacity. The East Asian array was designed using greedy pairwise SNP selection. However, removing SNPs from the target set based on imputation coverage is more efficient than pairwise tagging. Therefore, we developed a novel hybrid SNP selection method for the African American and Latino arrays utilizing rounds of greedy pairwise SNP selection, followed by removal from the target set of SNPs covered by imputation. The arrays provide excellent genome-wide coverage and are valuable additions for large-scale GWA studies.
Microarray; Genome-wide association study; Coverage; Imputation; Single nucleotide polymorphism; Throughput
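As a rough illustration of the greedy pairwise SNP selection mentioned in the abstract above, the following Python sketch repeatedly picks the candidate SNP that tags the most still-uncovered target SNPs. The function names, the r² threshold of 0.8, and the toy SNP IDs and LD values are all assumptions made for illustration; this is not the array design pipeline itself. The last helper mimics the imputation step by dropping targets whose (hypothetical, precomputed) imputation r² already meets the threshold.

```python
def covers(tag, snp, r2, threshold):
    """True if `tag` tags `snp`: same SNP, or pairwise r^2 at or above threshold."""
    if tag == snp:
        return True
    pair_r2 = max(r2.get((tag, snp), 0.0), r2.get((snp, tag), 0.0))
    return pair_r2 >= threshold


def greedy_pairwise_tagging(targets, candidates, r2, threshold=0.8):
    """Greedy set cover: returns (tags in pick order, targets left uncovered)."""
    uncovered = set(targets)
    chosen = []
    while uncovered:
        best, best_cov = None, set()
        for cand in candidates:
            if cand in chosen:
                continue
            cov = {s for s in uncovered if covers(cand, s, r2, threshold)}
            if len(cov) > len(best_cov):  # ties keep the earlier candidate
                best, best_cov = cand, cov
        if best is None:  # no remaining candidate tags anything
            break
        chosen.append(best)
        uncovered -= best_cov
    return chosen, uncovered


def drop_imputed(targets, imputation_r2, threshold=0.8):
    """Remove targets already covered by imputation (hypothetical r^2 per SNP)."""
    return [s for s in targets if imputation_r2.get(s, 0.0) < threshold]


# Toy data (made up): rs2 is in high LD with rs1 and rs3; rs4 stands alone.
snps = ["rs1", "rs2", "rs3", "rs4"]
ld = {("rs1", "rs2"): 0.90, ("rs2", "rs3"): 0.85, ("rs3", "rs4"): 0.30}

tags, missed = greedy_pairwise_tagging(snps, snps, ld)
remaining = drop_imputed(snps, {"rs1": 0.95, "rs4": 0.40})
```

In a hybrid scheme of the kind the abstract describes, rounds of the greedy step would alternate with the removal step, so that later rounds spend array capacity only on SNPs that imputation does not yet cover.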
It is useful to have robust gene-environment interaction tests that can utilize a variety of family structures in an efficient way. This paper focuses on tests for gene-environment interaction in the presence of main genetic and environmental effects. The objective is to develop powerful tests that can combine trio data with parental genotypes and discordant sibships when parental genotypes are missing. We first make a modest improvement on a method for discordant sibs (discordant on phenotype), but the approach does not allow one to use families when all offspring are affected, e.g. trios. We then make a modest improvement on a Mendelian transmission-based approach that is inefficient when discordant sibs are available, but can be applied to any nuclear family. Finally, we propose a hybrid approach that utilizes the most efficient method for a specific family type, then combines over families. We utilize this hybrid approach to analyze a chronic obstructive pulmonary disease dataset to test for gene-environment interaction in the Serpine2 gene with smoking. The methods are freely available in the R package fbati.
Gene-Environment Interaction; Family-Based Association Tests; Candidate Gene Analysis; Binary Trait; COPD; Serpine2
The success of genome-wide association studies has paralleled the development of efficient genotyping technologies. We describe the development of a next-generation microarray based on the new highly-efficient Affymetrix Axiom genotyping technology that we are using to genotype individuals of European ancestry from the Kaiser Permanente Research Program on Genes, Environment and Health (RPGEH). The array contains 674,517 SNPs, and provides excellent genome-wide as well as gene-based and candidate-SNP coverage. Coverage was calculated using an approach based on imputation and cross validation. Preliminary results for the first 80,301 saliva-derived DNA samples from the RPGEH demonstrate very high quality genotypes, with sample success rates above 94% and over 98% of successful samples having SNP call rates exceeding 98%. At steady state, we have produced 462 million genotypes per week for each Axiom system. The new array provides a valuable addition to the repertoire of tools for large scale genome-wide association studies.
Microarray; Genome-wide association study; Coverage; Throughput; Single nucleotide polymorphism
Genome-wide association studies (GWAS) have successfully detected and replicated associations with numerous diseases, including cancers of the prostate and breast. These findings are helping clarify the genomic basis of such diseases, but appear to explain little of disease heritability. This limitation might reflect the focus of conventional GWAS on a small set of the most statistically significant associations with disease. More information might be obtained by analyzing GWAS using a polygenic model, which allows for the possibility that thousands of genetic variants could impact disease. Furthermore, there may exist common polygenic effects between potentially related phenotypes (e.g., prostate and breast cancer). Here we present and apply a polygenic model to GWAS of prostate and breast cancer. Our results indicate that the polygenic model can explain an increasing—albeit low—amount of heritability for both of these cancers, even when excluding the most statistically significant associations. In addition, nonaggressive prostate cancer and breast cancer appear to share a common polygenic model, potentially reflecting a similar underlying biology. This supports the further development and application of polygenic models to genomic data.
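One common way to put a polygenic model into practice is a polygenic score: variants passing a liberal P-value threshold in a training set are combined into a weighted allele count in an independent set. The sketch below assumes that scoring scheme (log odds ratio weights, additive allele coding). The abstract does not specify the exact model, and all names, thresholds, and numbers here are illustrative.

```python
import math


def select_variants(summary_stats, p_threshold):
    """Keep variants whose training-set P value is below the threshold.

    summary_stats maps SNP id -> (odds_ratio, p_value).
    Returns SNP id -> log odds ratio, used as the per-allele weight.
    """
    return {
        snp: math.log(odds_ratio)
        for snp, (odds_ratio, p) in summary_stats.items()
        if p < p_threshold
    }


def polygenic_score(genotype, weights):
    """Weighted sum of risk-allele counts (0, 1, or 2) over selected variants."""
    return sum(w * genotype.get(snp, 0) for snp, w in weights.items())


# Made-up training summary statistics: SNP id -> (odds ratio, P value).
stats = {
    "snpA": (1.20, 1e-6),
    "snpB": (1.05, 0.04),
    "snpC": (0.90, 0.30),
    "snpD": (1.50, 0.60),
}

strict = select_variants(stats, p_threshold=1e-4)   # only top hits
liberal = select_variants(stats, p_threshold=0.5)   # polygenic-style threshold

person = {"snpA": 2, "snpB": 1, "snpC": 0, "snpD": 2}
score_strict = polygenic_score(person, strict)
score_liberal = polygenic_score(person, liberal)
```

Comparing how well scores built at increasingly liberal thresholds predict disease in the independent set is one way to ask whether variants beyond the top hits carry signal.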
Despite compelling epidemiological evidence that folic acid supplements reduce the frequency of neural tube defects (NTDs) in newborns, common variant association studies with folate metabolism genes have failed to explain the majority of NTD risk. The contributions of rare alleles and of genetic interactions within the folate pathway have not been extensively studied in the context of NTDs. Thus, we sequenced the exons in 31 folate-related genes in a 480-member NTD case-control population to identify the full spectrum of allelic variation and determine whether rare alleles or obvious genetic interactions within this pathway affect NTD risk. We constructed a pathway model, predetermined independent of the data, which grouped genes into coherent sets reflecting the distinct metabolic compartments in the folate/one-carbon pathway (purine synthesis, pyrimidine synthesis, and homocysteine recycling to methionine). By integrating multiple variants based on these groupings, we uncovered two provocative, complex genetic risk signatures. Interestingly, these signatures differed by race/ethnicity: a Hispanic risk profile pointed to alterations in purine biosynthesis, whereas that in non-Hispanic whites implicated homocysteine metabolism. In contrast, parallel analyses that focused on individual alleles, or individual genes, as the units by which to assign risk revealed no compelling associations. These results suggest that the ability to layer pathway relationships onto clinical variant data can be uniquely informative for identifying genetic risk as well as for generating mechanistic hypotheses. Furthermore, the identification of ethnic-specific risk signatures for spina bifida resonated with epidemiological data suggesting that the underlying pathogenesis may differ between Hispanic and non-Hispanic groups.
Rare variants may help to explain some of the missing heritability of complex diseases. Technological advances in next-generation sequencing give us the opportunity to test this hypothesis. We propose two new methods (one for case-control studies and one for family-based studies) that combine aggregated rare variants and common variants located within a region through principal components analysis and allow for covariate adjustment. We analyzed 200 replicates consisting of 209 case subjects and 488 control subjects and compared the results to weight-based and step-up aggregation methods. The principal components and collapsing method showed an association between the gene FLT1 and the quantitative trait Q1 (P < 10^-30) in a fraction of the computation time of the other methods. The proposed family-based test gave inconclusive results. The two methods provide a fast way to simultaneously analyze rare and common variants at the gene level while adjusting for covariates. However, further evaluation of the statistical efficiency of this approach is warranted.
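The collapsing step can be pictured as follows: variants below a minor allele frequency cutoff are merged into a single per-subject indicator, while common variants are kept as separate columns. The resulting columns would then be passed to principal components analysis. This Python sketch covers only the collapsing step, and the cutoff values, function names, and toy genotypes are assumptions rather than details taken from the abstract.

```python
def minor_allele_frequency(column):
    """MAF for one variant coded as minor-allele counts (0, 1, or 2) per subject."""
    return sum(column) / (2 * len(column))


def collapse_rare(genotypes, maf_cutoff=0.01):
    """Split variants into common columns plus one collapsed rare-variant indicator.

    genotypes maps variant id -> list of allele counts, one entry per subject.
    Returns (common, rare_indicator); rare_indicator[i] is 1 if subject i
    carries at least one minor allele at any rare variant, else 0.
    """
    n_subjects = len(next(iter(genotypes.values())))
    common = {}
    rare_indicator = [0] * n_subjects
    for variant, column in genotypes.items():
        if minor_allele_frequency(column) >= maf_cutoff:
            common[variant] = column
        else:
            for i, count in enumerate(column):
                if count > 0:
                    rare_indicator[i] = 1
    return common, rare_indicator


# Toy data for five subjects; the cutoff is raised to 0.2 only because the
# sample is tiny (a single allele among 10 chromosomes already gives MAF 0.1).
toy = {
    "v1": [1, 0, 2, 1, 0],  # MAF 0.4 -> common
    "v2": [0, 1, 0, 0, 0],  # MAF 0.1 -> rare
    "v3": [0, 0, 0, 0, 1],  # MAF 0.1 -> rare
}
common, rare = collapse_rare(toy, maf_cutoff=0.2)
```

Collapsing reduces many sparse columns to one, which keeps the input to the principal components step small.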
The fgui R package is designed for developers of R packages, to help rapidly, and sometimes fully automatically, create a graphical user interface for a command-line R package. The interface is built upon the Tcl/Tk graphical interface included in R. The package further assists the developer by loading the help files from the command-line functions to provide context-sensitive help to the user with no additional effort from the developer. Passing a function as the argument to the routines in the fgui package creates a graphical interface for the function, and further options are available to tweak this interface for those who want more flexibility.
GUI; interface; fgui
When testing for genetic effects, failure to account for a gene-environment interaction can mask the true association effects of a genetic marker with disease. Family-based association tests are popular because they are completely robust to population substructure and model misspecification. However, when testing for an interaction, failure to model the main genetic effect correctly can lead to spurious results. Here we propose a family-based test for interaction that is robust to model misspecification, but still sensitive to an interaction effect, and can handle continuous covariates and missing parents. We extend the FBAT-I gene-environment interaction test for dichotomous traits to use both trios and sibships. We then compare this extension to joint tests of gene and gene-environment interaction, and additionally compare the joint test to the main effects test of the gene. Lastly, we apply these three tests to a group of nuclear families ascertained on the basis of bipolar disorder.
genetic association; genetic interaction; family-based test; FBAT-I
With the advent of high throughput genomics and high-resolution imaging techniques, there is a growing necessity in biology and medicine for parallel computing, and with the low cost of computing, it is now cost-effective for even small labs or individuals to build their own personal computation cluster.
Here we briefly describe how to use commodity hardware to build a low-cost, high-performance compute cluster, and provide an in-depth example and sample code for parallel execution of R jobs using MOSIX, a mature extension of the Linux kernel for parallel computing. A similar process can be used with other cluster platform software.
As a statistical genetics example, we use our cluster to run a simulated eQTL experiment. Because eQTL analysis is computationally intensive and, like many statistics/genetics applications, conceptually easy to parallelize, parallel execution with MOSIX gives a linear speedup in analysis time with little additional effort.
We have used MOSIX to run a wide variety of software programs in parallel with good results. The limitations and benefits of using MOSIX are discussed and compared to other platforms.
We introduce a stepwise approach for family-based designs for selecting a set of markers in a gene that are independently associated with the disease. The approach is based on testing the effect of a set of markers conditional on another set of markers. Several likelihood-based approaches have been proposed for special cases, but no model-free tests have been proposed. We propose two types of tests in a family-based framework that are applicable to arbitrary family structures and completely robust to population stratification. We propose methods for ascertained dichotomous traits and unascertained quantitative traits. We first propose a completely model-free extension of the FBAT main genetic effect test. Then, to improve power, we introduce two model-based tests, one for dichotomous traits and one for continuous traits. Lastly, we utilize these tests to analyze a continuous lung function phenotype as a proxy for asthma in the Childhood Asthma Management Program. The methods are implemented in the free R package fbati.
Binary trait; Candidate gene analysis; Family-based association tests; FBAT-C; Linkage disequilibrium (LD); Model-based test; Model-free test; Nuclear families; Quantitative trait
Recent findings suggest that rare variants play an important role in both monogenic and common diseases. Due to their rarity, however, it remains unclear how to appropriately analyze the association between such variants and disease. A common approach entails combining rare variants together based on a priori information and analyzing them as a single group. Here one must make some assumptions about what to aggregate. Instead, we propose two approaches to empirically determine the most efficient grouping of rare variants. The first considers multiple possible groupings using existing information. The second is an agnostic “step-up” approach that determines an optimal grouping of rare variants analytically and does not rely on prior information. To evaluate these approaches, we undertook a simulation study using sequence data from genes in the one-carbon folate metabolic pathway. Our results show that using prior information to group rare variants is advantageous only when information is quite accurate, but the step-up approach works well across a broad range of plausible scenarios. This agnostic approach allows one to efficiently analyze the association between rare variants and disease while avoiding assumptions required by other approaches for grouping such variants.
Numerous studies have demonstrated associations between genetic markers and COPD, but results have been inconsistent. One reason may be heterogeneity in disease definition. Unsupervised learning approaches may assist in understanding disease heterogeneity.
We selected 31 phenotypic variables and 12 SNPs from five candidate genes in 308 subjects in the National Emphysema Treatment Trial (NETT) Genetics Ancillary Study cohort. We used factor analysis to select a subset of phenotypic variables, and then used cluster analysis to identify subtypes of severe emphysema. We examined the phenotypic and genotypic characteristics of each cluster.
We identified six factors accounting for 75% of the shared variability among our initial phenotypic variables. We selected four phenotypic variables from these factors for cluster analysis: 1) post-bronchodilator FEV1 percent predicted, 2) percent bronchodilator responsiveness, and quantitative CT measurements of 3) apical emphysema and 4) airway wall thickness. K-means cluster analysis revealed four clusters, though separation between clusters was modest: 1) emphysema predominant, 2) bronchodilator responsive, with higher FEV1; 3) discordant, with a lower FEV1 despite less severe emphysema and lower airway wall thickness, and 4) airway predominant. Of the genotypes examined, membership in cluster 1 (emphysema-predominant) was associated with TGFB1 SNP rs1800470.
Cluster analysis may identify meaningful disease subtypes and/or groups of related phenotypic variables even in a highly selected group of severe emphysema subjects, and may be useful for genetic association studies.
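To make the clustering step concrete, here is a minimal pure-Python sketch of k-means (Lloyd's algorithm) applied to standardized variables. It assumes z-score standardization and a simple deterministic initialization, neither of which is stated in the abstract, and it uses two made-up variables and two clusters rather than the four variables and four clusters reported above. All names and values are hypothetical.

```python
import math


def standardize(rows):
    """Z-score each column so variables on different scales contribute equally.

    Assumes no column is constant (a zero standard deviation would divide by zero).
    """
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [
        math.sqrt(sum((x - m) ** 2 for x in c) / (len(c) - 1))
        for c, m in zip(cols, means)
    ]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; the first k points serve as initial centers (a simplification)."""
    centers = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:  # assignments stable, so the algorithm has converged
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster empties out
                centers[j] = [sum(vals) / len(members) for vals in zip(*members)]
    return labels, centers


# Made-up subjects: (FEV1 percent predicted, percent emphysema on CT).
subjects = [(30, 60), (55, 20), (32, 58), (57, 22), (28, 62), (53, 18)]
labels, centers = kmeans(standardize(subjects), k=2)
```

In practice the result depends on the initial centers, so multiple random starts are usually compared. The modest cluster separation noted in the abstract is a reminder that cluster labels from any single run should be interpreted cautiously.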
Berkeley sickle cell mice are used as an animal model of human sickle cell disease but there are no reports of platelet studies in this model. Since humans with sickle cell disease have platelet abnormalities, we studied platelet morphology and function in Berkeley mice (SS). We observed elevated mean platelet forward angle light scatter (FSC) values (an indirect measure of platelet volume) in SS compared to wild type (WT) (37 ± 3.2 vs. 27 ± 1.4, mean ± SD; p < 0.001), in association with moderate thrombocytopenia (505 ± 49 × 10^3/μl vs. 1151 ± 162 × 10^3/μl; p < 0.001). Despite having marked splenomegaly, SS mice had elevated levels of Howell-Jolly bodies and “pocked” erythrocytes (p < 0.001 for both) suggesting splenic dysfunction. SS mice also had elevated numbers of thiazole orange positive platelets (5 ± 1% vs. 1 ± 1%; p < 0.001), normal to low plasma thrombopoietin levels, normal plasma glycocalicin levels, normal levels of platelet recovery, and near normal platelet life spans. Platelets from SS mice bound more fibrinogen and antibody to P-selectin following activation with a threshold concentration of a protease activated receptor (PAR)-4 peptide compared to WT mice. Enlarged platelets are associated with a predisposition to arterial thrombosis in humans and some humans with SCD have been reported to have large platelets. Thus, additional studies are needed to assess whether large platelets contribute either to pulmonary hypertension or the large vessel arterial occlusion that produces stroke in some children with sickle cell disease.
Sickle cell; Berkeley mouse model; Platelet size