It is often useful to rerun a command line R script with some slight change in the parameters used to run it – a new set of parameters for a simulation, a different dataset to process, etc. The R package batch provides a means to pass multiple command line options easily into R, including vectors of values in the usual R format. The same script can be set up to run things in parallel via different command line arguments. The package also simplifies this parallel batching by allowing one to use R and an R-like syntax for arguments to spread a script across a cluster or a local multicore/multiprocessor computer, with automated syntax for several popular cluster types. Finally, it provides a means to aggregate the results of multiple processes run on a cluster.
parallel; cluster; command line arguments; batch; R
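The idea of passing name/value pairs on the command line, with values written in R vector notation, can be illustrated with a minimal sketch. This is not the batch package's API: the function names `parse_value` and `parse_args`, the supported subset of R syntax (`c(...)` and ascending `a:b` ranges only), and the default parameters are all assumptions made for illustration.

```python
def parse_value(text):
    """Convert one argument string into a number, a string, or a list.

    Hypothetical helper: handles only flat c(...) vectors and ascending
    integer ranges such as 1:5, not the full R grammar.
    """
    text = text.strip()
    if text.startswith("c(") and text.endswith(")"):
        # R-style vector literal, e.g. c(0.1,0.5); nested vectors are not supported
        inner = text[2:-1]
        return [parse_value(item) for item in inner.split(",") if item.strip()]
    if ":" in text:
        # R-style inclusive integer range, e.g. 1:3 (ascending only in this sketch)
        lo, _, hi = text.partition(":")
        try:
            return list(range(int(lo), int(hi) + 1))
        except ValueError:
            pass  # not a numeric range; fall through and treat as plain text
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


def parse_args(argv, defaults):
    """Overlay 'name value name value ...' arguments on a dict of defaults."""
    if len(argv) % 2 != 0:
        raise ValueError("arguments must come in name/value pairs")
    params = dict(defaults)
    for name, raw in zip(argv[0::2], argv[1::2]):
        params[name] = parse_value(raw)
    return params


# Example: the list below stands in for the arguments after the script name.
example = parse_args(
    ["n", "500", "effects", "c(0.1,0.5)", "reps", "1:3"],
    {"seed": 1, "n": 100},
)
```

Keeping defaults in one dictionary means a script can be rerun with only the changed parameters on the command line, which is the usage pattern the abstract describes.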
An efficient approach to characterizing the disease burden of rare genetic variants is to impute them into large well-phenotyped cohorts with existing genome-wide genotype data using large sequenced reference panels. The success of this approach hinges on the accuracy of rare variant imputation, which remains controversial. For example, a recent study suggested that one cannot adequately impute the HOXB13 G84E mutation associated with prostate cancer risk (carrier frequency of 0.0034 in European ancestry participants in the 1000 Genomes Project). We show here that, by utilizing the 1000 Genomes Project data plus an enriched reference panel of mutation carriers, we were able to accurately impute the G84E mutation into a large cohort of 83,285 non-Hispanic White participants from the Kaiser Permanente Research Program on Genes, Environment and Health Genetic Epidemiology Research on Adult Health and Aging cohort. Imputation authenticity was confirmed via a novel classification and regression tree method, and then empirically validated by analyzing a subset of these subjects plus an additional 1,789 men from the California Men’s Health Study specifically genotyped for the G84E mutation (r² = 0.57, 95% CI = 0.37–0.77). We then show the value of this approach by using the imputed data to investigate the impact of the G84E mutation on age-specific prostate cancer risk and on risk of fourteen other cancers in the cohort. The age-specific risk of prostate cancer among G84E mutation carriers was higher than among non-carriers, and this difference increased with age. Risk estimates from Kaplan-Meier curves were 36.7% versus 13.6% by age 72, and 64.2% versus 24.2% by age 80, for G84E mutation carriers and non-carriers, respectively (p = 3.4 × 10⁻¹²). 
The G84E mutation was also suggestively associated with an increase in risk for the following cancer sites by approximately 50% in a pleiotropic manner: breast, non-Hodgkin’s lymphoma, kidney, bladder, melanoma, endometrium, and pancreas (p = 0.042).
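The age-specific risk estimates above come from Kaplan-Meier curves, where cumulative risk by age t is 1 − S(t). A minimal sketch of the standard product-limit estimator follows; the function name and the toy ages are illustrative assumptions and are not data from the study.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of S(t) at each distinct event time.

    times  : age at diagnosis or at censoring for each subject
    events : 1 if the subject was diagnosed at that age, 0 if censored
    Returns a list of (time, survival) pairs.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        diagnosed = 0
        removed = 0
        # Group all subjects sharing the same time point.
        while i < len(data) and data[i][0] == t:
            diagnosed += data[i][1]
            removed += 1
            i += 1
        if diagnosed > 0:
            # Multiply by the conditional probability of remaining disease-free at t.
            survival *= 1.0 - diagnosed / at_risk
            curve.append((t, survival))
        # Both diagnosed and censored subjects leave the risk set after t.
        at_risk -= removed
    return curve


# Toy data (made up for illustration): ages and diagnosis indicators.
toy_curve = kaplan_meier([60, 65, 65, 70, 72, 80], [1, 0, 1, 1, 0, 1])
# Cumulative risk by each event age is one minus the survival estimate.
toy_risk = [(age, 1.0 - s) for age, s in toy_curve]
```

Computing one such curve per group (carriers and non-carriers) and reading off 1 − S(t) at a given age yields risk estimates of the kind reported in the abstract.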
An efficient approach to characterizing the disease burden of rare genetic variants is to impute them into existing well-phenotyped cohorts with genome-wide data by using large sequenced reference panels; however, the efficacy of this approach remains controversial. A recent study suggested that it is not possible to impute the rare HOXB13 G84E variant using neighboring SNP markers. We show that by using an enriched reference sequenced sample of 22 mutation carriers, we were able to impute this mutation into a large cohort of 83,285 non-Hispanic White individuals from the Kaiser Permanente Research Program on Genes, Environment, and Health Genetic Epidemiology Research on Adult Health and Aging (GERA) cohort. The imputation was confirmed via a novel classification and regression tree method, and then empirically validated by direct mutation genotyping of a subset of 1,673 of these individuals in addition to 1,789 other men from Kaiser. Using the same GERA cohort, we then confirmed that the G84E mutation is associated with increased risk of prostate cancer, and estimated the age-specific risk for carriers of the mutation. Finally, we obtained evidence that the mutation is associated with additional types of cancer in the GERA cohort.
Inter-individual variation in gene regulatory elements is hypothesized to play a causative role in adverse drug reactions and reduced drug activity. However, relatively little is known about the location and function of drug-dependent elements. To uncover drug-associated elements in a genome-wide manner, we performed RNA-seq and ChIP-seq using antibodies against the pregnane X receptor (PXR) and three active regulatory marks (p300, H3K4me1, H3K27ac) on primary human hepatocytes treated with rifampin or vehicle control. Rifampin and PXR were chosen since they are part of the CYP3A4 pathway, which is known to account for the metabolism of more than 50% of all prescribed drugs. We selected 227 proximal promoters for genes with rifampin-dependent expression or nearby PXR/p300 occupancy sites and assayed their ability to induce luciferase in rifampin-treated HepG2 cells, finding only 10 (4.4%) that exhibited drug-dependent activity. As this result suggested a role for distal enhancer modules, we searched more broadly to identify 1,297 genomic regions bearing a conditional PXR occupancy as well as all three active regulatory marks. These regions are enriched near genes that function in the metabolism of xenobiotics, specifically members of the cytochrome P450 family. We performed enhancer assays in rifampin-treated HepG2 cells for 42 of these sequences as well as 7 sequences that overlap linkage-disequilibrium blocks defined by lead SNPs from pharmacogenomic GWAS studies, revealing 15/42 and 4/7 to be functional enhancers, respectively. A common African haplotype in one of these enhancers in the GSTA locus was found to exhibit potential rifampin hypersensitivity. Combined, our results further suggest that enhancers are the predominant targets of rifampin-induced PXR activation, provide a genome-wide catalog of PXR targets and serve as a model for the identification of drug-responsive regulatory elements.
Drug response varies between individuals and can be caused by genetic factors. Nucleotide variation in gene regulatory elements can have a significant effect on drug response, but due to the difficulty in identifying these elements, they remain understudied. Here, we used various genomic assays to analyze human liver cells treated with or without the antibiotic rifampin and identified drug-induced regulatory elements genome-wide. The testing of numerous active promoters in human liver cells showed only a few to be induced by rifampin treatment. A similar analysis of enhancers found several of them to be induced by the drug. Nucleotide variants in one of these enhancers were found to alter its activity. Combined, this work identifies numerous novel gene regulatory elements that can be activated due to drug response and thus provides candidate sequences in the human genome where nucleotide variation can lead to differences in drug response. It also provides a universally applicable method to detect these elements for other drugs.
Because twin studies suggest that the heritability of moderate-severe bronchopulmonary dysplasia (BPD) is 53% to 79%, we conducted a genome-wide association study (GWAS) to identify genetic variants associated with the risk for BPD.
The discovery GWAS was completed on 1726 very low birth weight infants (gestational age = 25 0/7 to 29 6/7 weeks) who had a minimum of 3 days of intermittent positive pressure ventilation and were in the hospital at 36 weeks’ postmenstrual age. At 36 weeks’ postmenstrual age, moderate-severe BPD cases (n = 899) were defined as requiring continuous supplemental oxygen, whereas controls (n = 827) inhaled room air. An additional 795 comparable infants (371 cases, 424 controls) were a replication population. Genomic DNA from case and control newborn screening bloodspots was used for the GWAS. The replication study interrogated single-nucleotide polymorphisms (SNPs) identified in the discovery GWAS and those within the HumanExome beadchip.
Genotyping using genomic DNA was successful. We did not identify SNPs associated with BPD at the genome-wide significance level (5 × 10⁻⁸), and no SNP identified in previous studies reached statistical significance (Bonferroni-corrected P value threshold 0.0018). Pathway analyses were not informative.
We did not identify genomic loci or pathways that account for the previously described heritability for BPD. Potential explanations include causal genetic variants that were not assayed or that map to many distributed loci, inadequate sample size, the race/ethnicity of our study population, or case-control differences that are not attributable to underlying common genetic variation.
genome-wide association study (GWAS); chronic lung disease; genetic predisposition to disease; premature; very low birth weight infant
Adiponectin, a protein secreted by the adipose tissue, is an endogenous insulin sensitizer with circulating levels that are decreased in obese and diabetic subjects. Recently, circulating levels of adiponectin have been correlated with breast cancer risk. Our previous work showed that polymorphisms of the adiponectin pathway are associated with breast cancer risk.
We conducted the first study of adiponectin pathways in African Americans and Hispanics in the Women’s Health Initiative (WHI) SNP Health Association Resource (SHARe) cohort of 3,642 self-identified Hispanic women and 8,515 self-identified African American women who provided consent for DNA analysis. Single nucleotide polymorphisms (SNPs) from three genes were included in this analysis: ADIPOQ, ADIPOR1 and ADIPOR2. The Genome-wide Human SNP Array 6.0 (909,622 SNPs) (www.affymetrix.com) was used.
We found that rs1501299, a functional SNP of ADIPOQ that we previously reported was associated with breast cancer risk in a mostly Caucasian population, was also significantly associated with breast cancer incidence (HR for the GG/TG genotype: 1.23; 95% CI: 1.059–1.43) in African American women. We did not find any other SNPs in these genes to be associated with breast cancer incidence.
This is the first study assessing the role of adiponectin pathway SNPs in breast cancer risk in African Americans and Hispanics. The SNP rs1501299 is significantly associated with breast cancer risk in African American women. Impact: As the rates of obesity and diabetes increase in African Americans and Hispanics, adiponectin and its functional SNPs may aid in breast cancer risk assessment.
adiponectin; polymorphisms; breast cancer; African Americans; Hispanics
Electrical stimulation of the vagus nerve at relatively high voltages (e.g., >10V) can induce bronchoconstriction. However, low voltage (≤2V) vagus nerve stimulation (VNS) can attenuate histamine-invoked bronchoconstriction. Here, we identify the mechanism for this inhibition.
In urethane-anesthetized guinea pigs, bipolar electrodes were attached to both vagus nerves and changes in pulmonary inflation pressure were recorded in response to i.v. histamine and during VNS. The attenuation of the histamine response by low-voltage VNS was then examined in the presence of pharmacologic inhibitors or nerve ligation.
Low-voltage VNS attenuated histamine-induced bronchoconstriction (4.4 ± 0.3 vs. 3.2 ± 0.2 cm H2O, p < 0.01) and remained effective following administration of a nitric oxide synthase inhibitor, NG-nitro-L-arginine methyl ester, and after sympathetic nerve depletion with guanethidine, but not after the β-adrenoceptor antagonist propranolol. Nerve ligation caudal to the electrodes did not block the inhibition, but cephalic nerve ligation did. Low-voltage VNS increased circulating epinephrine and norepinephrine in the absence, but not in the presence, of cephalic nerve ligation.
These results indicate that low-voltage VNS attenuates histamine-induced bronchoconstriction via activation of afferent nerves, resulting in a systemic increase in catecholamines likely arising from the adrenal medulla.
Asthma; bronchoconstriction; catecholamine; guinea pig; vagus nerve stimulation
We introduce a stepwise approach for family-based designs for selecting a set of markers in a gene that are independently associated with the disease. The approach is based on testing the effect of a set of markers conditional on another set of markers. Several likelihood-based approaches have been proposed for special cases, but no model-free tests have been proposed. We propose two types of tests in a family-based framework that are applicable to arbitrary family structures and completely robust to population stratification. We propose methods for ascertained dichotomous traits and unascertained quantitative traits. We first propose a completely model-free extension of the FBAT main genetic effect test. Then, to improve power, we introduce two model-based tests, one for dichotomous traits and one for continuous traits. Lastly, we utilize these tests to analyze a continuous lung function phenotype as a proxy for asthma in the Childhood Asthma Management Program. The methods are implemented in the free R package fbati.
Binary trait; Candidate gene analysis; Family-based association tests; FBAT-C; Linkage disequilibrium (LD); Model-based test; Model-free test; Nuclear families; Quantitative trait
Four custom Axiom genotyping arrays were designed for a genome-wide association (GWA) study of 100,000 participants from the Kaiser Permanente Research Program on Genes, Environment and Health. The array optimized for individuals of European race/ethnicity was previously described. Here we detail the development of three additional microarrays optimized for individuals of East Asian, African American, and Latino race/ethnicity. For these arrays, we decreased redundancy of high-performing SNPs to increase SNP capacity. The East Asian array was designed using greedy pairwise SNP selection. However, removing SNPs from the target set based on imputation coverage is more efficient than pairwise tagging. Therefore, we developed a novel hybrid SNP selection method for the African American and Latino arrays utilizing rounds of greedy pairwise SNP selection, followed by removal from the target set of SNPs covered by imputation. The arrays provide excellent genome-wide coverage and are valuable additions for large-scale GWA studies.
Microarray; Genome-wide association study; Coverage; Imputation; Single nucleotide polymorphism; Throughput
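Greedy pairwise SNP selection, as named above, can be sketched as a greedy set-cover loop: repeatedly choose the SNP that tags the most still-uncovered target SNPs. The function name, the r² threshold of 0.8, and the toy linkage values below are assumptions for illustration, not the arrays' actual design parameters. The imputation-based removal step of the hybrid method is not shown.

```python
def greedy_pairwise_tags(targets, r2, threshold=0.8):
    """Pick tag SNPs greedily until every target SNP is covered.

    targets   : iterable of SNP identifiers
    r2        : dict mapping frozenset({snp_a, snp_b}) -> pairwise r^2
    threshold : assumed minimum r^2 for one SNP to tag another
    """
    def covers(a, b):
        # A SNP always covers itself; otherwise require r^2 >= threshold.
        return a == b or r2.get(frozenset((a, b)), 0.0) >= threshold

    uncovered = set(targets)
    candidates = sorted(uncovered)  # sorted so ties resolve deterministically
    chosen = []
    while uncovered:
        # Choose the candidate covering the most still-uncovered targets.
        best = max(candidates, key=lambda s: sum(covers(s, t) for t in uncovered))
        chosen.append(best)
        uncovered -= {t for t in uncovered if covers(best, t)}
    return chosen


# Toy linkage structure (made up): B tags A and C, and nothing tags D.
toy_r2 = {
    frozenset(("A", "B")): 0.90,
    frozenset(("B", "C")): 0.85,
    frozenset(("C", "D")): 0.30,
}
toy_tags = greedy_pairwise_tags(["A", "B", "C", "D"], toy_r2)
```

In a hybrid scheme of the kind the abstract describes, one would additionally drop from `uncovered` any target SNP already covered by imputation from the chosen set after each round, which requires an imputation-coverage estimate that this sketch does not model.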
It is useful to have robust gene-environment interaction tests that can utilize a variety of family structures in an efficient way. This paper focuses on tests for gene-environment interaction in the presence of main genetic and environmental effects. The objective is to develop powerful tests that can combine trio data with parental genotypes and discordant sibships when parental genotypes are missing. We first make a modest improvement on a method for discordant sibs (discordant on phenotype), but the approach does not allow one to use families when all offspring are affected, e.g. trios. We then make a modest improvement on a Mendelian transmission-based approach that is inefficient when discordant sibs are available, but can be applied to any nuclear family. Finally, we propose a hybrid approach that utilizes the most efficient method for a specific family type, then combines over families. We utilize this hybrid approach to analyze a chronic obstructive pulmonary disease dataset to test for gene-environment interaction in the Serpine2 gene with smoking. The methods are freely available in the R package fbati.
Gene-Environment Interaction; Family-Based Association Tests; Candidate Gene Analysis; Binary Trait; COPD; Serpine2
The success of genome-wide association studies has paralleled the development of efficient genotyping technologies. We describe the development of a next-generation microarray based on the new highly-efficient Affymetrix Axiom genotyping technology that we are using to genotype individuals of European ancestry from the Kaiser Permanente Research Program on Genes, Environment and Health (RPGEH). The array contains 674,517 SNPs, and provides excellent genome-wide as well as gene-based and candidate-SNP coverage. Coverage was calculated using an approach based on imputation and cross validation. Preliminary results for the first 80,301 saliva-derived DNA samples from the RPGEH demonstrate very high quality genotypes, with sample success rates above 94% and over 98% of successful samples having SNP call rates exceeding 98%. At steady state, we have produced 462 million genotypes per week for each Axiom system. The new array provides a valuable addition to the repertoire of tools for large scale genome-wide association studies.
Microarray; Genome-wide association study; Coverage; Throughput; Single nucleotide polymorphism
Genome-wide association studies (GWAS) have successfully detected and replicated associations with numerous diseases, including cancers of the prostate and breast. These findings are helping clarify the genomic basis of such diseases, but appear to explain little of disease heritability. This limitation might reflect the focus of conventional GWAS on a small set of the most statistically significant associations with disease. More information might be obtained by analyzing GWAS using a polygenic model, which allows for the possibility that thousands of genetic variants could impact disease. Furthermore, there may exist common polygenic effects between potentially related phenotypes (e.g., prostate and breast cancer). Here we present and apply a polygenic model to GWAS of prostate and breast cancer. Our results indicate that the polygenic model can explain an increasing—albeit low—amount of heritability for both of these cancers, even when excluding the most statistically significant associations. In addition, nonaggressive prostate cancer and breast cancer appear to share a common polygenic model, potentially reflecting a similar underlying biology. This supports the further development and application of polygenic models to genomic data.
Despite compelling epidemiological evidence that folic acid supplements reduce the frequency of neural tube defects (NTDs) in newborns, common variant association studies with folate metabolism genes have failed to explain the majority of NTD risk. The contribution of rare alleles as well as genetic interactions within the folate pathway have not been extensively studied in the context of NTDs. Thus, we sequenced the exons in 31 folate-related genes in a 480-member NTD case-control population to identify the full spectrum of allelic variation and determine whether rare alleles or obvious genetic interactions within this pathway affect NTD risk. We constructed a pathway model, predetermined independent of the data, which grouped genes into coherent sets reflecting the distinct metabolic compartments in the folate/one-carbon pathway (purine synthesis, pyrimidine synthesis, and homocysteine recycling to methionine). By integrating multiple variants based on these groupings, we uncovered two provocative, complex genetic risk signatures. Interestingly, these signatures differed by race/ethnicity: a Hispanic risk profile pointed to alterations in purine biosynthesis, whereas that in non-Hispanic whites implicated homocysteine metabolism. In contrast, parallel analyses that focused on individual alleles, or individual genes, as the units by which to assign risk revealed no compelling associations. These results suggest that the ability to layer pathway relationships onto clinical variant data can be uniquely informative for identifying genetic risk as well as for generating mechanistic hypotheses. Furthermore, the identification of ethnic-specific risk signatures for spina bifida resonated with epidemiological data suggesting that the underlying pathogenesis may differ between Hispanic and non-Hispanic groups.
Rare variants may help to explain some of the missing heritability of complex diseases. Technological advances in next-generation sequencing give us the opportunity to test this hypothesis. We propose two new methods (one for case-control studies and one for family-based studies) that combine aggregated rare variants and common variants located within a region through principal components analysis and allow for covariate adjustment. We analyzed 200 replicates consisting of 209 case subjects and 488 control subjects and compared the results to weight-based and step-up aggregation methods. The principal components and collapsing method showed an association between the gene FLT1 and the quantitative trait Q1 (P < 10⁻³⁰) in a fraction of the computation time of the other methods. The proposed family-based test gave inconclusive results. The two methods provide a fast way to simultaneously analyze rare and common variants at the gene level while adjusting for covariates. However, further evaluation of the statistical efficiency of this approach is warranted.
The fgui R package is designed for developers of R packages, to help rapidly, and sometimes fully automatically, create a graphical user interface for a command line R package. The interface is built upon the Tcl/Tk graphical interface included in R. The package further assists the developer by loading the help files of the command line functions to provide context-sensitive help to the user with no additional effort from the developer. Passing a function as the argument to the routines in the fgui package creates a graphical interface for the function, and further options are available to tweak this interface for those who want more flexibility.
GUI; interface; fgui
When testing for genetic effects, failure to account for a gene-environment interaction can mask the true association effects of a genetic marker with disease. Family-based association tests are popular because they are completely robust to population substructure and model misspecification. However, when testing for an interaction, failure to model the main genetic effect correctly can lead to spurious results. Here we propose a family-based test for interaction that is robust to model misspecification, but still sensitive to an interaction effect, and can handle continuous covariates and missing parents. We extend the FBAT-I gene-environment interaction test for dichotomous traits to use both trios and sibships. We then compare this extension to joint tests of gene and gene-environment interaction, and additionally compare the joint test to the main effects test of the gene. Lastly, we apply these three tests to a group of nuclear families ascertained according to affection with Bipolar Disorder.
genetic association; genetic interaction; family-based test; FBAT-I
With the advent of high throughput genomics and high-resolution imaging techniques, there is a growing necessity in biology and medicine for parallel computing, and with the low cost of computing, it is now cost-effective for even small labs or individuals to build their own personal computation cluster.
Here we briefly describe how to use commodity hardware to build a low-cost, high-performance compute cluster, and provide an in-depth example and sample code for parallel execution of R jobs using MOSIX, a mature extension of the Linux kernel for parallel computing. A similar process can be used with other cluster platform software.
As a statistical genetics example, we use our cluster to run a simulated eQTL experiment. Because eQTL analysis, like many statistics/genetics applications, is computationally intensive and conceptually easy to parallelize, parallel execution with MOSIX gives a linear speedup in analysis time with little additional effort.
We have used MOSIX to run a wide variety of software programs in parallel with good results. The limitations and benefits of using MOSIX are discussed and compared to other platforms.
Recent findings suggest that rare variants play an important role in both monogenic and common diseases. Due to their rarity, however, it remains unclear how to appropriately analyze the association between such variants and disease. A common approach entails combining rare variants together based on a priori information and analyzing them as a single group. Here one must make some assumptions about what to aggregate. Instead, we propose two approaches to empirically determine the most efficient grouping of rare variants. The first considers multiple possible groupings using existing information. The second is an agnostic “step-up” approach that determines an optimal grouping of rare variants analytically and does not rely on prior information. To evaluate these approaches, we undertook a simulation study using sequence data from genes in the one-carbon folate metabolic pathway. Our results show that using prior information to group rare variants is advantageous only when information is quite accurate, but the step-up approach works well across a broad range of plausible scenarios. This agnostic approach allows one to efficiently analyze the association between rare variants and disease while avoiding assumptions required by other approaches for grouping such variants.
Numerous studies have demonstrated associations between genetic markers and COPD, but results have been inconsistent. One reason may be heterogeneity in disease definition. Unsupervised learning approaches may assist in understanding disease heterogeneity.
We selected 31 phenotypic variables and 12 SNPs from five candidate genes in 308 subjects in the National Emphysema Treatment Trial (NETT) Genetics Ancillary Study cohort. We used factor analysis to select a subset of phenotypic variables, and then used cluster analysis to identify subtypes of severe emphysema. We examined the phenotypic and genotypic characteristics of each cluster.
We identified six factors accounting for 75% of the shared variability among our initial phenotypic variables. We selected four phenotypic variables from these factors for cluster analysis: 1) post-bronchodilator FEV1 percent predicted, 2) percent bronchodilator responsiveness, and quantitative CT measurements of 3) apical emphysema and 4) airway wall thickness. K-means cluster analysis revealed four clusters, though separation between clusters was modest: 1) emphysema predominant; 2) bronchodilator responsive, with higher FEV1; 3) discordant, with a lower FEV1 despite less severe emphysema and lower airway wall thickness; and 4) airway predominant. Of the genotypes examined, membership in cluster 1 (emphysema predominant) was associated with TGFB1 SNP rs1800470.
Cluster analysis may identify meaningful disease subtypes and/or groups of related phenotypic variables even in a highly selected group of severe emphysema subjects, and may be useful for genetic association studies.
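The K-means step used above can be illustrated with a minimal sketch of Lloyd's algorithm. The function name, the random initialization from data points, and the two-dimensional toy data are assumptions for illustration; the study itself clustered on four phenotypic variables and its software and settings are not stated here.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update.

    points : list of equal-length numeric tuples (one per subject)
    k      : number of clusters
    Returns (labels, centers).
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumed initialization: k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((x - m) ** 2 for x, m in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center unchanged
        if new_centers == centers:
            break  # converged: centers stopped moving
        centers = new_centers
    return labels, centers


# Toy data (made up): two well-separated groups of three points each.
toy_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
toy_labels, toy_centers = kmeans(toy_points, k=2)
```

In practice the variables would typically be standardized before clustering so that no single measurement dominates the distance, and several random starts would be compared, since K-means can converge to different local solutions.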
Berkeley sickle cell mice are used as an animal model of human sickle cell disease, but there are no reports of platelet studies in this model. Since humans with sickle cell disease have platelet abnormalities, we studied platelet morphology and function in Berkeley mice (SS). We observed elevated mean platelet forward angle light scatter (FSC) values (an indirect measure of platelet volume) in SS compared to wild type (WT) (37 ± 3.2 vs. 27 ± 1.4, mean ± SD; p < 0.001), in association with moderate thrombocytopenia (505 ± 49 × 10³/μl vs. 1151 ± 162 × 10³/μl; p < 0.001). Despite having marked splenomegaly, SS mice had elevated levels of Howell-Jolly bodies and “pocked” erythrocytes (p < 0.001 for both), suggesting splenic dysfunction. SS mice also had elevated numbers of thiazole orange positive platelets (5 ± 1% vs. 1 ± 1%; p < 0.001), normal to low plasma thrombopoietin levels, normal plasma glycocalicin levels, normal levels of platelet recovery, and near normal platelet life spans. Platelets from SS mice bound more fibrinogen and antibody to P-selectin following activation with a threshold concentration of a protease activated receptor (PAR)-4 peptide compared to WT mice. Enlarged platelets are associated with a predisposition to arterial thrombosis in humans, and some humans with SCD have been reported to have large platelets. Thus, additional studies are needed to assess whether large platelets contribute either to pulmonary hypertension or to the large vessel arterial occlusion that produces stroke in some children with sickle cell disease.
Sickle cell; Berkeley mouse model; Platelet size