Genome-wide association studies (GWAS) have identified thousands of genetic variants that influence a variety of diseases and health-related quantitative traits. However, the causal variants underlying the majority of genetic associations remain unknown. The Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Targeted Sequencing Study aims to follow up GWAS signals and identify novel associations of the allelic spectrum of identified variants with cardiovascular related traits.
Methods and Results
The study included 4,231 participants from three CHARGE cohorts: the Atherosclerosis Risk in Communities Study, the Cardiovascular Health Study, and the Framingham Heart Study. We used a case-cohort design in which we selected both a random sample of participants and participants with extreme phenotypes for each of 14 traits. We sequenced and analyzed 77 genomic loci, which had previously been associated with one or more of 14 phenotypes. A total of 52,736 variants were characterized by sequencing and passed our stringent quality control criteria. For common variants (minor allele frequency ≥1%), we performed unweighted regression analyses to obtain p-values for associations and weighted regression analyses to obtain effect estimates that accounted for the sampling design. For rare variants, we applied two approaches: collapsed aggregate statistics and joint analysis of variants using the Sequence Kernel Association Test.
We sequenced 77 genomic loci in participants from three cohorts. We established a set of filters to identify high-quality variants and implemented statistical and bioinformatics strategies to analyze the sequence data and identify potentially functional variants within GWAS loci.
genetics; epidemiology; CHARGE; sampling; targeted sequencing
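The two analysis ideas named in the Methods above, collapsing rare variants into an aggregate statistic and weighting regressions to account for the sampling design, can be illustrated with a minimal sketch. It assumes a simple count collapse and inverse-sampling-probability weights; the abstract does not specify either choice, and the function names `burden_scores` and `weighted_least_squares` are hypothetical.

```python
import numpy as np


def burden_scores(genotypes, maf_threshold=0.01):
    """Collapse rare variants into one count per participant.

    genotypes: (n_people, n_variants) minor-allele counts coded 0, 1, 2.
    Returns the per-person rare-allele count and the mask of rare variants.
    """
    genotypes = np.asarray(genotypes, dtype=float)
    maf = genotypes.mean(axis=0) / 2.0      # sample minor allele frequency
    rare = maf < maf_threshold              # variants treated as rare
    return genotypes[:, rare].sum(axis=1), rare


def weighted_least_squares(x, y, weights):
    """Intercept and slope from a weighted regression of y on x.

    weights: assumed here to be inverse sampling probabilities, so that
    over-sampled extreme phenotypes count less (an assumption of this sketch).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(weights, dtype=float)
    X = np.column_stack([np.ones_like(x), x])
    Xw = X * w[:, None]                     # rows of X scaled by their weight
    return np.linalg.solve(X.T @ Xw, Xw.T @ y)


# Toy data: 6 people, 3 variants; threshold raised because the sample is tiny.
toy_genotypes = [
    [0, 1, 0],
    [0, 0, 0],
    [1, 1, 0],
    [0, 2, 0],
    [0, 0, 0],
    [0, 1, 1],
]
scores, rare_mask = burden_scores(toy_genotypes, maf_threshold=0.1)
```

In the toy data the first and third variants fall below the threshold, so each participant's score is the number of minor alleles carried at those two sites. The score can then be used as the predictor `x` in the weighted regression.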
Chromatin structure, in terms of positioning of nucleosomes and nucleosome-free regions in the DNA, has been found to have an immense impact on various cell functions and processes, ranging from transcriptional regulation to growth and development. In spite of numerous experimental and computational approaches being developed in the past few years to determine the intrinsic relationship between chromatin structure (nucleosome positioning) and DNA sequence features, there is as yet no universally accurate approach to predict nucleosome positioning from the underlying DNA sequence alone. We here propose an alternative approach to predicting nucleosome positioning from sequence, making use of characteristic sequence differences and inherent dependencies in overlapping sequence features. Our nucleosomal positioning prediction algorithm, based on the idea of hierarchical generalized hidden Markov models (HGHMMs), was used to predict nucleosomal state based on the DNA sequence in yeast chromosome III, and compared with two other existing methods. The HGHMM method performed favorably among the three models in terms of specificity and sensitivity, and provided estimates that were largely consistent with predictions from the method of Yuan and Liu (2008). However, all the methods still give higher than desirable misclassification rates, indicating that sequence-based features may provide only limited information towards understanding positioning of nucleosomes. The method is implemented in the open-source statistical software R, and is freely available from the authors’ website.
chromatin; nucleosome; DNA sequence; Bayesian modeling
The basic reproductive number (R₀) and the distribution of the serial interval (SI) are often used to quantify transmission during an infectious disease outbreak. In this paper, we present estimates of R₀ and SI from the 2003 SARS outbreak in Hong Kong and Singapore, and the 2009 pandemic influenza A(H1N1) outbreak in South Africa using methods that expand upon an existing Bayesian framework. This expanded framework allows for the incorporation of additional information, such as contact tracing or household data, through prior distributions. The results for the R₀ and the SI from the influenza outbreak in South Africa were similar regardless of the prior information (R̂₀ = 1.36–1.46, μ̂ = 2.0–2.7, where μ is the mean of the SI). The estimates of R₀ and μ for the SARS outbreak ranged from 2.0–4.4 and 7.4–11.3, respectively, and were shown to vary depending on the use of contact tracing data. The impact of the contact tracing data was likely due to the small number of SARS cases relative to the size of the contact tracing sample.
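To make the role of the prior concrete, here is a minimal sketch of one common way to link daily case counts to R₀ and the SI. It assumes a Poisson renewal model in which the expected number of cases on day t is R₀ times the SI-weighted sum of earlier cases. The abstract does not state the likelihood, so this model, the Gamma prior, and the function names are assumptions of the sketch. For simplicity the SI distribution is treated as known, whereas the paper estimates it jointly with R₀.

```python
import numpy as np


def infection_pressure(counts, si_pmf):
    """SI-weighted sum of earlier cases for each day.

    pressure[t] = sum_j si_pmf[j - 1] * counts[t - j], for lags j = 1, 2, ...
    si_pmf[j - 1] is the assumed probability that the serial interval is j days.
    """
    counts = np.asarray(counts, dtype=float)
    si_pmf = np.asarray(si_pmf, dtype=float)
    pressure = np.zeros(len(counts))
    for t in range(1, len(counts)):
        for j in range(1, min(t, len(si_pmf)) + 1):
            pressure[t] += si_pmf[j - 1] * counts[t - j]
    return pressure


def r0_posterior(counts, si_pmf, prior_shape=1.0, prior_rate=1.0):
    """Gamma posterior (shape, rate) for R0 under the assumed Poisson model.

    With counts[t] ~ Poisson(R0 * pressure[t]) and a Gamma(shape, rate) prior,
    the posterior is Gamma(shape + sum of cases, rate + sum of pressure).
    Day 0 is conditioned on, so it is excluded from both sums.
    """
    counts = np.asarray(counts, dtype=float)
    pressure = infection_pressure(counts, si_pmf)
    return prior_shape + counts[1:].sum(), prior_rate + pressure[1:].sum()


# Toy outbreak that doubles daily, with an SI fixed at exactly one day.
shape, rate = r0_posterior([1, 2, 4, 8], si_pmf=[1.0])
posterior_mean = shape / rate
```

Under this sketch, external information such as contact tracing data would enter through `prior_shape` and `prior_rate` (or through a prior on the SI). A small outbreak contributes little to the sums, which is consistent with the prior mattering more for the SARS data than for the influenza data.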
Genome-wide association studies (GWAS) have identified common genetic variants that predispose to atrial fibrillation (AF). It is unclear whether rare and low-frequency variants in genes implicated by such GWAS confer additional risk of AF.
To study the association of genetic variants with AF at GWAS top loci.
In the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Targeted Sequencing Study, we selected and sequenced 77 target gene regions from GWAS loci of complex diseases or traits, including 4 genes hypothesized to be related to AF (PRRX1, CAV1, CAV2, and ZFHX3). Sequencing was performed in participants with (n = 948) and without (n = 3330) AF from the Atherosclerosis Risk in Communities Study, the Cardiovascular Health Study, the Framingham Heart Study, and the Massachusetts General Hospital.
One common variant (rs11265611; P = 1.70 × 10⁻⁶) intronic to IL6R (interleukin-6 receptor gene) was significantly associated with AF after Bonferroni correction (odds ratio 0.70; 95% confidence interval 0.58–0.85). The variant was not genotyped or imputed by prior GWAS, but it is in linkage disequilibrium (r² = .69) with rs4845625, the single-nucleotide polymorphism with the strongest association with AF so far at this locus. In the rare variant joint analysis, damaging variants within the PRRX1 region showed significant association with AF after Bonferroni correction (P = .01).
We identified 1 common single-nucleotide polymorphism and 1 gene region that were significantly associated with AF. Future sequencing efforts with larger sample sizes and more comprehensive genome coverage are anticipated to identify additional AF-related variants.
Arrhythmia; Genetics; Atrial fibrillation; Epidemiology
Stroke, the leading neurologic cause of death and disability, has a substantial genetic component. We previously conducted a genome-wide association study (GWAS) in four prospective studies from the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) consortium and demonstrated that sequence variants near the NINJ2 gene are associated with incident ischemic stroke. Here, we sought to fine-map functional variants in the region and evaluate the contribution of rare variants to ischemic stroke risk.
Methods and Results
We sequenced 196 kb around NINJ2 on chromosome 12p13 among 3,986 European ancestry participants, including 475 ischemic stroke cases, from the Atherosclerosis Risk in Communities Study, Cardiovascular Health Study, and Framingham Heart Study. Meta-analyses of single-variant tests for 425 common variants (minor allele frequency [MAF] ≥ 1%) confirmed the original GWAS results and identified an independent intronic variant, rs34166160 (MAF = 0.012), most significantly associated with incident ischemic stroke (HR = 1.80, p = 0.0003). Aggregating 278 putatively functional variants with MAF ≤ 1% using count statistics, we observed a nominally statistically significant association, with the burden of rare NINJ2 variants contributing to decreased ischemic stroke incidence (HR = 0.81; p = 0.026).
Common and rare variants in the NINJ2 region were nominally associated with incident ischemic stroke among a subset of CHARGE participants. Allelic heterogeneity at this locus, caused by multiple rare, low frequency, and common variants with disparate effects on risk, may explain the difficulties in replicating the original GWAS results. Additional studies that take into account the complex allelic architecture at this locus are needed to confirm these findings.
Large-scale chromosomal deletions or other non-specific perturbations of the transcriptome can alter the expression of hundreds or thousands of genes, and it is of biological interest to understand which genes are most profoundly affected. We present a method for predicting a gene’s expression as a function of other genes thereby accounting for the effect of transcriptional regulation that confounds the identification of genes differentially expressed relative to a regulatory network. The challenge in constructing such models is that the number of possible regulator transcripts within a global network is on the order of thousands, and the number of biological samples is typically on the order of 10. Nevertheless, there are large gene expression databases that can be used to construct networks that could be helpful in modeling transcriptional regulation in smaller experiments.
We demonstrate a type of penalized regression model that can be estimated from large gene expression databases, and then applied to smaller experiments. The ridge parameter is selected by minimizing the cross-validation error of the predictions in the independent out-sample. This tends to increase the model stability and leads to a much greater degree of parameter shrinkage, but the resulting biased estimation is mitigated by a second round of regression. Nevertheless, the proposed computationally efficient “over-shrinkage” method outperforms previously used LASSO-based techniques. In two independent datasets, we find that the median proportion of explained variability in expression is approximately 25%. This results in a substantial increase in the signal-to-noise ratio, allowing more powerful inferences on differential gene expression and leading to biologically intuitive findings. We also show that a large proportion of gene dependencies are conditional on the biological state, which would be impossible to detect with standard differential expression methods.
By adjusting for the effects of the global network on individual genes, both the sensitivity and reliability of differential expression measures are greatly improved.
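The estimation strategy described above can be sketched in three steps: fit ridge regression on a large database, choose the ridge parameter by prediction error in an independent sample, and then correct the shrinkage bias with a second regression. The sketch below is one reading of that description, not the authors' implementation. In particular, treating the second round as an ordinary least-squares regression of observed expression on the ridge prediction is an assumption, and all function names are hypothetical.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Ridge coefficients: solve (X'X + lam * I) beta = X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


def select_lambda(X_db, y_db, X_out, y_out, lambdas):
    """Pick the ridge parameter with the lowest error in the independent sample."""
    errors = []
    for lam in lambdas:
        beta = ridge_fit(X_db, y_db, lam)
        errors.append(np.mean((y_out - X_out @ beta) ** 2))
    return lambdas[int(np.argmin(errors))]


def second_round(pred, y):
    """Regress observed expression on the ridge prediction (assumed form).

    Returns (intercept, slope); a slope above 1 undoes part of the shrinkage.
    """
    pred = np.asarray(pred, dtype=float)
    A = np.column_stack([np.ones_like(pred), pred])
    coef, *_ = np.linalg.lstsq(A, np.asarray(y, dtype=float), rcond=None)
    return coef


# Simulated stand-ins for a large database and a small independent experiment.
rng = np.random.default_rng(0)
true_beta = np.array([1.0, -0.5, 0.0, 0.25])
X_db = rng.normal(size=(200, 4))
y_db = X_db @ true_beta + rng.normal(size=200)
X_out = rng.normal(size=(12, 4))
y_out = X_out @ true_beta + rng.normal(size=12)

lambdas = [0.1, 1.0, 10.0, 100.0]
best_lam = select_lambda(X_db, y_db, X_out, y_out, lambdas)
intercept, slope = second_round(X_out @ ridge_fit(X_db, y_db, best_lam), y_out)
```

The residuals from the second-round fit would then stand in for expression adjusted for the rest of the network, which is the quantity the abstract proposes testing for differential expression.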
Accurately modeling LD in simulations is essential to correctly evaluate new and existing association methods. At present, there has been minimal research comparing the quality of existing gene region simulation methods to produce LD structures similar to an existing gene region. Here we compare the ability of three approaches to accurately simulate the LD within a gene region: HapSim (2005), Hapgen (2009), and a minor extension to simple haplotype resampling.
In order to observe the variation and bias for each method, we compare the simulated pairwise LD measures and minor allele frequencies to the original HapMap data in an extensive simulation study. When possible, we also evaluate the effects of changing parameters.
HapSim produces samples of haplotypes with lower LD, on average, compared to the original haplotype set while both our resampling method and Hapgen do not introduce this bias. The variation introduced across the replicates by our resampling method is quite small and may not provide enough sampling variability to make a generalizable simulation study.
We recommend using Hapgen to simulate replicate haplotypes from a gene region. Hapgen produces moderate sampling variation between the replicates while retaining the overall unique LD structure of the gene region.
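The comparison above rests on pairwise LD measures computed from simulated and original haplotypes. As an illustration, the sketch below computes the standard r² statistic for every pair of SNPs and draws a plain bootstrap resample of haplotypes. The bootstrap is only a stand-in for "simple haplotype resampling"; the minor extension evaluated in the study is not described in the abstract and is not reproduced here. Function names are hypothetical.

```python
import numpy as np


def pairwise_r2(haps):
    """Pairwise LD (r squared) between SNPs.

    haps: (n_haplotypes, n_snps) array of 0/1 alleles.
    r2[a, b] = D^2 / (p_a (1 - p_a) p_b (1 - p_b)), with D = p_ab - p_a p_b.
    Pairs involving a monomorphic SNP are returned as NaN.
    """
    haps = np.asarray(haps, dtype=float)
    p = haps.mean(axis=0)                      # allele-1 frequency per SNP
    p_ab = haps.T @ haps / haps.shape[0]       # joint frequency of allele 1 at both SNPs
    d = p_ab - np.outer(p, p)
    denom = np.outer(p * (1 - p), p * (1 - p))
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(denom > 0, d ** 2 / denom, np.nan)


def resample_haplotypes(haps, n_out, rng):
    """Draw n_out whole haplotypes with replacement (plain bootstrap)."""
    haps = np.asarray(haps)
    idx = rng.integers(0, haps.shape[0], size=n_out)
    return haps[idx]


# Two SNPs in complete LD, and two SNPs in linkage equilibrium.
linked = [[0, 0], [0, 0], [1, 1], [1, 1]]
unlinked = [[0, 0], [0, 1], [1, 0], [1, 1]]
r2_linked = pairwise_r2(linked)
r2_unlinked = pairwise_r2(unlinked)
replicate = resample_haplotypes(linked, n_out=10, rng=np.random.default_rng(1))
```

Because whole haplotypes are resampled intact, each replicate can only contain haplotypes already present in the original set. That is one plausible reason such a scheme would show the small between-replicate variation reported above.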
Nucleosomes are units of chromatin structure, consisting of DNA sequence wrapped around proteins called “histones.” Nucleosomes occur at variable intervals throughout genomic DNA and prevent transcription factor (TF) binding by blocking TF access to the DNA. A map of nucleosomal locations would enable researchers to detect TF binding sites with greater efficiency. Our objective is to construct an accurate genomic map of nucleosome-free regions (NFRs) based on data from high-throughput genomic tiling arrays in yeast. These high-volume data typically have a complex structure in the form of dependence on neighboring probes as well as underlying DNA sequence, variable-sized gaps, and missing data. We propose a novel continuous-index model appropriate for non-equispaced tiling array data that simultaneously incorporates DNA sequence features relevant to nucleosome formation. Simulation studies and an application to a yeast nucleosomal assay demonstrate the advantages of using the new modeling framework, as well as its robustness to distributional misspecifications. Our results reinforce the previous biological hypothesis that higher-order nucleotide combinations are important in distinguishing nucleosomal regions from NFRs.
Chromatin structure; Data augmentation; FAIRE; Tiling arrays
Muscle atrophy remains a significant concern in multiple inflammatory conditions, including injury, sepsis, cachexia and HIV-associated wasting. Herein, we show that inflammatory stressors, including TNF-α, IFN-γ or LPS, potently induced the novel expression of the RNA editor ADAR1, an observation not previously described in muscle cells. We also observed that cytokine stimulation suppressed muscle-associated microRNAs, an observation also not previously demonstrated. To map potential effects of ADAR1 induction in the muscle program, we conducted knockdown and over-expression studies in the mouse C2C12 muscle precursor cell (MPC) line and in primary human MPCs. We show that knockdown of stress-induced ADAR1 increased inflammation-mediated declines in the muscle differentiation markers myogenin and myosin heavy chain, and knockdown reduced levels of active phosphorylated Akt (phospho-Akt), but had no effect on microRNA transcript levels, suggesting a role for ADAR1 in buffering inflammatory stress effects on myogenic transcription and protein synthesis pathways. Additionally, over-expression of recombinant ADAR1 suppressed active phosphorylated dsRNA-dependent protein kinase (phospho-PKR), consistent with a role for ADAR1 in limiting inflammation-driven catabolic atrophy pathways. Collectively, these data identify a novel regulatory role for ADAR1 activation under inflammatory stress to both promote muscle protein synthesis pathways and limit atrophy pathways.
An important challenge in analyzing high dimensional data in regression settings is that of facing a situation in which the number of covariates p in the model greatly exceeds the sample size n (sometimes termed the “p > n” problem). In this article, we develop a novel specification for a general class of prior distributions, called Information Matrix (IM) priors, for high-dimensional generalized linear models. The priors are first developed for settings in which p < n, and then extended to the p > n case by defining a ridge parameter in the prior construction, leading to the Information Matrix Ridge (IMR) prior. The IM and IMR priors are based on a broad generalization of Zellner’s g-prior for Gaussian linear models. Various theoretical properties of the prior and implied posterior are derived including existence of the prior and posterior moment generating functions, tail behavior, as well as connections to Gaussian priors and Jeffreys’ prior. Several simulation studies and an application to a nucleosomal positioning data set demonstrate its advantages over Gaussian, as well as g-priors, in high dimensional settings.
Fisher Information; g-prior; Importance sampling; Model identifiability; Prior elicitation
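For orientation, the starting point named above, Zellner's g-prior, can be written out along with the reason a ridge term is needed when p > n. The first two displays are standard facts about the Gaussian linear model. The third is only an illustrative ridge modification written by analogy, with λ an assumed symbol; it is not the paper's definition of the IM or IMR prior, which the abstract does not give.

```latex
% Zellner's g-prior for the Gaussian linear model y = X\beta + \varepsilon,
% \varepsilon \sim N_n(0, \sigma^2 I_n), with prior mean \mu_0 and scale g > 0:
\beta \mid \sigma^2 \sim N_p\!\left(\mu_0,\; g\,\sigma^2 (X^\top X)^{-1}\right)

% The Fisher information for \beta in this model is \sigma^{-2} X^\top X,
% so the prior covariance is g times the inverse Fisher information:
\mathcal{I}(\beta) = \sigma^{-2} X^\top X
\quad\Longrightarrow\quad
\operatorname{Cov}(\beta \mid \sigma^2) = g\,\mathcal{I}(\beta)^{-1}

% Illustrative ridge modification (assumed form, \lambda > 0) for p > n:
\beta \mid \sigma^2 \sim N_p\!\left(\mu_0,\; g\,\sigma^2 (X^\top X + \lambda I_p)^{-1}\right)
```

When p > n, the matrix XᵀX has rank at most n and cannot be inverted, so the first display is undefined. Adding λI_p with λ > 0 makes the matrix positive definite, which is the general motivation for introducing a ridge parameter in the prior construction.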
In this article we propose a maximal a posteriori (MAP) criterion for model selection in the motif discovery problem and investigate conditions under which the MAP asymptotically gives a correct prediction of model size. We also investigate robustness of the MAP to prior specification and provide guidelines for choosing prior hyper-parameters for motif models based on sensitivity considerations.
Motif discovery; Model selection; Bayes factor; MAP
Motivation: Full-length DNA and protein sequences that span the entire length of a gene are ideally used for multiple sequence alignments (MSAs) and the subsequent inference of their relationships. Frequently, however, MSAs contain a substantial amount of missing data. For example, expressed sequence tags (ESTs), which are partial sequences of expressed genes, are the predominant source of sequence data for many organisms. The patterns of missing data typical for EST-derived alignments greatly compromise the accuracy of estimated phylogenies.
Results: We present a statistical method for inferring phylogenetic trees from EST-based incomplete MSA data. We propose a class of hierarchical models for modeling pairwise distances between the sequences, and develop a fully Bayesian approach for estimation of the model parameters. Once the distance matrix is estimated, the phylogenetic tree may be constructed by applying neighbor-joining (or any other algorithm of choice). We also show that maximizing the marginal likelihood from the Bayesian approach yields similar results to a profile likelihood estimation. The proposed methods are illustrated using simulated protein families, for which the true phylogeny is known, and one real protein family.
Availability: R code for fitting these models is available from: http://people.bu.edu/gupta/software.htm.
Supplementary information: Supplementary data are available at Bioinformatics online.
We propose a unified framework for the analysis of Chromatin (Ch) Immunoprecipitation (IP) microarray (ChIP-chip) data for detecting transcription factor binding sites (TFBSs) or motifs. ChIP-chip assays are used to focus the genome-wide search for TFBSs by isolating a sample of DNA fragments with TFBSs and applying this sample to a microarray with probes corresponding to tiled segments across the genome. Present analytical methods use a two-step approach: (i) analyze array data to estimate IP enrichment peaks, then (ii) analyze the corresponding sequences independently of intensity information. The proposed model integrates peak finding and motif discovery through a unified Bayesian hidden Markov model (HMM) framework that accommodates the inherent uncertainty in both measurements. A Markov chain Monte Carlo algorithm is formulated for parameter estimation, adapting recursive techniques used for HMMs. In simulations and applications to a yeast RAP1 dataset, the proposed method has favorable TFBS discovery performance compared to currently available two-stage procedures in terms of both sensitivity and specificity.
Data augmentation; Gene regulation; Tiling array; Transcription factor binding site
Oxidative DNA damage is one of the key events thought to be involved in mutation and cancer. The present study examined the accumulation of M1dG, 3-(2′-deoxy-β-D-erythro-pentofuranosyl)-pyrimido[1,2-a]-purin-10(3H)-one, DNA adducts after single dose or one-year exposure to polyhalogenated aromatic hydrocarbons (PHAH) in order to evaluate the potential role of oxidative DNA damage in PHAH toxicity and carcinogenicity. The effect of PHAH exposure on the number of M1dG adducts was explored initially in female mice exposed to a single dose of either 2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD) or a PHAH mixture. This study demonstrated that a single exposure to PHAH had no significant effect on the number of M1dG adducts compared to the corn oil control group. The role of M1dG adducts in polychlorinated biphenyl (PCB) induced toxicity and carcinogenicity was further investigated in rats exposed for a year to PCB 153, PCB 126, or a mixture of the two. PCB 153, at doses up to 3000 μg/kg/d, had no significant effect on the number of M1dG adducts in liver and brain tissues from the exposed rats compared to controls. However, 1000 ng/kg/d of PCB 126 resulted in M1dG adduct accumulation in the liver. More importantly, co-administration of equal proportions of PCB 153 and PCB 126 resulted in dose-dependent increases in M1dG adduct accumulation in the liver from 300-1000 ng/kg/d of PCB 126 with 300-1000 μg/kg/d of PCB 153. Interestingly, the co-administration of different amounts of PCB 153 with fixed amounts of PCB 126 demonstrated more M1dG adduct accumulation with higher doses of PCB 153. These results are consistent with the results from cancer bioassays that demonstrated a synergistic effect between PCB 126 and PCB 153 on toxicity and tumor development. In summary, the results from the present study support the hypothesis that oxidative DNA damage plays a key role in toxicity and carcinogenicity following long-term PCB exposure.