
Results 1-10 (10)

author:("Gupta, mayer")
1.  Associations of NINJ2 Sequence Variants with Incident Ischemic Stroke in the Cohorts for Heart and Aging in Genomic Epidemiology (CHARGE) Consortium 
PLoS ONE  2014;9(6):e99798.
Stroke, the leading neurologic cause of death and disability, has a substantial genetic component. We previously conducted a genome-wide association study (GWAS) in four prospective studies from the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) consortium and demonstrated that sequence variants near the NINJ2 gene are associated with incident ischemic stroke. Here, we sought to fine-map functional variants in the region and evaluate the contribution of rare variants to ischemic stroke risk.
Methods and Results
We sequenced 196 kb around NINJ2 on chromosome 12p13 among 3,986 European ancestry participants, including 475 ischemic stroke cases, from the Atherosclerosis Risk in Communities Study, Cardiovascular Health Study, and Framingham Heart Study. Meta-analyses of single-variant tests for 425 common variants (minor allele frequency [MAF] ≥ 1%) confirmed the original GWAS results and identified an independent intronic variant, rs34166160 (MAF = 0.012), most significantly associated with incident ischemic stroke (HR = 1.80, p = 0.0003). Aggregating 278 putatively functional variants with MAF ≤ 1% using count statistics, we observed a nominally statistically significant association, with the burden of rare NINJ2 variants contributing to decreased ischemic stroke incidence (HR = 0.81; p = 0.026).
Common and rare variants in the NINJ2 region were nominally associated with incident ischemic stroke among a subset of CHARGE participants. Allelic heterogeneity at this locus, caused by multiple rare, low frequency, and common variants with disparate effects on risk, may explain the difficulties in replicating the original GWAS results. Additional studies that take into account the complex allelic architecture at this locus are needed to confirm these findings.
PMCID: PMC4069013  PMID: 24959832
2.  Differential expression analysis with global network adjustment 
BMC Bioinformatics  2013;14:258.
Large-scale chromosomal deletions or other non-specific perturbations of the transcriptome can alter the expression of hundreds or thousands of genes, and it is of biological interest to understand which genes are most profoundly affected. We present a method for predicting a gene’s expression as a function of other genes thereby accounting for the effect of transcriptional regulation that confounds the identification of genes differentially expressed relative to a regulatory network. The challenge in constructing such models is that the number of possible regulator transcripts within a global network is on the order of thousands, and the number of biological samples is typically on the order of 10. Nevertheless, there are large gene expression databases that can be used to construct networks that could be helpful in modeling transcriptional regulation in smaller experiments.
We demonstrate a type of penalized regression model that can be estimated from large gene expression databases, and then applied to smaller experiments. The ridge parameter is selected by minimizing the cross-validation error of the predictions in the independent out-sample. This tends to increase the model stability and leads to a much greater degree of parameter shrinkage, but the resulting biased estimation is mitigated by a second round of regression. Nevertheless, the proposed computationally efficient “over-shrinkage” method outperforms previously used LASSO-based techniques. In two independent datasets, we find that the median proportion of explained variability in expression is approximately 25%, and this results in a substantial increase in the signal-to-noise ratio, allowing more powerful inferences on differential gene expression and leading to biologically intuitive findings. We also show that a large proportion of gene dependencies are conditional on the biological state, which would be impossible to detect with standard differential expression methods.
By adjusting for the effects of the global network on individual genes, both the sensitivity and reliability of differential expression measures are greatly improved.
PMCID: PMC3766173  PMID: 23968143
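The idea in the abstract above, predicting one gene's expression from other genes with a ridge penalty chosen by out-of-sample error, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `ridge_fit` and `select_lambda`, the synthetic data, and the penalty grid are all assumptions made for illustration, and the second round of regression mentioned in the abstract is omitted.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate: solve (X'X + lam * I) beta = X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def select_lambda(X_train, y_train, X_val, y_val, lambdas):
    """Return the penalty with the lowest mean squared error on held-out samples."""
    errors = []
    for lam in lambdas:
        beta = ridge_fit(X_train, y_train, lam)
        errors.append(float(np.mean((X_val @ beta - y_val) ** 2)))
    best = int(np.argmin(errors))
    return lambdas[best], errors

# Synthetic stand-in for expression data (hypothetical): rows are samples,
# columns are candidate regulator genes, y is the target gene's expression.
rng = np.random.default_rng(0)
n_train, n_val, p = 200, 50, 30
X_train = rng.normal(size=(n_train, p))
X_val = rng.normal(size=(n_val, p))
beta_true = np.zeros(p)
beta_true[:5] = 1.0
y_train = X_train @ beta_true + rng.normal(size=n_train)
y_val = X_val @ beta_true + rng.normal(size=n_val)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
best_lam, errors = select_lambda(X_train, y_train, X_val, y_val, lambdas)

# Network-adjusted expression: the part of the target gene's expression that
# the other genes do not explain.
residual = y_val - X_val @ ridge_fit(X_train, y_train, best_lam)
```

In this sketch the residual plays the role of the network-adjusted expression value that would then be tested for differential expression.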
3.  A Comparison of Gene Region Simulation Methods 
PLoS ONE  2012;7(7):e40925.
Accurately modeling LD in simulations is essential to correctly evaluate new and existing association methods. At present, there has been minimal research comparing the quality of existing gene region simulation methods to produce LD structures similar to an existing gene region. Here we compare the ability of three approaches to accurately simulate the LD within a gene region: HapSim (2005), Hapgen (2009), and a minor extension to simple haplotype resampling.
Methodology/Principal Findings
In order to observe the variation and bias for each method, we compare the simulated pairwise LD measures and minor allele frequencies to the original HapMap data in an extensive simulation study. When possible, we also evaluate the effects of changing parameters.
HapSim produces samples of haplotypes with lower LD, on average, compared to the original haplotype set while both our resampling method and Hapgen do not introduce this bias. The variation introduced across the replicates by our resampling method is quite small and may not provide enough sampling variability to make a generalizable simulation study.
We recommend using Hapgen to simulate replicate haplotypes from a gene region. Hapgen produces moderate sampling variation between the replicates while retaining the overall unique LD structure of the gene region.
PMCID: PMC3399793  PMID: 22815869
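As a rough illustration of the comparison described in the abstract above, the sketch below resamples haplotypes with replacement and computes a pairwise LD measure (r²) on the result. It assumes haplotypes coded as a 0/1 matrix with one row per haplotype and one column per SNP. The function names are hypothetical, and this is plain resampling, not the authors' extension of it.

```python
import numpy as np

def r_squared(haps, i, j):
    """Pairwise LD (r^2) between SNP columns i and j of a 0/1 haplotype matrix."""
    a, b = haps[:, i], haps[:, j]
    p_a, p_b = a.mean(), b.mean()
    d = (a * b).mean() - p_a * p_b          # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return float(d * d / denom) if denom > 0 else float("nan")

def resample_haplotypes(haps, n, rng):
    """Draw n haplotypes (rows) with replacement from the reference set."""
    idx = rng.integers(0, haps.shape[0], size=n)
    return haps[idx]

# Toy reference panel (hypothetical): SNPs 0 and 1 are perfectly correlated,
# SNP 2 is independent of both.
reference = np.array([
    [0, 0, 0],
    [0, 0, 1],
    [1, 1, 0],
    [1, 1, 1],
])
rng = np.random.default_rng(1)
replicate = resample_haplotypes(reference, 100, rng)
ld_reference = r_squared(reference, 0, 1)
ld_replicate = r_squared(replicate, 0, 1)
```

Repeating the resampling step many times and comparing the distribution of `ld_replicate` with `ld_reference` is one simple way to look at the bias and sampling variation the study discusses.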
4.  A continuous-index Bayesian hidden Markov model for prediction of nucleosome positioning in genomic DNA 
Biostatistics (Oxford, England)  2010;12(3):462-477.
Nucleosomes are units of chromatin structure, consisting of DNA sequence wrapped around proteins called “histones.” Nucleosomes occur at variable intervals throughout genomic DNA and prevent transcription factor (TF) binding by blocking TF access to the DNA. A map of nucleosomal locations would enable researchers to detect TF binding sites with greater efficiency. Our objective is to construct an accurate genomic map of nucleosome-free regions (NFRs) based on data from high-throughput genomic tiling arrays in yeast. These high-volume data typically have a complex structure in the form of dependence on neighboring probes as well as underlying DNA sequence, variable-sized gaps, and missing data. We propose a novel continuous-index model appropriate for non-equispaced tiling array data that simultaneously incorporates DNA sequence features relevant to nucleosome formation. Simulation studies and an application to a yeast nucleosomal assay demonstrate the advantages of using the new modeling framework, as well as its robustness to distributional misspecifications. Our results reinforce the previous biological hypothesis that higher-order nucleotide combinations are important in distinguishing nucleosomal regions from NFRs.
PMCID: PMC3114652  PMID: 21193724
Chromatin structure; Data augmentation; FAIRE; Tiling arrays
5.  The RNA Editor Gene Adar1 is Induced in Myoblasts by Inflammatory Ligands and Buffers Stress Response 
Muscle atrophy remains a significant concern in multiple inflammatory conditions, including injury, sepsis, cachexia and HIV-associated wasting. Herein, we show that inflammatory stressors, including TNF-α, IFN-γ or LPS, potently induced the novel expression of the RNA editor ADAR1, an observation not previously described in muscle cells. We also observed that cytokine stimulation suppressed muscle-associated microRNAs, an observation also not previously demonstrated. To map potential effects of ADAR1 induction in the muscle program, we conducted knockdown and over-expression studies in the mouse C2C12 muscle precursor cell (MPC) line and in primary human MPCs. We show that knockdown of stress-induced ADAR1 increased inflammation-mediated declines in the muscle differentiation markers myogenin and myosin heavy chain, and knockdown reduced levels of active phosphorylated Akt (phospho-Akt), but had no effect on microRNA transcript levels, suggesting a role for ADAR1 in buffering inflammatory stress effects on myogenic transcription and protein synthesis pathways. Additionally, over-expression of recombinant ADAR1 suppressed active phosphorylated dsRNA-dependent protein kinase (phospho-PKR), consistent with a role for ADAR1 in limiting inflammation-driven catabolic atrophy pathways. Collectively, these data identify a novel regulatory role for ADAR1 activation under inflammatory stress to both promote muscle protein synthesis pathways and limit atrophy pathways.
PMCID: PMC2897727  PMID: 20590675
6.  An Information Matrix Prior for Bayesian Analysis in Generalized Linear Models with High Dimensional Data 
Statistica Sinica  2009;19(4):1641-1663.
An important challenge in analyzing high dimensional data in regression settings is that of facing a situation in which the number of covariates p in the model greatly exceeds the sample size n (sometimes termed the “p > n” problem). In this article, we develop a novel specification for a general class of prior distributions, called Information Matrix (IM) priors, for high-dimensional generalized linear models. The priors are first developed for settings in which p < n, and then extended to the p > n case by defining a ridge parameter in the prior construction, leading to the Information Matrix Ridge (IMR) prior. The IM and IMR priors are based on a broad generalization of Zellner’s g-prior for Gaussian linear models. Various theoretical properties of the prior and implied posterior are derived including existence of the prior and posterior moment generating functions, tail behavior, as well as connections to Gaussian priors and Jeffreys’ prior. Several simulation studies and an application to a nucleosomal positioning data set demonstrate its advantages over Gaussian, as well as g-priors, in high dimensional settings.
PMCID: PMC2909687  PMID: 20664718
Fisher Information; g-prior; Importance sampling; Model identifiability; Prior elicitation
7.  Model selection and sensitivity analysis for sequence pattern models 
In this article we propose a maximal a posteriori (MAP) criterion for model selection in the motif discovery problem and investigate conditions under which the MAP asymptotically gives a correct prediction of model size. We also investigate robustness of the MAP to prior specification and provide guidelines for choosing prior hyper-parameters for motif models based on sensitivity considerations.
PMCID: PMC2887058  PMID: 20563269
Motif discovery; Model selection; Bayes factor; MAP
8.  A hierarchical model for incomplete alignments in phylogenetic inference 
Bioinformatics  2009;25(5):592-598.
Motivation: Full-length DNA and protein sequences that span the entire length of a gene are ideally used for multiple sequence alignments (MSAs) and the subsequent inference of their relationships. Frequently, however, MSAs contain a substantial amount of missing data. For example, expressed sequence tags (ESTs), which are partial sequences of expressed genes, are the predominant source of sequence data for many organisms. The patterns of missing data typical for EST-derived alignments greatly compromise the accuracy of estimated phylogenies.
Results: We present a statistical method for inferring phylogenetic trees from EST-based incomplete MSA data. We propose a class of hierarchical models for modeling pairwise distances between the sequences, and develop a fully Bayesian approach for estimation of the model parameters. Once the distance matrix is estimated, the phylogenetic tree may be constructed by applying neighbor-joining (or any other algorithm of choice). We also show that maximizing the marginal likelihood from the Bayesian approach yields similar results to a profile likelihood estimation. The proposed methods are illustrated using simulated protein families, for which the true phylogeny is known, and one real protein family.
Availability: R code for fitting these models is available from:
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2647833  PMID: 19147663
9.  A Bayesian hidden Markov model for motif discovery through joint modeling of genomic sequence and ChIP-chip data 
Biometrics  2009;65(4):1087-1095.
We propose a unified framework for the analysis of Chromatin (Ch) Immunoprecipitation (IP) microarray (ChIP-chip) data for detecting transcription factor binding sites (TFBSs) or motifs. ChIP-chip assays are used to focus the genome-wide search for TFBSs by isolating a sample of DNA fragments with TFBSs and applying this sample to a microarray with probes corresponding to tiled segments across the genome. Present analytical methods use a two-step approach: (i) analyze array data to estimate IP enrichment peaks, then (ii) analyze the corresponding sequences independently of intensity information. The proposed model integrates peak finding and motif discovery through a unified Bayesian hidden Markov model (HMM) framework that accommodates the inherent uncertainty in both measurements. A Markov Chain Monte Carlo algorithm is formulated for parameter estimation, adapting recursive techniques used for HMMs. In simulations and applications to a yeast RAP1 dataset, the proposed method has favorable TFBS discovery performance compared to currently available two-stage procedures in terms of both sensitivity and specificity.
PMCID: PMC2794970  PMID: 19210737
Data augmentation; Gene regulation; Tiling array; Transcription factor binding site
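The abstract above mentions adapting recursive techniques used for HMMs. One standard recursion of that kind is the scaled forward algorithm, sketched below for a generic discrete-state HMM. This is a textbook recursion, not the authors' model or sampler: the two-state setup ("background" versus "enriched"), the parameter values, and the function name `forward_loglik` are assumptions for illustration.

```python
import numpy as np

def forward_loglik(init, trans, emit_probs):
    """Scaled forward recursion for an HMM.

    init[k]          : probability of starting in state k
    trans[j, k]      : probability of moving from state j to state k
    emit_probs[t, k] : probability (or density) of observation t under state k
    Returns the log-likelihood of the whole observation sequence.
    """
    alpha = init * emit_probs[0]
    scale = alpha.sum()
    loglik = np.log(scale)
    alpha = alpha / scale
    for t in range(1, emit_probs.shape[0]):
        # Propagate one step through the transition matrix, then weight by emission.
        alpha = (alpha @ trans) * emit_probs[t]
        scale = alpha.sum()
        loglik += np.log(scale)
        alpha = alpha / scale
    return float(loglik)

# Hypothetical two-state example: state 0 = background, state 1 = enriched.
init = np.array([0.9, 0.1])
trans = np.array([[0.95, 0.05],
                  [0.20, 0.80]])
# Per-probe emission probabilities under each state (made-up numbers).
emit_probs = np.array([[0.8, 0.1],
                       [0.7, 0.2],
                       [0.1, 0.9],
                       [0.2, 0.8]])
loglik = forward_loglik(init, trans, emit_probs)
```

Rescaling `alpha` at every step keeps the recursion numerically stable on long probe sequences, which is why the log-likelihood is accumulated from the scaling constants.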
10.  Accumulation of M1dG DNA adducts after chronic exposure to PCBs, but not from acute exposure to polychlorinated aromatic hydrocarbons 
Free radical biology & medicine  2008;45(5):585-591.
Oxidative DNA damage is one of the key events thought to be involved in mutation and cancer. The present study examined the accumulation of M1dG, 3-(2′-deoxy-β-D-erythro-pentofuranosyl)-pyrimido[1,2-a]-purin-10(3H)-one, DNA adducts after single dose or one-year exposure to polyhalogenated aromatic hydrocarbons (PHAH) in order to evaluate the potential role of oxidative DNA damage in PHAH toxicity and carcinogenicity. The effect of PHAH exposure on the number of M1dG adducts was explored initially in female mice exposed to a single dose of either 2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD) or a PHAH mixture. This study demonstrated that a single exposure to PHAH had no significant effect on the number of M1dG adducts compared to the corn oil control group. The role of M1dG adducts in polychlorinated biphenyl (PCB) induced toxicity and carcinogenicity was further investigated in rats exposed for a year to PCB 153, PCB 126, or a mixture of the two. PCB 153, at doses up to 3000 μg/kg/d, had no significant effect on the number of M1dG adducts in liver and brain tissues from the exposed rats compared to controls. However, 1000 ng/kg/d of PCB 126 resulted in M1dG adduct accumulation in the liver. More importantly, co-administration of equal proportions of PCB 153 and PCB 126 resulted in dose-dependent increases in M1dG adduct accumulation in the liver from 300-1000 ng/kg/d of PCB 126 with 300-1000 μg/kg/d of PCB 153. Interestingly, the co-administration of different amounts of PCB 153 with fixed amounts of PCB 126 demonstrated more M1dG adduct accumulation with higher doses of PCB 153. These results are consistent with the results from cancer bioassays that demonstrated a synergistic effect between PCB 126 and PCB 153 on toxicity and tumor development. In summary, the results from the present study support the hypothesis that oxidative DNA damage plays a key role in toxicity and carcinogenicity following long-term PCB exposure.
PMCID: PMC2570591  PMID: 18534201