1.  Drug target inference through pathway analysis of genomics data 
Advanced drug delivery reviews  2013;65(7):966-972.
Statistical modeling coupled with bioinformatics is commonly used for drug discovery. Although there exist many approaches for single target based drug design and target inference, recent years have seen a paradigm shift to system-level pharmacological research. Pathway analysis of genomics data represents one promising direction for computational inference of drug targets. This article aims at providing a comprehensive review on the evolving issues in this field, covering methodological developments, their pros and cons, as well as future research directions.
PMCID: PMC3672337  PMID: 23369829
Drug target inference; pathway analysis; genomics; statistical modeling; factor model; data mining; optimization
2.  Empirical Bayes Correction for the Winner's Curse in Genetic Association Studies 
Genetic epidemiology  2012;37(1):60-68.
We consider an Empirical Bayes method to correct for the Winner's Curse phenomenon in genome-wide association studies. Our method utilizes the collective distribution of all odds ratios (ORs) to determine the appropriate correction for a particular single-nucleotide polymorphism (SNP). We can show that this approach is squared error optimal provided that this collective distribution is accurately estimated in its tails. To improve the performance when correcting the OR estimates for the most highly associated SNPs, we develop a second estimator that adaptively combines the Empirical Bayes estimator with a previously considered Conditional Likelihood estimator. The applications of these methods to both simulated and real data suggest improved performance in reducing selection bias.
PMCID: PMC4048064  PMID: 23012258
GWAS; Empirical Bayes; Winner's Curse
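The shrinkage idea behind entry 2 can be sketched in a few lines. This is a deliberately simplified illustration, not the authors' estimator: it assumes a zero-mean normal prior on the true log odds ratios and a common, known standard error, whereas the abstract describes using the full collective distribution of ORs. The function name `eb_shrink` and the simulated data are hypothetical.

```python
import random
import statistics


def eb_shrink(log_ors, se):
    """Shrink noisy log-OR estimates toward zero (normal-normal model).

    Assumes true effects ~ N(0, tau^2) and estimate = truth + N(0, se^2).
    tau^2 is estimated from all estimates jointly (method of moments).
    """
    # Mean squared estimate approximates tau^2 + se^2.
    total_var = statistics.pvariance(log_ors, mu=0.0)
    tau2 = max(total_var - se ** 2, 0.0)
    factor = tau2 / (tau2 + se ** 2)  # posterior-mean shrinkage factor
    return [factor * b for b in log_ors], factor


# Simulated study: many small true effects, each measured with noise.
random.seed(0)
se = 0.1
true_effects = [random.gauss(0.0, 0.05) for _ in range(5000)]
observed = [b + random.gauss(0.0, se) for b in true_effects]

shrunk, factor = eb_shrink(observed, se)

# The most extreme observed estimate is the one most exposed to the
# Winner's Curse; the correction pulls it back toward zero.
top = max(range(len(observed)), key=lambda i: observed[i])
print(f"shrinkage factor: {factor:.3f}")
print(f"top hit: observed {observed[top]:.3f}, corrected {shrunk[top]:.3f}")
```

Because every estimate is multiplied by the same factor, this sketch cannot adapt to heavy tails; that limitation is consistent with the abstract's remark that accuracy depends on estimating the collective distribution well in its tails.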
3.  Statistical tests for detecting associations with groups of genetic variants: generalization, evaluation, and implementation 
With recent advances in sequencing, genotyping arrays, and imputation, GWAS now aim to identify associations with rare and uncommon genetic variants. Here, we describe and evaluate a class of statistics, generalized score statistics (GSS), that can test for an association between a group of genetic variants and a phenotype. GSS are a simple weighted sum of single-variant statistics and their cross-products. We show that the majority of statistics currently used to detect associations with rare variants are equivalent to choosing a specific set of weights within this framework. We then evaluate the power of various weighting schemes as a function of variant characteristics, such as MAF, the proportion associated with the phenotype, and the direction of effect. Ultimately, we find that two classical tests are robust and powerful, but details are provided as to when other GSS may perform favorably. The software package CRaVe is available at our website (
PMCID: PMC3658182  PMID: 23092956
rare variants; score test; GWAS; association test
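The description in entry 3 of a statistic built as a weighted sum of single-variant statistics and their cross-products can be illustrated as a quadratic form. The sketch below is an assumption-laden toy: the function names, the unstandardized score definition, and the two weight matrices (all ones, and identity) are chosen for illustration and are not taken from the CRaVe package.

```python
def score_statistics(genotypes, phenotype):
    """Per-variant score U_j = sum_i (y_i - mean(y)) * g_ij."""
    n = len(phenotype)
    ybar = sum(phenotype) / n
    m = len(genotypes[0])
    return [
        sum((phenotype[i] - ybar) * genotypes[i][j] for i in range(n))
        for j in range(m)
    ]


def generalized_score(u, weights):
    """Q = sum_j sum_k w_jk * U_j * U_k (squares plus cross-products)."""
    m = len(u)
    return sum(weights[j][k] * u[j] * u[k] for j in range(m) for k in range(m))


# Toy data: 6 subjects, 3 rare variants; variant 3 acts in the opposite
# direction to variants 1 and 2.
genotypes = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0], [0, 0, 1], [0, 0, 0]]
phenotype = [1, 1, 0, 1, 0, 0]

u = score_statistics(genotypes, phenotype)
m = len(u)

# All-ones weights give (sum_j U_j)^2, a burden-style statistic.
burden_w = [[1.0] * m for _ in range(m)]
# Identity weights drop the cross-products, a variance-component-style statistic.
quad_w = [[1.0 if j == k else 0.0 for k in range(m)] for j in range(m)]

q_burden = generalized_score(u, burden_w)
q_quad = generalized_score(u, quad_w)
print(u, q_burden, q_quad)  # [1.0, 1.0, -1.0] 1.0 3.0
```

In this toy example the opposing effect cancels part of the signal in the burden-style statistic but not in the quadratic one, which is the kind of dependence on direction of effect the abstract mentions.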
4.  Data Pre-Processing for Label-Free Multiple Reaction Monitoring (MRM) Experiments 
Biology  2014;3(2):383-402.
Multiple Reaction Monitoring (MRM) conducted on a triple quadrupole mass spectrometer allows researchers to quantify the expression levels of a set of target proteins. Each protein is often characterized by several unique peptides that can be detected by monitoring predetermined fragment ions, called transitions, for each peptide. Concatenating large numbers of MRM transitions into a single assay enables simultaneous quantification of hundreds of peptides and proteins. In recognition of the important role that MRM can play in hypothesis-driven research and its increasing impact on clinical proteomics, targeted proteomics such as MRM was recently selected as the Nature Method of the Year. However, there are many challenges in MRM applications, especially data pre‑processing where many steps still rely on manual inspection of each observation in practice. In this paper, we discuss an analysis pipeline to automate MRM data pre‑processing. This pipeline includes data quality assessment across replicated samples, outlier detection, identification of inaccurate transitions, and data normalization. We demonstrate the utility of our pipeline through its applications to several real MRM data sets.
PMCID: PMC4085614  PMID: 24905083
multiple reaction monitoring; label-free; quality assessment; data normalization; proteomics; peptide; transition
5.  CB2 Receptor Activation Ameliorates the Proinflammatory Activity in Acute Lung Injury Induced by Paraquat 
BioMed Research International  2014;2014:971750.
Paraquat (PQ), a widely used herbicide, is well known to induce oxidative stress and lung injury. In the present study, we investigated the possible underlying mechanisms of cannabinoid receptor-2 (CB2) activation to ameliorate the proinflammatory activity induced by PQ in rats. JWH133, a CB2 agonist, was administered by intraperitoneal injection 1 h prior to PQ exposure. After PQ exposure for 4, 8, 24, and 72 h, the bronchoalveolar lavage fluid (BALF) was collected to determine levels of TNF-α and IL-1β, and the arterial blood samples were collected for detection of PaO2 level. At 72 h after PQ exposure, lung tissues were collected to determine the lung wet-to-dry weight ratios, myeloperoxidase (MPO) activity, lung histopathology, the protein expression level of CB2, MAPKs (ERK1/2, p38MAPK, and JNK1/2), and NF-κBp65. After rats were pretreated with JWH133, PQ-induced lung edema and lung histopathological changes were significantly attenuated. PQ-induced TNF-α and IL-1β secretion in BALF, increases of PaO2 in arterial blood, and MPO levels in the lung tissue were significantly reduced. JWH133 could efficiently activate CB2, while inhibiting MAPKs and NF-κB activation. The results suggested that activating CB2 receptor exerted protective activity against PQ-induced acute lung injury (ALI), and it potentially contributed to the suppression of the activation of MAPKs and NF-κB pathways.
PMCID: PMC4054852  PMID: 24963491
6.  T cell-intrinsic role of IL-6 signaling in primary and memory responses 
eLife  2014;3:e01949.
Innate immune recognition is critical for the induction of adaptive immune responses; however, the underlying mechanisms remain incompletely understood. In this study, we demonstrate that T cell-specific deletion of the IL-6 receptor α chain (IL-6Rα) results in impaired Th1 and Th17 T cell responses in vivo, and a defect in Tfh function. Depletion of Tregs in these mice rescued the Th1 but not the Th17 response. Our data suggest that IL-6 signaling in effector T cells is required to overcome Treg-mediated suppression in vivo. We show that IL-6 cooperates with IL-1β to block the suppressive effect of Tregs on CD4+ T cells, at least in part by controlling their responsiveness to IL-2. In addition, although IL-6Rα-deficient T cells mount normal primary Th1 responses in the absence of Tregs, they fail to mature into functional memory cells, demonstrating a key role for IL-6 in CD4+ T cell memory formation.
eLife digest
The human body's ability to defend itself against pathogens relies on two distinct but connected systems: the innate and the adaptive immune systems. Innate immune cells survey their environment and use receptors located on their surface to distinguish between molecules that are harmless and molecules that stem from pathogens. When the cells of the innate immune system detect a pathogen, they secrete signaling molecules to alert adaptive immune cells to the invaders. Both sets of immune cells then mount a coordinated attack that usually kills the pathogen.
The adaptive immune system also produces memory cells that retain information about the pathogen: this allows the organism to mount a fast and efficient immune response the next time the same type of pathogen strikes. However, it is not completely understood how the innate immune system communicates with the adaptive immune system to allow these processes to take place.
One of the signaling molecules involved in the communication between different types of immune cells is a protein called Interleukin 6 (IL-6). This protein must be produced in order to trigger the immune response: however, many immune cells are able to recognize and respond to IL-6, so it has been difficult to study its impact on specific cell types.
Nish et al. have now investigated the effects of IL-6 on T cells, one of the main types of adaptive immune cell, by creating mice with T cells that are not able to recognize IL-6. The detection of pathogens by innate immune cells normally has several effects: the population of T cells increases; the T cells produce daughter cells—T helper cells—that support innate immune cells in killing pathogens; and memory cells are formed. Nish et al. find that these responses are impaired in the mutant mice.
To understand why, Nish et al. turn to T regulatory cells; these are adaptive immune cells that control the strength of the immune response. These experiments show that when T cells are ‘blind’ to IL-6, they are more sensitive to the action of T regulatory cells, and this disturbs the delicate balance between the stimulation and inhibition of the immune system. Nish et al. go on to show that IL-6 works together with another signaling molecule, Interleukin 1, to regulate how the T cells respond. The work helps to explain how the adaptive immune system mounts an immune response against pathogens but not against the host's own tissues.
PMCID: PMC4046568  PMID: 24842874
cytokines; T cells; regulatory T cells; memory; mouse
7.  Characterization of Multidrug-Resistant Salmonella enterica Serovars Indiana and Enteritidis from Chickens in Eastern China 
PLoS ONE  2014;9(5):e96050.
A total of 310 Salmonella isolates were collected from 6 broiler farms in Eastern China and serotyped according to the Kauffmann-White classification. All isolates were examined for susceptibility to 17 commonly used antimicrobial agents, and representative isolates were examined for resistance genes and class 1 integrons using PCR. Clonality was determined by pulsed-field gel electrophoresis (PFGE). Two serotypes were detected among the 310 Salmonella strains: 133 Salmonella enterica serovar Indiana isolates and 177 Salmonella enterica serovar Enteritidis isolates. Antimicrobial sensitivity results showed that the isolates were generally resistant to sulfamethoxazole, ampicillin, tetracycline, doxycycline and trimethoprim, and 95% of the isolates were sensitive to amikacin and polymyxin. Among all Salmonella enterica serovar Indiana isolates, 108 (81.2%) possessed the blaTEM, floR, tetA, strA and aac (6')-Ib-cr resistance genes. The detected carriage rate of class 1 integrons was 66.5% (206/310), with 6 strains carrying the integron gene cassette dfr17-aadA5. The increasing frequency of multidrug resistance in Salmonella was associated with increasing prevalence of int1 genes (rs = 0.938, P = 0.00039). The int1, blaTEM, floR, tetA, strA and aac (6')-Ib-cr positive Salmonella enterica serovar Indiana isolates showed five major patterns as determined by PFGE. Most isolates exhibited the common PFGE patterns found from the chicken farms, suggesting that many multidrug-resistant isolates of Salmonella enterica serovar Indiana prevailed in these sources. Some isolates with similar antimicrobial resistance patterns represented a variety of Salmonella enterica serovar Indiana genotypes, and were derived from a different clone.
PMCID: PMC4008530  PMID: 24788434
8.  On Estimation of Allele Frequencies via Next-Generation DNA Resequencing with Barcoding 
Statistics in biosciences  2013;5(1):26-53.
Next Generation Sequencing (NGS) has revolutionized biomedical research in recent years. It is now commonly used to identify rare variants through re-sequencing individual genomes. Due to the cost of NGS, researchers have considered pooling samples as a cost-effective alternative to individual sequencing. In this article, we consider the estimation of allele frequencies of rare variants through the NGS technologies with pooled DNA samples with or without barcodes. We consider three methods for estimating allele frequencies from such data, including raw sequencing counts, inferred genotypes, and expected minor allele counts and compare their performance. Our simulation results suggest that the estimator based on inferred genotypes overall performs better than or as well as the other two estimators. When the sequencing coverage is low, biases and MSEs can be sensitive to the choice of the prior probabilities of genotypes for the estimators based on inferred genotypes and expected minor allele counts so that more accurate specification of prior probabilities is critical to lower biases and MSEs. Our study shows that the optimal number of barcodes in a pool is relatively robust to the frequencies of rare variants at a specific coverage depth. We provide general guidelines on using DNA pooling with barcoding for the estimation of allele frequencies of rare variants.
PMCID: PMC3666873  PMID: 23730349
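As a rough illustration of the simplest of the three estimators named in entry 8 (raw sequencing counts), the sketch below computes an allele frequency from pooled reads with and without barcode information. The read counts, the function names, and the choice to average per-barcode estimates with equal weight are all assumptions made for illustration; the abstract does not specify these details.

```python
def raw_count_frequency(alt_reads, total_reads):
    """Allele frequency as the fraction of reads carrying the variant."""
    return alt_reads / total_reads if total_reads else 0.0


def pooled_without_barcodes(counts):
    """Ignore barcodes: sum reads over the whole pool, then take the ratio."""
    alt = sum(a for a, _ in counts)
    total = sum(t for _, t in counts)
    return raw_count_frequency(alt, total)


def pooled_with_barcodes(counts):
    """Use barcodes: estimate per barcode, then average with equal weight,
    so unevenly sequenced barcodes do not dominate the estimate."""
    estimates = [raw_count_frequency(a, t) for a, t in counts]
    return sum(estimates) / len(estimates)


# Hypothetical (variant reads, total reads) at one site for three barcodes.
counts = [(2, 100), (0, 50), (1, 50)]

freq_no_barcode = pooled_without_barcodes(counts)  # 3 / 200
freq_barcode = pooled_with_barcodes(counts)        # (0.02 + 0 + 0.02) / 3
print(freq_no_barcode, freq_barcode)
```

The two numbers differ whenever coverage is uneven across barcodes, which is one reason barcoding can matter for rare-variant frequency estimation.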
9.  Rare deleterious mutations of the gene EFR3A in autism spectrum disorders 
Molecular Autism  2014;5:31.
Whole-exome sequencing studies in autism spectrum disorder (ASD) have identified de novo mutations in novel candidate genes, including the synaptic gene Eighty-five Requiring 3A (EFR3A). EFR3A is a critical component of a protein complex required for the synthesis of the phosphoinositide PtdIns4P, which has a variety of functions at the neural synapse. We hypothesized that deleterious mutations in EFR3A would be significantly associated with ASD.
We conducted a large case/control association study by deep resequencing and analysis of whole-exome data for coding and splice site variants in EFR3A. We determined the potential impact of these variants on protein structure and function by a variety of conservation measures and analysis of the Saccharomyces cerevisiae Efr3 crystal structure. We also analyzed the expression pattern of EFR3A in human brain tissue.
Rare nonsynonymous mutations in EFR3A were more common among cases (16 / 2,196 = 0.73%) than matched controls (12 / 3,389 = 0.35%) and were statistically more common at conserved nucleotides based on an experiment-wide significance threshold (P = 0.0077, permutation test). Crystal structure analysis revealed that mutations likely to be deleterious were also statistically more common in cases than controls (P = 0.017, Fisher exact test). Furthermore, EFR3A is expressed in cortical neurons, including pyramidal neurons, during human fetal brain development in a pattern consistent with ASD-related genes, and it is strongly co-expressed (P < 2.2 × 10⁻¹⁶, Wilcoxon test) with a module of genes significantly associated with ASD.
Rare deleterious mutations in EFR3A were found to be associated with ASD using an experiment-wide significance threshold. Synaptic phosphoinositide metabolism has been strongly implicated in syndromic forms of ASD. These data for EFR3A strengthen the evidence for the involvement of this pathway in idiopathic autism.
PMCID: PMC4032628  PMID: 24860643
Autism spectrum disorder; Genetics; Rare variants; EFR3A; Synapse; Phosphoinositide metabolism
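The carrier rates quoted in entry 9 can be checked directly, and the same 2×2 layout shows how a one-sided Fisher exact test is computed. This is only a sketch: applying the test to the overall counts is an assumption for illustration, and the P values reported in the abstract come from the authors' own tests on specific variant subsets, which are not reproduced here. The function name is hypothetical.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """One-sided Fisher exact test for the table [[a, b], [c, d]].

    Returns P(X >= a), where X is hypergeometric: the number of carriers
    falling in row 1 when the margins are held fixed.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    numer = sum(comb(row1, k) * comb(n - row1, col1 - k)
                for k in range(a, upper + 1))
    return numer / comb(n, col1)


# Counts from the abstract: carriers / total in cases and controls.
case_carriers, case_total = 16, 2196
ctrl_carriers, ctrl_total = 12, 3389

case_rate = 100 * case_carriers / case_total
ctrl_rate = 100 * ctrl_carriers / ctrl_total
print(f"cases {case_rate:.2f}%, controls {ctrl_rate:.2f}%")  # 0.73%, 0.35%

p_value = fisher_one_sided(
    case_carriers, case_total - case_carriers,
    ctrl_carriers, ctrl_total - ctrl_carriers,
)
print(f"illustrative one-sided P = {p_value:.4f}")
```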
10.  Accounting for non-genetic factors by low-rank representation and sparse regression for eQTL mapping 
Bioinformatics  2013;29(8):1026-1034.
Motivation: Expression quantitative trait loci (eQTL) studies investigate how gene expression levels are affected by DNA variants. A major challenge in inferring eQTL is that a number of factors, such as unobserved covariates, experimental artifacts and unknown environmental perturbations, may confound the observed expression levels. This may both mask real associations and lead to spurious association findings.
Results: In this article, we introduce a LOw-Rank representation to account for confounding factors and make use of Sparse regression for eQTL mapping (LORS). We integrate the low-rank representation and sparse regression into a unified framework, in which single-nucleotide polymorphisms and gene probes can be jointly analyzed. Given the two model parameters, our formulation is a convex optimization problem. We have developed an efficient algorithm to solve this problem and its convergence is guaranteed. We demonstrate its ability to account for non-genetic effects using simulation, and then apply it to two independent real datasets. Our results indicate that LORS is an effective tool to account for non-genetic effects. First, our detected associations show higher consistency between studies than recently proposed methods. Second, we have identified some new hotspots that cannot be identified without accounting for non-genetic effects.
Availability: The software is available at:
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3624800  PMID: 23419377
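One way to write down the kind of problem entry 10 describes, a sparse regression term plus a low-rank term that is convex once two tuning parameters are fixed, is shown below. The notation is an assumption for illustration and is not quoted from the paper: Y is the expression matrix, X the genotype matrix, B the sparse SNP-to-gene coefficient matrix, L the low-rank matrix absorbing non-genetic effects, and ρ and λ are the two model parameters.

```latex
\min_{B,\,L}\;
  \frac{1}{2}\,\lVert Y - XB - L \rVert_F^{2}
  \;+\; \rho\,\lVert B \rVert_{1}
  \;+\; \lambda\,\lVert L \rVert_{*}
```

Here the entrywise ℓ1 norm encourages sparsity in B and the nuclear norm (the sum of singular values) encourages low rank in L; both penalties are convex, so the whole objective is convex for fixed ρ and λ.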
11.  Statistical properties on semiparametric regression for evaluating pathway effects 
Most statistical methods for microarray data analysis consider one gene at a time, and they may miss subtle changes at the single gene level. This limitation may be overcome by considering a set of genes simultaneously where the gene sets are derived from prior biological knowledge. We define a pathway as a predefined set of genes that serve a particular cellular or physiological function. Limited work has been done in the regression settings to study the effects of clinical covariates and expression levels of genes in a pathway on a continuous clinical outcome. A semiparametric regression approach for identifying pathways related to a continuous outcome was proposed by Liu et al. (2007), who demonstrated the connection between a least squares kernel machine for nonparametric pathway effect and a restricted maximum likelihood (REML) for variance components. However, the asymptotic properties of semiparametric regression for identifying pathways have never been studied. In this paper, we study the asymptotic properties of the parameter estimates in semiparametric regression and compare Liu et al.’s REML with our REML obtained from a profile likelihood. We prove that both approaches provide consistent estimators, have a √n convergence rate under regularity conditions, and have either an asymptotically normal distribution or a mixture of normal distributions. However, the estimators based on our REML obtained from a profile likelihood have a theoretically smaller mean squared error than those of Liu et al.’s REML. A simulation study supports this theoretical result. A profile restricted likelihood ratio test is also provided for the non-standard testing problem. We apply our approach to a type II diabetes data set (Mootha et al., 2003).
PMCID: PMC3763850  PMID: 24014933
Gaussian random process; Kernel machine; Mixed model; Pathway analysis; Profile likelihood; Restricted maximum likelihood
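The semiparametric model that entry 11 refers to is usually written in the following general form. The symbols are assumptions for illustration rather than the paper's own notation: y_i is the continuous outcome, x_i the clinical covariates with parametric effect β, z_i the expression levels of the genes in the pathway, and h an unknown smooth function capturing the pathway effect.

```latex
y_i \;=\; x_i^{\top}\beta \;+\; h(z_i) \;+\; \varepsilon_i,
\qquad \varepsilon_i \sim N(0, \sigma^{2}),
\qquad i = 1, \dots, n
```

Treating h as a zero-mean Gaussian process with covariance τK, where K is a kernel matrix computed from the z_i, turns this into a linear mixed model with variance components τ and σ², which is the link to REML that the abstract mentions.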
12.  A graph theoretic approach to utilizing protein structure to identify non-random somatic mutations 
BMC Bioinformatics  2014;15:86.
It is well known that the development of cancer is caused by the accumulation of somatic mutations within the genome. For oncogenes specifically, current research suggests that there is a small set of "driver" mutations that are primarily responsible for tumorigenesis. Further, due to recent pharmacological successes in treating these driver mutations and their resulting tumors, a variety of approaches have been developed to identify potential driver mutations using methods such as machine learning and mutational clustering. We propose a novel methodology that increases our power to identify mutational clusters by taking into account protein tertiary structure via a graph theoretical approach.
We have designed and implemented GraphPAC (Graph Protein Amino acid Clustering) to identify mutational clustering while considering protein spatial structure. Using GraphPAC, we are able to detect novel clusters in proteins that are known to exhibit mutation clustering as well as identify clusters in proteins without evidence of prior clustering based on current methods. Specifically, by utilizing the spatial information available in the Protein Data Bank (PDB) along with the mutational data in the Catalogue of Somatic Mutations in Cancer (COSMIC), GraphPAC identifies new mutational clusters in well known oncogenes such as EGFR and KRAS. Further, by utilizing graph theory to account for the tertiary structure, GraphPAC discovers clusters in DPP4, NRP1 and other proteins not identified by existing methods. The R package is available at:
GraphPAC provides an alternative to iPAC and an extension to current methodology when identifying potential activating driver mutations by utilizing a graph theoretic approach when considering protein tertiary structure.
PMCID: PMC4024121  PMID: 24669769
13.  An international effort towards developing standards for best practices in analysis, interpretation and reporting of clinical genome sequencing results in the CLARITY Challenge 
Brownstein, Catherine A | Beggs, Alan H | Homer, Nils | Merriman, Barry | Yu, Timothy W | Flannery, Katherine C | DeChene, Elizabeth T | Towne, Meghan C | Savage, Sarah K | Price, Emily N | Holm, Ingrid A | Luquette, Lovelace J | Lyon, Elaine | Majzoub, Joseph | Neupert, Peter | McCallie Jr, David | Szolovits, Peter | Willard, Huntington F | Mendelsohn, Nancy J | Temme, Renee | Finkel, Richard S | Yum, Sabrina W | Medne, Livija | Sunyaev, Shamil R | Adzhubey, Ivan | Cassa, Christopher A | de Bakker, Paul IW | Duzkale, Hatice | Dworzyński, Piotr | Fairbrother, William | Francioli, Laurent | Funke, Birgit H | Giovanni, Monica A | Handsaker, Robert E | Lage, Kasper | Lebo, Matthew S | Lek, Monkol | Leshchiner, Ignaty | MacArthur, Daniel G | McLaughlin, Heather M | Murray, Michael F | Pers, Tune H | Polak, Paz P | Raychaudhuri, Soumya | Rehm, Heidi L | Soemedi, Rachel | Stitziel, Nathan O | Vestecka, Sara | Supper, Jochen | Gugenmus, Claudia | Klocke, Bernward | Hahn, Alexander | Schubach, Max | Menzel, Mortiz | Biskup, Saskia | Freisinger, Peter | Deng, Mario | Braun, Martin | Perner, Sven | Smith, Richard JH | Andorf, Janeen L | Huang, Jian | Ryckman, Kelli | Sheffield, Val C | Stone, Edwin M | Bair, Thomas | Black-Ziegelbein, E Ann | Braun, Terry A | Darbro, Benjamin | DeLuca, Adam P | Kolbe, Diana L | Scheetz, Todd E | Shearer, Aiden E | Sompallae, Rama | Wang, Kai | Bassuk, Alexander G | Edens, Erik | Mathews, Katherine | Moore, Steven A | Shchelochkov, Oleg A | Trapane, Pamela | Bossler, Aaron | Campbell, Colleen A | Heusel, Jonathan W | Kwitek, Anne | Maga, Tara | Panzer, Karin | Wassink, Thomas | Van Daele, Douglas | Azaiez, Hela | Booth, Kevin | Meyer, Nic | Segal, Michael M | Williams, Marc S | Tromp, Gerard | White, Peter | Corsmeier, Donald | Fitzgerald-Butt, Sara | Herman, Gail | Lamb-Thrush, Devon | McBride, Kim L | Newsom, David | Pierson, Christopher R | Rakowsky, Alexander T | Maver, Aleš | Lovrečić, Luca | Palandačić, Anja | Peterlin, Borut | 
Torkamani, Ali | Wedell, Anna | Huss, Mikael | Alexeyenko, Andrey | Lindvall, Jessica M | Magnusson, Måns | Nilsson, Daniel | Stranneheim, Henrik | Taylan, Fulya | Gilissen, Christian | Hoischen, Alexander | van Bon, Bregje | Yntema, Helger | Nelen, Marcel | Zhang, Weidong | Sager, Jason | Zhang, Lu | Blair, Kathryn | Kural, Deniz | Cariaso, Michael | Lennon, Greg G | Javed, Asif | Agrawal, Saloni | Ng, Pauline C | Sandhu, Komal S | Krishna, Shuba | Veeramachaneni, Vamsi | Isakov, Ofer | Halperin, Eran | Friedman, Eitan | Shomron, Noam | Glusman, Gustavo | Roach, Jared C | Caballero, Juan | Cox, Hannah C | Mauldin, Denise | Ament, Seth A | Rowen, Lee | Richards, Daniel R | Lucas, F Anthony San | Gonzalez-Garay, Manuel L | Caskey, C Thomas | Bai, Yu | Huang, Ying | Fang, Fang | Zhang, Yan | Wang, Zhengyuan | Barrera, Jorge | Garcia-Lobo, Juan M | González-Lamuño, Domingo | Llorca, Javier | Rodriguez, Maria C | Varela, Ignacio | Reese, Martin G | De La Vega, Francisco M | Kiruluta, Edward | Cargill, Michele | Hart, Reece K | Sorenson, Jon M | Lyon, Gholson J | Stevenson, David A | Bray, Bruce E | Moore, Barry M | Eilbeck, Karen | Yandell, Mark | Zhao, Hongyu | Hou, Lin | Chen, Xiaowei | Yan, Xiting | Chen, Mengjie | Li, Cong | Yang, Can | Gunel, Murat | Li, Peining | Kong, Yong | Alexander, Austin C | Albertyn, Zayed I | Boycott, Kym M | Bulman, Dennis E | Gordon, Paul MK | Innes, A Micheil | Knoppers, Bartha M | Majewski, Jacek | Marshall, Christian R | Parboosingh, Jillian S | Sawyer, Sarah L | Samuels, Mark E | Schwartzentruber, Jeremy | Kohane, Isaac S | Margulies, David M
Genome Biology  2014;15(3):R53.
There is tremendous potential for genome sequencing to improve clinical diagnosis and care once it becomes routinely accessible, but this will require formalizing research methods into clinical best practices in the areas of sequence data generation, analysis, interpretation and reporting. The CLARITY Challenge was designed to spur convergence in methods for diagnosing genetic disease starting from clinical case history and genome sequencing data. DNA samples were obtained from three families with heritable genetic disorders and genomic sequence data were donated by sequencing platform vendors. The challenge was to analyze and interpret these data with the goals of identifying disease-causing variants and reporting the findings in a clinically useful format. Participating contestant groups were solicited broadly, and an independent panel of judges evaluated their performance.
A total of 30 international groups were engaged. The entries reveal a general convergence of practices on most elements of the analysis and interpretation process. However, even given this commonality of approach, only two groups identified the consensus candidate variants in all disease cases, demonstrating a need for consistent fine-tuning of the generally accepted methods. There was greater diversity of the final clinical report content and in the patient consenting process, demonstrating that these areas require additional exploration and standardization.
The CLARITY Challenge provides a comprehensive assessment of current practices for using genome sequencing to diagnose and report genetic diseases. There is remarkable convergence in bioinformatic techniques, but medical interpretation and reporting are areas that require further development by many groups.
PMCID: PMC4073084  PMID: 24667040
14.  A genome wide association study of plasma uric acid levels in obese cases and never-overweight controls 
Obesity (Silver Spring, Md.)  2013;21(9):E490-E494.
To identify plasma uric acid related genes in extremely obese and normal weight individuals using genome wide association studies (GWAS).
Design and Methods
Using genotypes from a GWAS focusing on obesity and thinness, we performed quantitative trait association analyses (PLINK) for plasma uric acid levels in 1,060 extremely obese individuals [body mass index (BMI) > 35 kg/m²] and normal-weight controls (BMI < 25 kg/m²). Of the 961 samples with uric acid data, 924 were from females.
Significant associations were found between SLC2A9 gene SNPs and plasma uric acid levels (rs6449213, P = 3.15 × 10⁻¹²). DIP2C gene SNP rs877282 also reached genome wide significance (P = 4.56 × 10⁻⁸). Weaker associations (P < 1 × 10⁻⁵) were found in the F5, PXDNL, FRAS1, LCORL, and MICAL2 genes. Besides SLC2A9, 3 previously identified uric acid related genes, ABCG2 (rs2622605, P = 0.0026), SLC17A1 (rs3799344, P = 0.0017), and RREB1 (rs1615495, P = 0.00055), received marginal support in our study.
Two genes/chromosome regions reached genome wide association significance (P < 1 × 10⁻⁷, 550K SNPs) in our GWAS: SLC2A9, the chromosome 2 60.1 Mb region (rs6723995), and the DIP2C gene region. Five other genes (F5, PXDNL, FRAS1, LCORL, and MICAL2) yielded P < 1 × 10⁻⁵. Four previously reported associations were replicated in our study, including SLC2A9, ABCG2, RREB1, and SLC17A1.
PMCID: PMC3762924  PMID: 23703922
uric acid; genome wide association study; obesity
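The genome wide threshold quoted in entry 14 (P < 1 × 10⁻⁷ for 550K SNPs) is close to what a Bonferroni correction at a family-wise level of 0.05 would give. The sketch below shows that arithmetic; the 0.05 level and the use of Bonferroni are assumptions for illustration, since the abstract states only the threshold and the SNP count.

```python
# Bonferroni-style threshold: family-wise alpha divided by the number of tests.
n_snps = 550_000
alpha = 0.05
bonferroni = alpha / n_snps
print(f"Bonferroni threshold: {bonferroni:.2e}")  # 9.09e-08

# P values for the two top SNPs as reported in the abstract.
reported = {
    "rs6449213 (SLC2A9)": 3.15e-12,
    "rs877282 (DIP2C)": 4.56e-8,
}

# Compare each against the abstract's stated cutoff and the Bonferroni value.
stated_cutoff = 1e-7
for snp, p in reported.items():
    print(snp, "passes stated cutoff:", p < stated_cutoff,
          "| passes Bonferroni:", p < bonferroni)
```

Both reported P values fall below either cutoff, so the choice between the rounded threshold and the exact Bonferroni value does not change the conclusions for these two SNPs.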
15.  Sparse Estimation of Conditional Graphical Models With Application to Gene Networks 
In many applications the graph structure in a network arises from two sources: intrinsic connections and connections due to external effects. We introduce a sparse estimation procedure for graphical models that is capable of isolating the intrinsic connections by removing the external effects. Technically, this is formulated as a conditional graphical model, in which the external effects are modeled as predictors, and the graph is determined by the conditional precision matrix. We introduce two sparse estimators of this matrix using the reproducing kernel Hilbert space combined with lasso and adaptive lasso. We establish the sparsity, variable selection consistency, oracle property, and the asymptotic distributions of the proposed estimators. We also develop their convergence rate when the dimension of the conditional precision matrix goes to infinity. The methods are compared with sparse estimators for unconditional graphical models, and with the constrained maximum likelihood estimate that assumes a known graph structure. The methods are applied to a genetic data set to construct a gene network conditioning on single-nucleotide polymorphisms.
PMCID: PMC3932550  PMID: 24574574
Conditional random field; Gaussian graphical models; Lasso and adaptive lasso; Oracle property; Reproducing kernel Hilbert space; Sparsity; Sparsistency; von Mises expansion
16.  Sparse principal component analysis by choice of norm 
Recent years have seen the development of several methods for sparse principal component analysis due to its importance in the analysis of high dimensional data. Despite the demonstration of their usefulness in practical applications, they are limited by the lack of orthogonality in the loadings (coefficients) of different principal components, the existence of correlation in the principal components, the expensive computation needed, and the lack of theoretical results such as consistency in high-dimensional situations. In this paper, we propose a new sparse principal component analysis method by introducing a new norm to replace the usual norm in traditional eigenvalue problems, and propose an efficient iterative algorithm to solve the optimization problems. With this method, we can efficiently obtain uncorrelated principal components or orthogonal loadings, and achieve the goal of explaining a high percentage of variations with sparse linear combinations. Due to the strict convexity of the new norm, we can prove the convergence of the iterative method and provide the detailed characterization of the limits. We also prove that the obtained principal component is consistent for a single component model in high dimensional situations. As illustration, we apply this method to real gene expression data with competitive results.
PMCID: PMC3601508  PMID: 23524453
sparse principal component analysis; high-dimensional data; uncorrelated or orthogonal principal components; iterative algorithm; consistency in high-dimensional
17.  Cytogenomic mapping and bioinformatic mining reveal interacting brain expressed genes for intellectual disability 
Microarray analysis has been used as a first-tier genetic test to detect chromosomal imbalances and copy number variants (CNVs) for pediatric patients with intellectual and developmental disabilities (ID/DD). To further investigate the candidate genes and underlying dosage-sensitive mechanisms related to ID, cytogenomic mapping of critical regions and bioinformatic mining of candidate brain-expressed genes (BEGs) and their functional interactions were performed. Critical regions of chromosomal imbalances and pathogenic CNVs were mapped by subtracting known benign CNVs from the Database of Genomic Variants (DGV) and extracting smallest overlap regions with cases from DatabasE of Chromosomal Imbalance and Phenotype in Humans using Ensembl Resources (DECIPHER). BEGs from these critical regions were revealed by functional annotation using Database for Annotation, Visualization, and Integrated Discovery (DAVID) and by tissue expression pattern from UniProt. Cross-region interrelations and functional networks of the BEGs were analyzed using Gene Relationships Across Implicated Loci (GRAIL) and Ingenuity Pathway Analysis (IPA).
Of the 1,354 patients analyzed by oligonucleotide array comparative genomic hybridization (aCGH), pathogenic abnormalities were detected in 176 patients including genomic disorders in 66 patients (37.5%), subtelomeric rearrangements in 45 patients (25.6%), interstitial imbalances in 33 patients (18.8%), chromosomal structural rearrangements in 17 patients (9.7%) and aneuploidies in 15 patients (8.5%). Subtractive and extractive mapping defined 82 disjointed critical regions from the detected abnormalities. A total of 461 BEGs were generated from 73 disjointed critical regions. Enrichment of central nervous system specific genes in these regions was noted. The number of BEGs increased with the size of the regions. A list of 108 candidate BEGs with significant cross-region interrelation was identified by GRAIL, and five significant gene networks involving cell cycle, cell-to-cell signaling, cellular assembly, cell morphology, and gene expression regulation were identified by IPA.
These results characterized ID related cross-region interrelations and multiple networks of candidate BEGs from the detected genomic imbalances. Further experimental study of these BEGs and their interactions will lead to a better understanding of dosage-sensitive mechanisms and modifying effects of human mental development.
PMCID: PMC3905969  PMID: 24410907
Intellectual disability; Critical regions; Brain expressed genes; Cross-region gene interrelation; Functional network
18.  Array-based Profiling of DNA Methylation Changes Associated with Alcohol Dependence 
Epigenetic regulation through DNA methylation may influence vulnerability to numerous disorders, including alcohol dependence (AD).
Peripheral blood DNA methylation levels of 384 CpGs in the promoter regions of 82 candidate genes were examined in 285 African Americans (AAs; 141 AD cases and 144 controls) and 249 European Americans (EAs; 144 AD cases and 105 controls) using Illumina GoldenGate Methylation Array assays. Associations between AD and DNA methylation changes were analyzed using multivariate analyses of covariance with frequency of intoxication, sex, age and ancestry proportion as covariates. CpGs showing significant methylation alterations in AD cases were further examined in a replication sample (49 EA cases and 32 EA controls) using Sequenom’s MassARRAY EpiTYPER technology.
In AAs, two CpGs in two genes (GABRB3 and POMC) were hypermethylated in AD cases compared to controls (P≤0.001). In EAs, six CpGs in six genes (HTR3A, NCAM1, DRD4, MBD3, HTR2B and GRIN1) were hypermethylated in AD cases compared to controls (P≤0.001); CpG cg08989585 in the HTR3A promoter region showed a significantly higher methylation level in EA cases than in EA controls after Bonferroni correction (P=0.00007). Additionally, methylation levels of six CpGs (including cg08989585) in the HTR3A promoter region were analyzed in the replication sample. Although the six HTR3A promoter CpGs did not show significant methylation differences between EA cases and EA controls (P=0.067–0.877), the methylation level of CpG cg08989585 was non-significantly higher in EA cases (26.9%) than in EA controls (18.6%) (P=0.139).
The findings from this study suggest that DNA methylation profiles are associated with AD in a population-specific way and that the predisposition to AD may result from a complex interplay of genetic variation and epigenetic modifications.
PMCID: PMC3511647  PMID: 22924764
Illumina GoldenGate Methylation Array; Sequenom MassARRAY EpiTYPER; Promoter CpGs; Alcohol Dependence; Peripheral Blood DNA
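The abstract reports that one CpG remained significant after Bonferroni correction (P=0.00007) but does not state the threshold. The sketch below works through the arithmetic under two assumptions that are not in the source: a family-wise error rate of 0.05, and a correction over all 384 CpGs assayed.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

n_cpgs = 384        # CpGs assayed (from the abstract)
alpha = 0.05        # ASSUMED family-wise error rate; not stated in the abstract
threshold = bonferroni_threshold(alpha, n_cpgs)

p_htr3a = 0.00007   # reported P for cg08989585 in European Americans
p_nominal = 0.001   # upper bound reported for the other hypermethylated CpGs

print(f"threshold = {threshold:.2e}")
print("cg08989585 passes:", p_htr3a < threshold)
print("P = 0.001 passes:", p_nominal < threshold)
```

Under these assumptions the threshold is about 1.3 × 10⁻⁴, which is consistent with the abstract singling out cg08989585: a P value equal to the reported bound of 0.001 would not pass it.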
19.  Estimating the Proportion of True Null Hypotheses Using the Pattern of Observed p-values 
Journal of applied statistics  2013;40(9):1949-1964.
Estimating the proportion of true null hypotheses, π0, has attracted much attention in the recent statistical literature. Besides its apparent relevance for a set of specific scientific hypotheses, an accurate estimate of this parameter is key for many multiple testing procedures. Most existing methods for estimating π0 in the literature are motivated by the assumption that test statistics are independent, which is often not true in reality. Simulations indicate that most existing estimators can perform poorly when test statistics are dependent, mainly because of increased variance in these estimators. In this paper, we propose several data-driven methods for estimating π0 by incorporating the distribution pattern of the observed p-values as a practical approach to address potential dependence among test statistics. Specifically, we use a linear fit to give a data-driven estimate for the proportion of true-null p-values in (λ, 1] over the whole range [0, 1] instead of using the expected proportion at 1 − λ. We find that the proposed estimators may substantially decrease the variance of the estimated true null proportion and thus improve the overall performance.
PMCID: PMC3781956  PMID: 24078762
gene expression data; multiple testing; proportion of true null hypotheses; p-value
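For context, the fixed-λ baseline that the abstract contrasts with can be sketched as follows. This is the widely used Storey-type estimator, which assumes that true-null p-values are uniform on [0, 1], so a fraction 1 − λ of them is expected to fall in (λ, 1]. It is not the linear-fit estimator proposed in the paper, and the function name and toy p-values are hypothetical.

```python
import numpy as np

def storey_pi0(pvalues, lam=0.5):
    """Fixed-lambda estimate of the proportion of true null hypotheses.

    pi0_hat = #{p_i > lam} / (m * (1 - lam)), capped at 1.
    """
    p = np.asarray(pvalues, dtype=float)
    pi0 = np.mean(p > lam) / (1.0 - lam)
    return min(pi0, 1.0)

# Hypothetical toy data: 80 null p-values on an even grid, 20 small alternatives.
null_p = (np.arange(80) + 0.5) / 80
alt_p = np.full(20, 0.001)
pvals = np.concatenate([null_p, alt_p])

pi0_hat = storey_pi0(pvals, lam=0.5)
print(pi0_hat)
```

With this idealized grid the estimate recovers the true proportion of 0.8. In real data the count of p-values above λ fluctuates, more so under dependence, which is the variance problem the abstract describes.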
20.  Extended haplotype association study in Crohn’s disease identifies a novel, Ashkenazi Jewish-specific missense mutation in the NF-κB pathway gene, HEATR3 
Genes and immunity  2013;14(5):310-316.
The Ashkenazi Jewish population has a several-fold higher prevalence of Crohn’s disease compared to non-Jewish European ancestry populations and has a unique genetic history. Haplotype association is critical to Crohn’s disease etiology in this population, most notably at NOD2, in which three causal, uncommon, and conditionally independent NOD2 variants reside on a shared background haplotype. We present an analysis of extended haplotypes which showed significantly greater association to Crohn’s disease in the Ashkenazi Jewish population compared to a non-Jewish population (145 haplotypes and no haplotypes with P-value < 10⁻³, respectively). Two haplotype regions, one each on chromosomes 16 and 21, conferred increased disease risk within established Crohn’s disease loci. We performed exome sequencing of 55 Ashkenazi Jewish individuals and follow-up genotyping focused on variants in these two regions. We observed Ashkenazi Jewish-specific nominal association at R755C in TRPM2 on chromosome 21. Within the chromosome 16 region, R642S of HEATR3 and rs9922362 of BRD7 showed genome-wide significance. Expression studies of HEATR3 demonstrated a positive role in NOD2-mediated NF-κB signaling. The BRD7 signal showed conditional dependence with only the downstream rare Crohn’s disease-causal variants in NOD2, but not with the background haplotype; this elaborates NOD2 as a key illustration of synthetic association.
PMCID: PMC3785105  PMID: 23615072
haplotype association; Ashkenazi Jewish; Crohn’s disease; NF-κB signaling; synthetic association
21.  Joint conditional Gaussian graphical models with multiple sources of genomic data 
Frontiers in Genetics  2013;4:294.
It is challenging to identify meaningful gene networks because biological interactions are often condition-specific and confounded with external factors. It is necessary to integrate multiple sources of genomic data to facilitate network inference. For example, one can jointly model expression datasets measured from multiple tissues with molecular marker data in so-called genetical genomic studies. In this paper, we propose a joint conditional Gaussian graphical model (JCGGM) that aims for modeling biological processes based on multiple sources of data. This approach is able to integrate multiple sources of information by adopting conditional models combined with joint sparsity regularization. We apply our approach to a real dataset measuring gene expression in four tissues (kidney, liver, heart, and fat) from recombinant inbred rats. Our approach reveals that the liver tissue has the highest level of tissue-specific gene regulations among genes involved in insulin responsive facilitative sugar transporter mediated glucose transport pathway, followed by heart and fat tissues, and this finding can only be attained from our JCGGM approach.
PMCID: PMC3865369  PMID: 24381584
Gaussian graphical models; gene networks; GGMs; conditional GGMs; joint sparsity
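As background for the model class named above, an edge in a Gaussian graphical model corresponds to a nonzero entry of the precision (inverse covariance) matrix, or equivalently a nonzero partial correlation. The sketch below illustrates only that basic relationship on a hypothetical three-variable chain. It does not implement the conditional modeling or joint sparsity regularization of the JCGGM, and all names and values are assumptions.

```python
import numpy as np

def partial_correlations(cov):
    """Partial correlations from a covariance matrix via its inverse.

    rho_ij = -theta_ij / sqrt(theta_ii * theta_jj), where theta = inv(cov).
    """
    precision = np.linalg.inv(cov)
    d = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor

# Hypothetical chain g1 - g2 - g3: no direct edge between g1 and g3.
theta = np.array([[ 2.0, -1.0,  0.0],
                  [-1.0,  2.0, -1.0],
                  [ 0.0, -1.0,  2.0]])
cov = np.linalg.inv(theta)

pcor = partial_correlations(cov)
print("marginal covariance g1,g3:", cov[0, 2])
print("partial correlation g1,g3:", pcor[0, 2])
```

Here g1 and g3 are marginally correlated through g2, but their partial correlation is zero, so the graph has no g1–g3 edge. Sparse estimation methods aim to recover such zeros from data.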
23.  De novo mutations in histone modifying genes in congenital heart disease 
Nature  2013;498(7453):220-223.
Congenital heart disease (CHD) is the most frequent birth defect, affecting 0.8% of live births [1]. Many cases occur sporadically and impair reproductive fitness, suggesting a role for de novo mutations. By analysis of exome sequencing of parent-offspring trios, we compared the incidence of de novo mutations in 362 severe CHD cases and 264 controls. CHD cases showed a significant excess of protein-altering de novo mutations in genes expressed in the developing heart, with an odds ratio of 7.5 for damaging mutations. Similar odds ratios were seen across major classes of severe CHD. We found a marked excess of de novo mutations in genes involved in production, removal or reading of H3K4 methylation (H3K4me), or ubiquitination of H2BK120, which is required for H3K4 methylation [2–4]. There were also two de novo mutations in SMAD2; SMAD2 signaling in the embryonic left-right organizer induces demethylation of H3K27me [5]. H3K4me and H3K27me mark ‘poised’ promoters and enhancers that regulate expression of key developmental genes [6]. These findings implicate de novo point mutations in several hundred genes that collectively contribute to ~10% of severe CHD.
PMCID: PMC3706629  PMID: 23665959
24.  A review of post-GWAS prioritization approaches 
Frontiers in Genetics  2013;4:280.
In the past decade, high-throughput genotyping and next-generation sequencing platforms have enabled genome-wide association studies (GWAS) of many complex human diseases. These studies have discovered many disease susceptibility loci and unveiled unexpected disease mechanisms. Despite these successes, the identified variants explain only a small proportion of the genetic contributions to these diseases, and many more remain to be found. This is largely due to the small effect sizes of most disease-associated variants and limited sample sizes. As a result, it is critical to leverage other information to more effectively prioritize GWAS signals to increase replication rates and better understand disease mechanisms. In this review, we introduce the biological/genomic features that have been found to be informative for post-GWAS prioritization, and discuss available tools to utilize these features for prioritization.
PMCID: PMC3856625  PMID: 24367376
genome-wide association studies; prioritization; eQTL; DNase I hypersensitive site; non-coding
25.  SomatiCA: Identifying, Characterizing and Quantifying Somatic Copy Number Aberrations from Cancer Genome Sequencing Data 
PLoS ONE  2013;8(11):e78143.
Whole genome sequencing of matched tumor-normal sample pairs is becoming routine in cancer research. However, analysis of somatic copy-number changes from sequencing data is still challenging because of insufficient sequencing coverage, unknown tumor sample purity and subclonal heterogeneity. Here we describe a computational framework, named SomatiCA, which explicitly accounts for tumor purity and subclonality in the analysis of somatic copy-number profiles. Taking read depths (RD) and lesser allele frequencies (LAF) as input, SomatiCA will output 1) admixture rate for each tumor sample, 2) somatic allelic copy-number for each genomic segment, 3) fraction of tumor cells with subclonal change in each somatic copy number aberration (SCNA), and 4) a list of substantial genomic aberration events including gain, loss and LOH. SomatiCA is available as a Bioconductor R package at
PMCID: PMC3827077  PMID: 24265680
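To illustrate why tumor purity matters for the two inputs named above, the sketch below uses a common two-population mixture model: a fraction `purity` of cells are tumor cells with total copy number `c_tumor` and lesser-allele copy number `b_tumor`, and the rest are normal diploid cells. This is a textbook-style approximation under stated assumptions (no subclonality, no ploidy normalization), not SomatiCA's implementation; all names and values are hypothetical.

```python
def expected_depth_ratio(purity, c_tumor):
    """Expected tumor/normal read-depth ratio for a segment in a tumor-normal mixture."""
    return (purity * c_tumor + (1.0 - purity) * 2.0) / 2.0

def expected_laf(purity, c_tumor, b_tumor):
    """Expected lesser allele frequency at a germline heterozygous site.

    Normal cells contribute one copy of each allele; tumor cells contribute
    b_tumor copies of the lesser allele out of c_tumor total copies.
    """
    lesser = purity * b_tumor + (1.0 - purity) * 1.0
    total = purity * c_tumor + (1.0 - purity) * 2.0
    return lesser / total

# Hypothetical example: hemizygous loss (one copy left) in a 60% pure sample.
ratio = expected_depth_ratio(purity=0.6, c_tumor=1)
laf = expected_laf(purity=0.6, c_tumor=1, b_tumor=0)
print(ratio, laf)
```

In a pure tumor the same loss would give a depth ratio of 0.5 and a LAF of 0. Normal-cell admixture pulls both values toward the diploid expectations of 1.0 and 0.5, which is why purity has to be modeled when calling copy-number changes.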

