Chronic alcohol consumption may induce gene expression alterations in brain reward regions such as the prefrontal cortex (PFC), modulating the risk of alcohol use disorders (AUDs). Transcriptome profiles of 23 AUD cases and 23 matched controls (16 pairs of males and 7 pairs of females) in postmortem PFC were generated using Illumina’s HumanHT-12 v4 Expression BeadChip. Probe-level differentially expressed genes and gene modules in AUD subjects were identified using multiple linear regression and weighted gene co-expression network analyses. The enrichment of differentially co-expressed genes in alcohol dependence-associated genes identified by genome-wide association studies (GWAS) was examined using gene set enrichment analysis. Biological pathways overrepresented by differentially co-expressed genes were uncovered using DAVID bioinformatics resources. Three AUD-associated gene modules in males [Module 1 (561 probes mapping to 505 genes): r=0.42, Pcorrelation=0.020; Module 2 (815 probes mapping to 713 genes): r=0.41, Pcorrelation=0.020; Module 3 (1,446 probes mapping to 1,305 genes): r=−0.38, Pcorrelation=0.030] and one AUD-associated gene module in females [Module 4 (683 probes mapping to 652 genes): r=0.64, Pcorrelation=0.010] were identified. Differentially expressed genes mapped by significant expression probes (Pnominal≤0.05) clustered in Modules 1 and 2 were enriched in GWAS-identified alcohol dependence-associated genes [Module 1 (134 genes): P=0.028; Module 2 (243 genes): P=0.004]. These differentially expressed genes, including ALDH2, ALDH7A1, and ALDH9A1, are involved in cellular functions such as aldehyde detoxification, mitochondrial function, and fatty acid metabolism. Our study revealed differentially co-expressed genes in postmortem PFC of AUD subjects and demonstrated that some of these differentially co-expressed genes participate in alcohol metabolism.
Alcohol use disorders; postmortem prefrontal cortex; genome-wide gene expression; co-expression; gene set enrichment analysis; biological pathways
Cancer biomarker discovery can facilitate drug development, improve staging of patients, and predict patient prognosis. Because cancer is the result of many interacting genes, analysis based on a set of genes with related biological functions or pathways may be more informative than single gene-based analysis for cancer biomarker discovery. The relevant pathways thus identified may help characterize different aspects of molecular phenotypes related to the tumor. Although it is well known that cancer patients may respond to the same treatment differently because of clinical variables and variation of molecular phenotypes, this patient heterogeneity has not been explicitly considered in pathway analysis in the literature. We hypothesize that combining pathway and patient clinical information can more effectively identify relevant pathways pertinent to specific patient subgroups, leading to better diagnosis and treatment. In this article, we propose to perform stratified pathway analysis based on clinical information from patients. In contrast to analysis using all the patients, this more focused analysis has the potential to reveal subgroup-specific pathways that may lead to more biological insights into disease etiology and treatment response. As an illustration, the power of our approach is demonstrated through its application to a breast cancer dataset in which the patients are stratified according to their oral contraceptive use.
cancer; random forests; pathways; progesterone receptor
We report a GWAS for cocaine dependence (CD) in three sets of African- and European-American subjects (AAs and EAs, respectively), to identify pathways, genes, and alleles important in CD risk.
The discovery GWAS dataset (n=5,697 subjects) was genotyped using the Illumina OmniQuad microarray (890,000 analyzed SNPs). Additional genotypes were imputed based on the 1000 Genomes reference panel. Top-ranked findings were evaluated by incorporating information from publicly available GWAS data from 4,063 subjects. Then, the most significant GWAS SNPs were genotyped in 2,549 independent subjects.
We observed one genomewide-significant (GWS) result: rs7086629 at the FAM53B (“family with sequence similarity 53, member B”) locus. This was supported in both AAs and EAs; p-value (meta-analysis of all samples) = 4.28×10⁻⁸. The gene maps to the same chromosomal region as the maximum peak we observed in a previous linkage study. NCOR2 (nuclear receptor corepressor 2) SNP rs150954431 was associated with p=1.19×10⁻⁹ in the EA discovery sample. SNP rs2456778, which maps to CDK1 (“cyclin-dependent kinase 1”), was associated with cocaine-induced paranoia in AAs in the discovery sample only (p=4.68×10⁻⁸).
This is the first study to identify risk variants for CD using GWAS. Our results implicate novel risk loci and provide insights into potential therapeutic and prevention strategies.
Cocaine dependence; cocaine-induced paranoia; GWAS; population genetics; European-American and African-American populations
Many areas critical to agricultural production and research, such as breeding and trait mapping in plants and livestock, require robust and scalable genotyping platforms. Genotyping-by-sequencing (GBS) is one such method highly suited to non-human organisms. In the GBS protocol, genomic DNA is fractionated via restriction digest, then reduced representation is achieved through size selection. Since many restriction sites are conserved across a species, the sequenced portion of the genome is highly consistent within a population. This makes the GBS protocol highly suited for experiments that require surveying large numbers of markers within a population, such as those involving genetic mapping, breeding, and population genomics. We have modified the GBS technology in a number of ways. Custom, enzyme-specific adaptors have been replaced with standard Illumina adaptors compatible with blunt-end restriction enzymes. Multiplexing is achieved through a dual barcoding system, and bead-based library preparation protocols allow for in-solution size selection and eliminate the need for columns and gels.
A panel of eight restriction enzymes was selected for testing on B73 maize and Nipponbare rice genomic DNA. Quality of the data was demonstrated by identifying that the vast majority of reads from each enzyme aligned to restriction sites predicted in silico. The link between enzyme parameters and experimental outcome was demonstrated by showing that the sequenced portion of the genome was adaptable by selecting enzymes based on motif length, complexity, and methylation sensitivity. The utility of the new GBS protocol was demonstrated by correctly mapping several in a maize F2 population resulting from a B73 × Country Gentleman test cross.
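The in silico prediction of restriction sites and the size-selection step can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `digest` and `size_select` are hypothetical, the toy sequence is invented, and the GTAC motif cut in the middle (the blunt-end pattern of an enzyme such as RsaI) is not necessarily one of the eight enzymes tested in the study.

```python
def digest(sequence, motif, cut_offset):
    """Cut `sequence` at every occurrence of `motif`.

    The cut falls `cut_offset` bases into the motif. For a blunt-end
    enzyme both strands are cut at the same position, so one offset
    is enough to describe the cut.
    """
    fragments = []
    start = 0
    pos = sequence.find(motif)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
        pos = sequence.find(motif, pos + 1)
    fragments.append(sequence[start:])
    return fragments


def size_select(fragments, min_len, max_len):
    """Keep only fragments whose length lies in [min_len, max_len]."""
    return [f for f in fragments if min_len <= len(f) <= max_len]


# Toy example (invented sequence): motif GTAC, blunt cut after GT.
toy_genome = "AAGTACCCGTACTT"
fragments = digest(toy_genome, "GTAC", 2)
selected = size_select(fragments, 5, 10)
print(fragments)  # ['AAGT', 'ACCCGT', 'ACTT']
print(selected)   # ['ACCCGT']
```

Changing the motif length or the size window in this sketch changes how many fragments survive, which is the sense in which the sequenced portion of the genome depends on enzyme choice.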
This technology is readily adaptable to different genomes, highly amenable to multiplexing and compatible with over forty commercially available restriction enzymes. These advancements represent a major improvement in genotyping technology by providing a highly flexible and scalable GBS that is readily implemented for studies on genome-wide variation.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2164-15-979) contains supplementary material, which is available to authorized users.
Genotyping; GBS; Reduced representation sequencing; Population genomics; Trait mapping; Plant breeding; Agricultural genomics
Results from Genome-Wide Association Studies (GWAS) have shown that complex diseases are often affected by many genetic variants with small or moderate effects. Identification of these risk variants remains a very challenging problem. There is a need to develop more powerful statistical methods to leverage available information to improve upon traditional approaches that focus on a single GWAS dataset without incorporating additional data. In this paper, we propose a novel statistical approach, GPA (Genetic analysis incorporating Pleiotropy and Annotation), to increase statistical power to identify risk variants through joint analysis of multiple GWAS data sets and annotation information because: (1) accumulating evidence suggests that different complex diseases share common risk bases, i.e., pleiotropy; and (2) functionally annotated variants have been consistently demonstrated to be enriched among GWAS hits. GPA can integrate multiple GWAS datasets and functional annotations to seek association signals, and it can also perform hypothesis testing to test the presence of pleiotropy and enrichment of functional annotation. Statistical inference of the model parameters and SNP ranking is achieved through an EM algorithm that can handle genome-wide markers efficiently. When we applied GPA to jointly analyze five psychiatric disorders with annotation information, not only did GPA identify many weak signals missed by the traditional single phenotype analysis, but it also revealed relationships in the genetic architecture of these disorders. Using our hypothesis testing framework, statistically significant pleiotropic effects were detected among these psychiatric disorders, and the markers annotated in the central nervous system genes and eQTLs from the Genotype-Tissue Expression (GTEx) database were significantly enriched. We also applied GPA to a bladder cancer GWAS data set with the ENCODE DNase-seq data from 125 cell lines. GPA was able to detect cell lines that are biologically more relevant to bladder cancer. The R implementation of GPA is currently available at http://dongjunchung.github.io/GPA/.
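To make the idea of EM-based inference on marker-wise p-values concrete, the following is a minimal sketch for a single GWAS only. It is not the GPA package's code. It assumes a two-group mixture in which null p-values are Uniform(0, 1) and non-null p-values follow a Beta(α, 1) distribution with 0 < α < 1; the function name `fit_two_group` and the simulated data are hypothetical, and the pleiotropy and annotation components of the full method are omitted.

```python
import math
import random


def fit_two_group(pvalues, n_iter=200):
    """EM for p ~ (1 - pi1) * Uniform(0, 1) + pi1 * Beta(alpha, 1).

    Returns the estimated non-null proportion pi1, the shape alpha,
    and each marker's posterior probability of being non-null
    (taken from the final E-step).
    """
    pi1, alpha = 0.1, 0.5  # arbitrary starting values (assumption)
    logs = [math.log(max(p, 1e-300)) for p in pvalues]
    z = []
    for _ in range(n_iter):
        # E-step: posterior probability that each marker is non-null.
        # The Beta(alpha, 1) density is alpha * p**(alpha - 1).
        z = []
        for lp in logs:
            f1 = pi1 * alpha * math.exp((alpha - 1.0) * lp)
            z.append(f1 / (f1 + (1.0 - pi1)))
        # M-step: closed-form updates for pi1 and alpha.
        sum_z = sum(z)
        pi1 = sum_z / len(z)
        alpha = -sum_z / sum(zi * lp for zi, lp in zip(z, logs))
    return pi1, alpha, z


# Simulated p-values (invented): 20% non-null with alpha = 0.2.
# If U is Uniform(0, 1), then U ** (1 / alpha) is Beta(alpha, 1).
rng = random.Random(1)
pvals = [rng.random() ** 5 if rng.random() < 0.2 else rng.random()
         for _ in range(5000)]
pi1_hat, alpha_hat, posterior = fit_two_group(pvals)
print(pi1_hat, alpha_hat)
```

Markers can then be ranked by their posterior probabilities, which is one way to read "SNP ranking" in a model of this kind.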
In the past 10 years, many genome-wide association studies (GWAS) have been conducted to identify the genetic bases of complex human traits. As of January 2014, more than 12,000 single-nucleotide polymorphisms (SNPs) have been reported to be significantly associated with at least one complex trait/disease. On one hand, about 85% of identified risk variants are located in non-coding regions, which motivates a systematic understanding of the function of non-coding variants in regulatory elements in the human genome. On the other hand, complex diseases are often affected by many genetic variants with small or moderate effects. To address these issues, we propose a statistical approach, GPA, to integrate information from multiple GWAS datasets and functional annotation. Notably, our approach only requires marker-wise p-values as input, making it especially useful when only summary statistics, instead of the full genotype and phenotype data, are available. We applied GPA to analyze GWAS datasets of five psychiatric disorders and bladder cancer, where the central nervous system genes, eQTLs from the Genotype-Tissue Expression (GTEx), and the ENCODE DNase-seq data from 125 cell lines were used as functional annotation. The analysis results suggest that GPA is an effective method for integrative data analysis in the post-GWAS era.
CD4+ T cell differentiation is regulated by specialized antigen-presenting cells. Dendritic cells (DCs) produce cytokines that promote naive CD4+ T cell differentiation into T helper 1 (Th1), Th17, and inducible T regulatory (iTreg) cells. However, the initiation of Th2 cell responses remains poorly understood, although it is likely that more than one mechanism might be involved. Here we have defined a specific DC subset that is involved in Th2 cell differentiation in vivo in response to a protease allergen, as well as infection with Nippostrongylus brasiliensis. We have demonstrated that this subset is controlled by the transcription factor interferon regulatory factor 4 (IRF4), which is required for their differentiation and Th2 cell-inducing function. IRF4 is known to control Th2 cell differentiation and Th2 cell-associated suppressing function in Treg cells. Our finding suggests that IRF4 also plays a role in DCs where it controls the initiation of Th2 cell responses.
Motivation: MicroRNAs (miRNAs) play a crucial role in tumorigenesis and development through their effects on target genes. The characterization of miRNA–gene interactions will lead to a better understanding of cancer mechanisms. Many computational methods have been developed to infer miRNA targets with/without expression data. Because expression datasets are in general limited in size, most existing methods concatenate datasets from multiple studies to form one aggregated dataset to increase sample size and power. However, such simple aggregation analysis results in identifying miRNA–gene interactions that are mostly common across datasets, whereas specific interactions may be missed by these methods. Recent releases of The Cancer Genome Atlas data provide paired expression profiling of miRNAs and genes in multiple tumors with sufficiently large sample size. To study both common and cancer-specific interactions, it is desirable to develop a method that can jointly analyze multiple cancers to study miRNA–gene interactions without combining all the data into one single dataset.
Results: We developed a novel statistical method to jointly analyze expression profiles from multiple cancers to identify miRNA–gene interactions that are both common across cancers and specific to certain cancers. The benefit of this joint analysis approach is demonstrated by both simulation studies and real data analysis of The Cancer Genome Atlas datasets. Compared with simple aggregate analysis or single sample analysis, our method can effectively use the shared information among different but related cancers to improve the identification of miRNA–gene interactions. Another useful property of our method is that it can estimate similarity among cancers through their shared miRNA–gene interactions.
Availability and implementation: The program, MCMG, implemented in R is available at http://bioinformatics.med.yale.edu/group/.
Recent developments in next-generation sequencing technologies have led to rapid accumulation of 16S rRNA sequences for microbiome profiling. One key step in data processing is to cluster short sequences into operational taxonomic units (OTUs). Although many methods have been proposed for OTU inference, a major challenge is the balance between inference accuracy and computational efficiency, where inference accuracy is often sacrificed to accommodate the need to analyze large numbers of sequences. Inspired by the hierarchical clustering method and a modified greedy network clustering algorithm, we propose a novel multi-seeds based heuristic clustering method, named MSClust, for OTU inference. MSClust first adaptively selects multi-seeds instead of one seed for each candidate cluster, and the reads are then processed using a greedy clustering strategy. Through many numerical examples, we demonstrate that MSClust uses less memory and achieves better biological accuracy than existing heuristic clustering methods while preserving efficiency and scalability.
Clustering Algorithms; Operational Taxonomy Unit (OTU); Next-generation Sequencing; Seeds-Selection; 16S rRNA Reads
Autophagy activity is essential for the survival of neural cells. Impairment of autophagy has been implicated in the pathogenesis of neurodegenerative disorders. Unlike the massive neuron loss in mice deficient for autophagy genes essential for autophagosome formation, we demonstrated that mice deficient for the metazoan-specific autophagy gene Epg5 develop selective neuronal damage and exhibit key characteristics of amyotrophic lateral sclerosis. Epg5 deficiency blocks the maturation of autophagosomes into degradative autolysosomes, slows endocytic degradation and also impairs endocytic recycling. Recessive mutations in human EPG5 have recently been causally associated with the multisystem disorder Vici syndrome. Here we show that while Epg5 knockout mice display some features of Vici syndrome, many phenotypes are absent.
autophagy; autophagosome; Epg5; Vici syndrome; neurodegeneration
Current research suggests that a small set of “driver” mutations are responsible for tumorigenesis while a larger body of “passenger” mutations occur in the tumor but do not progress the disease. Due to recent pharmacological successes in treating cancers caused by driver mutations, a variety of methodologies that attempt to identify such mutations have been developed. Based on the hypothesis that driver mutations tend to cluster in key regions of the protein, the development of cluster identification algorithms has become critical.
We have developed a novel methodology, SpacePAC (Spatial Protein Amino acid Clustering), that identifies mutational clustering by considering the protein tertiary structure directly in 3D space. By combining the mutational data in the Catalogue of Somatic Mutations in Cancer (COSMIC) and the spatial information in the Protein Data Bank (PDB), SpacePAC is able to identify novel mutation clusters in many proteins such as FGFR3 and CHRM2. In addition, SpacePAC is better able to localize the most significant mutational hotspots as demonstrated in the cases of BRAF and ALK. The R package is available on Bioconductor at: http://www.bioconductor.org/packages/release/bioc/html/SpacePAC.html.
SpacePAC adds a valuable tool to the identification of mutational clusters while considering protein tertiary structure.
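The idea of looking for mutational clusters directly in 3D space can be sketched as follows. This is an illustrative simplification, not the SpacePAC package's algorithm: it centers one sphere of fixed radius on each residue, counts the mutations falling inside, and compares the best count with counts obtained after scattering the same number of mutations at random. The function names, the hairpin-shaped toy coordinates, and the mutation counts are all invented for the example.

```python
import math
import random


def best_sphere(coords, counts, radius):
    """Return (center residue index, mutations inside) for the sphere
    of the given radius that captures the most mutations."""
    totals = []
    for center in coords:
        inside = sum(c for xyz, c in zip(coords, counts)
                     if math.dist(xyz, center) <= radius)
        totals.append(inside)
    best = max(range(len(coords)), key=totals.__getitem__)
    return best, totals[best]


def permutation_pvalue(coords, counts, radius, n_perm=199, seed=0):
    """Share of random mutation placements whose best sphere holds at
    least as many mutations as the observed best sphere."""
    rng = random.Random(seed)
    observed = best_sphere(coords, counts, radius)[1]
    n_mut = sum(counts)
    hits = 0
    for _ in range(n_perm):
        shuffled = [0] * len(counts)
        for _ in range(n_mut):
            shuffled[rng.randrange(len(counts))] += 1
        if best_sphere(coords, shuffled, radius)[1] >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy hairpin: residues 0-4 run along x, residues 5-9 fold back 5 units
# away, so residues 0 and 9 are far apart in sequence but close in space.
coords = ([(3.8 * i, 0.0, 0.0) for i in range(5)]
          + [(3.8 * (9 - i), 5.0, 0.0) for i in range(5, 10)])
counts = [4, 0, 0, 0, 0, 1, 0, 0, 0, 5]  # invented mutation counts
best = best_sphere(coords, counts, 6.0)
p_value = permutation_pvalue(coords, counts, 6.0)
print(best)  # (0, 9)
```

In the toy data the mutations at residues 0 and 9 fall into the same sphere even though they sit at opposite ends of the sequence, which a purely linear clustering window would miss.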
Statistical modeling coupled with bioinformatics is commonly used for drug discovery. Although there exist many approaches for single target based drug design and target inference, recent years have seen a paradigm shift to system-level pharmacological research. Pathway analysis of genomics data represents one promising direction for computational inference of drug targets. This article aims at providing a comprehensive review on the evolving issues in this field, covering methodological developments, their pros and cons, as well as future research directions.
Drug target inference; pathway analysis; genomics; statistical modeling; factor model; data mining; optimization
High-throughput sequencing technology allows researchers to test associations between phenotypes and all the variants identified throughout the genome, and is especially useful for analyzing rare variants. However, the statistical power to identify phenotype-associated rare variants is very low with typical genome-wide association studies because of their low allele frequencies among unrelated individuals. In contrast, a family-based design may have more power because rare variants are more likely to be enriched in families than among unrelated individuals. Regardless, an analysis of family-based association studies needs to account appropriately for relatedness between family members. We analyzed the observed quantitative trait systolic blood pressure as well as the simulated Q1 data in the Genetic Analysis Workshop 18 data set using 4 tests: (a) a single-variant test, (b) a collapsing test, (c) a single-variant test where familial relatedness was accounted for, and (d) a collapsing test where familial relatedness was accounted for. We then compared the results of the 4 methods and observed that adjusting for familial relatedness could appropriately control the false-positive rate while maintaining reasonable power to detect several strongly associated variants/genes.
Genetic Analysis Workshop 18 provided a platform for evaluating genomic prediction power based on single-nucleotide polymorphisms from single-nucleotide polymorphism array data and sequencing data. Also, Genetic Analysis Workshop 18 provided a diverse pedigree structure to be explored in prediction. In this study, we attempted to combine pedigree information with single-nucleotide polymorphism data to predict systolic blood pressure. Our results suggested that the prediction power based on pedigree information only could be unsatisfactory. Using additional information such as single-nucleotide polymorphism genotypes would improve prediction accuracy. In particular, the improvement can be significant when there exist a few single-nucleotide polymorphisms with relatively larger effect sizes. We also compared the prediction performance based on genome-wide association study data (i.e., common variants) and sequencing data (i.e., common variants plus low-frequency variants). The experimental result showed that inclusion of low-frequency variants did not improve prediction accuracy.
Admixture mapping is a disease-mapping strategy to identify disease susceptibility variants in an admixed population that is a result of mating between 2 historically separated populations differing in allele frequencies and disease prevalence. With the increasing availability of high-density genotyping data generated in genome-wide association studies, it is of interest to investigate how to apply admixture mapping in the context of the genome-wide association studies and how to adjust for admixture in association tests. In this study, we first evaluated 3 different local ancestry inference methods, LAMP, LAMP-LD, and MULTIMIX. Then we applied admixture mapping analysis based on estimated local ancestry. Finally, we performed association tests with adjustment for local ancestry.
We consider an Empirical Bayes method to correct for the Winner's Curse phenomenon in genome-wide association studies. Our method utilizes the collective distribution of all odds ratios (ORs) to determine the appropriate correction for a particular single-nucleotide polymorphism (SNP). We can show that this approach is squared error optimal provided that this collective distribution is accurately estimated in its tails. To improve the performance when correcting the OR estimates for the most highly associated SNPs, we develop a second estimator that adaptively combines the Empirical Bayes estimator with a previously considered Conditional Likelihood estimator. The applications of these methods to both simulated and real data suggest improved performance in reducing selection bias.
GWAS; Empirical Bayes; Winner's Curse
With recent advances in sequencing, genotyping arrays, and imputation, GWAS now aim to identify associations with rare and uncommon genetic variants. Here, we describe and evaluate a class of statistics, generalized score statistics (GSS), that can test for an association between a group of genetic variants and a phenotype. GSS are a simple weighted sum of single-variant statistics and their cross-products. We show that the majority of statistics currently used to detect associations with rare variants are equivalent to choosing a specific set of weights within this framework. We then evaluate the power of various weighting schemes as a function of variant characteristics, such as MAF, the proportion associated with the phenotype, and the direction of effect. Ultimately, we find that two classical tests are robust and powerful, but details are provided as to when other GSS may perform favorably. The software package CRaVe is available at our website (http://dceg.cancer.gov/bb/tools/crave).
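The description of GSS as a weighted sum of single-variant statistics and their cross-products can be written as a quadratic form, Q = Σ_j Σ_k w_jk U_j U_k, where U_j is the score statistic of variant j. The sketch below is an illustration under stated assumptions, not the CRaVe implementation: the statistics are unstandardized, the function names and toy data are invented, and the two weight matrices shown (a burden-type choice and a variance-component-type choice) are standard examples rather than a claim about which tests the abstract calls classical.

```python
def score_statistics(genotypes, phenotypes):
    """Unstandardized score U_j = sum_i (y_i - mean(y)) * g_ij."""
    mean_y = sum(phenotypes) / len(phenotypes)
    n_variants = len(genotypes[0])
    return [sum((y - mean_y) * row[j]
                for row, y in zip(genotypes, phenotypes))
            for j in range(n_variants)]


def generalized_score(u, weights):
    """Q = sum_j sum_k weights[j][k] * U_j * U_k."""
    m = len(u)
    return sum(weights[j][k] * u[j] * u[k]
               for j in range(m) for k in range(m))


def burden_weights(a):
    """w_jk = a_j * a_k, so Q = (sum_j a_j U_j) ** 2 and cross-products
    are included; effects in opposite directions can cancel."""
    return [[aj * ak for ak in a] for aj in a]


def variance_component_weights(a):
    """w_jj = a_j ** 2 and w_jk = 0 otherwise, so Q = sum_j (a_j U_j) ** 2
    and the direction of each effect does not matter."""
    m = len(a)
    return [[a[j] ** 2 if j == k else 0.0 for k in range(m)]
            for j in range(m)]


# Toy data (invented): 6 individuals, 3 rare variants, binary phenotype.
genotypes = [[1, 0, 0], [0, 1, 0], [0, 0, 1],
             [0, 0, 0], [0, 0, 0], [1, 0, 0]]
phenotypes = [1, 1, 0, 0, 0, 1]
a = [1.0, 1.0, 1.0]

u = score_statistics(genotypes, phenotypes)
q_burden = generalized_score(u, burden_weights(a))
q_vc = generalized_score(u, variance_component_weights(a))
print(u, q_burden, q_vc)  # [1.0, 0.5, -0.5] 1.0 1.5
```

In the toy data the third variant points in the opposite direction, so it reduces the burden-type statistic but still adds to the variance-component-type statistic, which is the kind of dependence on direction of effect the abstract refers to.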
rare variants; score test; GWAS; association test
Multiple Reaction Monitoring (MRM) conducted on a triple quadrupole mass spectrometer allows researchers to quantify the expression levels of a set of target proteins. Each protein is often characterized by several unique peptides that can be detected by monitoring predetermined fragment ions, called transitions, for each peptide. Concatenating large numbers of MRM transitions into a single assay enables simultaneous quantification of hundreds of peptides and proteins. In recognition of the important role that MRM can play in hypothesis-driven research and its increasing impact on clinical proteomics, targeted proteomics such as MRM was recently selected as the Nature Methods Method of the Year. However, there are many challenges in MRM applications, especially data pre-processing, where many steps still rely on manual inspection of each observation in practice. In this paper, we discuss an analysis pipeline to automate MRM data pre-processing. This pipeline includes data quality assessment across replicated samples, outlier detection, identification of inaccurate transitions, and data normalization. We demonstrate the utility of our pipeline through its applications to several real MRM data sets.
multiple reaction monitoring; label-free; quality assessment; data normalization; proteomics; peptide; transition
Paraquat (PQ), a widely used herbicide, is well known to induce oxidative stress and lung injury. In the present study, we investigated the possible mechanisms by which cannabinoid receptor-2 (CB2) activation ameliorates the proinflammatory activity induced by PQ in rats. JWH133, a CB2 agonist, was administered by intraperitoneal injection 1 h prior to PQ exposure. After PQ exposure for 4, 8, 24, and 72 h, the bronchoalveolar lavage fluid (BALF) was collected to determine levels of TNF-α and IL-1β, and the arterial blood samples were collected for detection of PaO2 level. At 72 h after PQ exposure, lung tissues were collected to determine the lung wet-to-dry weight ratios, myeloperoxidase (MPO) activity, lung histopathology, and the protein expression levels of CB2, MAPKs (ERK1/2, p38MAPK, and JNK1/2), and NF-κBp65. After rats were pretreated with JWH133, PQ-induced lung edema and lung histopathological changes were significantly attenuated. PQ-induced TNF-α and IL-1β secretion in BALF, increases of PaO2 in arterial blood, and MPO levels in the lung tissue were significantly reduced. JWH133 could efficiently activate CB2, while inhibiting MAPKs and NF-κB activation. The results suggested that activating the CB2 receptor exerted protective activity against PQ-induced acute lung injury (ALI), and potentially contributed to suppressing the activation of the MAPK and NF-κB pathways.
Innate immune recognition is critical for the induction of adaptive immune responses; however the underlying mechanisms remain incompletely understood. In this study, we demonstrate that T cell-specific deletion of the IL-6 receptor α chain (IL-6Rα) results in impaired Th1 and Th17 T cell responses in vivo, and a defect in Tfh function. Depletion of Tregs in these mice rescued the Th1 but not the Th17 response. Our data suggest that IL-6 signaling in effector T cells is required to overcome Treg-mediated suppression in vivo. We show that IL-6 cooperates with IL-1β to block the suppressive effect of Tregs on CD4+ T cells, at least in part by controlling their responsiveness to IL-2. In addition, although IL-6Rα-deficient T cells mount normal primary Th1 responses in the absence of Tregs, they fail to mature into functional memory cells, demonstrating a key role for IL-6 in CD4+ T cell memory formation.
The human body's ability to defend itself against pathogens relies on two distinct but connected systems: the innate and the adaptive immune systems. Innate immune cells survey their environment and use receptors located on their surface to distinguish between molecules that are harmless and molecules that stem from pathogens. When the cells of the innate immune system detect a pathogen, they secrete signaling molecules to alert adaptive immune cells to the invaders. Both sets of immune cells then mount a coordinated attack that usually kills the pathogen.
The adaptive immune system also produces memory cells that retain information about the pathogen: this allows the organism to mount a fast and efficient immune response the next time the same type of pathogen strikes. However, it is not completely understood how the innate immune system communicates with the adaptive immune system to allow these processes to take place.
One of the signaling molecules involved in the communication between different types of immune cells is a protein called Interleukin 6 (IL-6). This protein must be produced in order to trigger the immune response: however, many immune cells are able to recognize and respond to IL-6, so it has been difficult to study its impact on specific cell types.
Nish et al. have now investigated the effects of IL-6 on T cells, one of the main types of adaptive immune cell, by creating mice with T cells that are not able to recognize IL-6. The detection of pathogens by innate immune cells normally has several effects: the population of T cells increases; the T cells produce daughter cells—T helper cells—that support innate immune cells in killing pathogens; and memory cells are formed. Nish et al. find that these responses are impaired in the mutant mice.
To understand why, Nish et al. turn to T regulatory cells; these are adaptive immune cells that control the strength of the immune response. These experiments show that when T cells are ‘blind’ to IL-6, they are more sensitive to the action of T regulatory cells, and this disturbs the delicate balance between the stimulation and inhibition of the immune system. Nish et al. go on to show that IL-6 works together with another signaling molecule, Interleukin 1, to regulate how the T cells respond. The work helps to explain how the adaptive immune system mounts an immune response against pathogens but not against the host's own tissues.
cytokines; T cells; regulatory T cells; memory; mouse
A total of 310 Salmonella isolates were collected from 6 broiler farms in Eastern China and serotyped according to the Kauffmann-White classification. All isolates were examined for susceptibility to 17 commonly used antimicrobial agents; representative isolates were examined for resistance genes and class 1 integrons using PCR. Clonality was determined by pulsed-field gel electrophoresis (PFGE). Two serotypes were detected among the 310 Salmonella strains: 133 Salmonella enterica serovar Indiana isolates and 177 Salmonella enterica serovar Enteritidis isolates. Antimicrobial sensitivity results showed that the isolates were generally resistant to sulfamethoxazole, ampicillin, tetracycline, doxycycline and trimethoprim, and 95% of the isolates were sensitive to amikacin and polymyxin. Among all Salmonella enterica serovar Indiana isolates, 108 (81.2%) possessed the blaTEM, floR, tetA, strA and aac(6')-Ib-cr resistance genes. The detected carriage rate of class 1 integrons was 66.5% (206/310), with 6 strains carrying the integron gene cassette dfr17-aadA5. The increasing multidrug resistance rate in Salmonella was associated with increasing prevalence of int1 genes (rs = 0.938, P = 0.00039). The int1, blaTEM, floR, tetA, strA and aac(6')-Ib-cr positive Salmonella enterica serovar Indiana isolates showed five major patterns as determined by PFGE. Most isolates exhibited the common PFGE patterns found from the chicken farms, suggesting that many multidrug-resistant isolates of Salmonella enterica serovar Indiana prevailed in these sources. Some isolates with similar antimicrobial resistance patterns represented a variety of Salmonella enterica serovar Indiana genotypes, and were derived from different clones.
Next Generation Sequencing (NGS) has revolutionized biomedical research in recent years.
It is now commonly used to identify rare variants through re-sequencing individual genomes. Due to
the cost of NGS, researchers have considered pooling samples as a cost-effective alternative to
individual sequencing. In this article, we consider estimating the allele frequencies of rare
variants from NGS data on pooled DNA samples, with or without barcodes. We compare the performance
of three estimators, based respectively on raw sequencing counts, inferred genotypes, and expected
minor allele counts. Our simulation results suggest that the estimator based on inferred genotypes
performs as well as or better than the other two estimators overall. When the sequencing coverage
is low, the biases and mean squared errors (MSEs) of the estimators based on inferred genotypes and
expected minor allele counts can be sensitive to the choice of prior genotype probabilities, so
accurate specification of these priors is critical for lowering biases and MSEs. Our study shows that the optimal number of
barcodes in a pool is relatively robust to the frequencies of rare variants at a specific coverage
depth. We provide general guidelines on using DNA pooling with barcoding for the estimation of
allele frequencies of rare variants.
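As a rough illustration of the three estimators compared above, the following sketch computes each one for a pool of barcoded samples at a single biallelic site. It is a minimal example under stated assumptions, not the procedure used in the study: the binomial read model, the per-read error rate `error_rate`, the Hardy-Weinberg prior built by `hwe_prior`, and all function names are hypothetical choices made here for illustration.

```python
from math import comb


def hwe_prior(freq):
    """Assumed Hardy-Weinberg prior over genotypes with 0, 1, 2 minor-allele copies."""
    return ((1.0 - freq) ** 2, 2.0 * freq * (1.0 - freq), freq ** 2)


def genotype_posterior(alt_reads, total_reads, prior, error_rate=0.01):
    """Posterior over genotypes (0, 1, 2 copies) for one barcoded sample.

    Assumes each read independently shows the minor allele with probability
    error_rate, 0.5, or 1 - error_rate for genotypes 0, 1, 2 respectively.
    """
    p_alt = (error_rate, 0.5, 1.0 - error_rate)
    likelihood = [
        comb(total_reads, alt_reads)
        * p ** alt_reads
        * (1.0 - p) ** (total_reads - alt_reads)
        for p in p_alt
    ]
    weighted = [lik * pr for lik, pr in zip(likelihood, prior)]
    total = sum(weighted)
    return [w / total for w in weighted]


def estimate_frequencies(samples, prior, error_rate=0.01):
    """Return three allele-frequency estimates for a pool.

    samples: list of (alt_reads, total_reads), one pair per barcoded sample.
    The pool must contain at least one read overall.
    """
    n = len(samples)
    # Estimator 1: fraction of all reads that carry the minor allele.
    raw = sum(a for a, _ in samples) / sum(t for _, t in samples)
    inferred_copies = 0
    expected_copies = 0.0
    for alt, tot in samples:
        post = genotype_posterior(alt, tot, prior, error_rate)
        # Estimator 2: count copies of the most probable genotype.
        inferred_copies += max(range(3), key=lambda g: post[g])
        # Estimator 3: posterior mean number of minor-allele copies.
        expected_copies += post[1] + 2.0 * post[2]
    return {
        "raw_counts": raw,
        "inferred_genotypes": inferred_copies / (2 * n),
        "expected_counts": expected_copies / (2 * n),
    }


# Illustrative (made-up) pool: nine samples with no minor-allele reads and
# one sample with half of its 20 reads showing the minor allele.
samples = [(0, 20)] * 9 + [(10, 20)]
estimates = estimate_frequencies(samples, hwe_prior(0.05))
print(estimates)
```

Changing the assumed frequency passed to `hwe_prior`, or lowering the read counts, shows how the two posterior-based estimates can shift with the prior when coverage is low.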
Whole-exome sequencing studies in autism spectrum disorder (ASD) have identified de novo mutations in novel candidate genes, including the synaptic gene Eighty-five Requiring 3A (EFR3A). EFR3A is a critical component of a protein complex required for the synthesis of the phosphoinositide PtdIns4P, which has a variety of functions at the neural synapse. We hypothesized that deleterious mutations in EFR3A would be significantly associated with ASD.
We conducted a large case/control association study by deep resequencing and analysis of whole-exome data for coding and splice site variants in EFR3A. We determined the potential impact of these variants on protein structure and function by a variety of conservation measures and analysis of the Saccharomyces cerevisiae Efr3 crystal structure. We also analyzed the expression pattern of EFR3A in human brain tissue.
Rare nonsynonymous mutations in EFR3A were more common among cases (16 / 2,196 = 0.73%) than matched controls (12 / 3,389 = 0.35%) and were statistically more common at conserved nucleotides based on an experiment-wide significance threshold (P = 0.0077, permutation test). Crystal structure analysis revealed that mutations likely to be deleterious were also statistically more common in cases than controls (P = 0.017, Fisher exact test). Furthermore, EFR3A is expressed in cortical neurons, including pyramidal neurons, during human fetal brain development in a pattern consistent with ASD-related genes, and it is strongly co-expressed (P < 2.2 × 10^−16, Wilcoxon test) with a module of genes significantly associated with ASD.
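The carrier proportions quoted above follow directly from the reported counts. The short sketch below recomputes them and adds an odds ratio as one simple effect-size summary. The odds ratio is an addition made here for illustration; it assumes each count is a number of distinct carriers, and it does not reproduce the permutation or Fisher exact tests reported in the text. The function name `carrier_summary` is hypothetical.

```python
def carrier_summary(case_carriers, n_cases, control_carriers, n_controls):
    """Return (case rate, control rate, odds ratio) for a 2x2 carrier table."""
    case_rate = case_carriers / n_cases
    control_rate = control_carriers / n_controls
    # Odds ratio: odds of carrying a variant in cases over odds in controls.
    odds_ratio = (case_carriers * (n_controls - control_carriers)) / (
        control_carriers * (n_cases - case_carriers)
    )
    return case_rate, control_rate, odds_ratio


# Counts as reported: 16 of 2,196 cases and 12 of 3,389 controls.
case_rate, control_rate, odds_ratio = carrier_summary(16, 2196, 12, 3389)
print(f"cases: {case_rate:.2%}, controls: {control_rate:.2%}, OR: {odds_ratio:.2f}")
```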
Rare deleterious mutations in EFR3A were found to be associated with ASD using an experiment-wide significance threshold. Synaptic phosphoinositide metabolism has been strongly implicated in syndromic forms of ASD. These data for EFR3A strengthen the evidence for the involvement of this pathway in idiopathic autism.
Autism spectrum disorder; Genetics; Rare variants; EFR3A; Synapse; Phosphoinositide metabolism
Motivation: Expression quantitative trait loci (eQTL) studies investigate how gene expression levels are affected by DNA variants. A major challenge in inferring eQTL is that a number of factors, such as unobserved covariates, experimental artifacts and unknown environmental perturbations, may confound the observed expression levels. This may both mask real associations and lead to spurious association findings.
Results: In this article, we introduce a LOw-Rank representation to account for confounding factors and make use of Sparse regression for eQTL mapping (LORS). We integrate the low-rank representation and sparse regression into a unified framework, in which single-nucleotide polymorphisms and gene probes can be jointly analyzed. Given the two model parameters, our formulation is a convex optimization problem. We have developed an efficient algorithm to solve this problem, and its convergence is guaranteed. We demonstrate its ability to account for non-genetic effects using simulation, and then apply it to two independent real datasets. Our results indicate that LORS is an effective tool to account for non-genetic effects. First, our detected associations show higher consistency between studies than those detected by recently proposed methods. Second, we have identified some new hotspots that cannot be identified without accounting for non-genetic effects.
Availability: The software is available at: http://bioinformatics.med.yale.edu/software.aspx.
Supplementary data are available at Bioinformatics online.
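A formulation of the kind described in the Results above could be written as follows. This is a sketch under assumed notation, none of which appears in the abstract: Y is the n × q matrix of expression levels, X the n × p genotype matrix, B the p × q matrix of SNP effects, μ a 1 × q row of intercepts, L an n × q low-rank matrix standing in for confounding effects, and ρ and λ the two tuning parameters.

```latex
\min_{B,\,\mu,\,L}\;
  \tfrac{1}{2}\,\bigl\lVert Y - XB - \mathbf{1}\mu - L \bigr\rVert_F^2
  \;+\; \rho\,\lVert B \rVert_1
  \;+\; \lambda\,\lVert L \rVert_*
```

For fixed ρ ≥ 0 and λ ≥ 0, every term is convex in (B, μ, L), which is consistent with the statement that the problem is convex once the two parameters are given. The ℓ1 penalty encourages sparse SNP-gene associations, and the nuclear norm penalty encourages a low-rank L.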
Most statistical methods for microarray data analysis consider one gene at a time, and they may miss subtle changes at the single gene level. This limitation may be overcome by considering a set of genes simultaneously, where the gene sets are derived from prior biological knowledge. We define a pathway as a predefined set of genes that serve a particular cellular or physiological function. Limited work has been done in the regression setting to study the effects of clinical covariates and expression levels of genes in a pathway on a continuous clinical outcome. A semiparametric regression approach for identifying pathways related to a continuous outcome was proposed by Liu et al. (2007), who demonstrated the connection between a least squares kernel machine for nonparametric pathway effect and a restricted maximum likelihood (REML) for variance components. However, the asymptotic properties of semiparametric regression for identifying pathways have not been studied. In this paper, we study the asymptotic properties of the parameter estimates in semiparametric regression and compare Liu et al.’s REML with our REML obtained from a profile likelihood. We prove that both approaches provide consistent estimators, have a √n convergence rate under regularity conditions, and have either an asymptotically normal distribution or a mixture of normal distributions. However, the estimators based on our REML obtained from a profile likelihood have a theoretically smaller mean squared error than those of Liu et al.’s REML. A simulation study supports this theoretical result. A profile restricted likelihood ratio test is also provided for the non-standard testing problem. We apply our approach to a type II diabetes data set (Mootha et al., 2003).
Gaussian random process; Kernel machine; Mixed model; Pathway analysis; Profile likelihood; Restricted maximum likelihood
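For orientation, the kernel machine model of Liu et al. (2007) referred to above is usually written in the following form. The notation is an assumption made here for illustration, not taken from the abstract: y_i is the continuous outcome, x_i the clinical covariates, z_i the expression levels of the genes in a pathway, h an unknown function with kernel matrix K, and τ and σ² the variance components.

```latex
\begin{align}
  y_i &= x_i^{\top}\beta + h(z_i) + e_i,
      \qquad e_i \sim N(0, \sigma^2), \quad i = 1, \dots, n, \\
  \mathbf{y} &= X\beta + \mathbf{h} + \mathbf{e},
      \qquad \mathbf{h} \sim N(0, \tau K), \quad \mathbf{e} \sim N(0, \sigma^2 I).
\end{align}
```

Under the mixed-model representation in the second line, testing for a pathway effect amounts to testing H_0: τ = 0, a value on the boundary of the parameter space, which is why the testing problem is non-standard.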