RNA viruses exist in their hosts as populations of different but related strains. The virus population, often called quasispecies, is shaped by a combination of genetic change and natural selection. Genetic change is due to both point mutations and recombination events. We present a jumping hidden Markov model that describes the generation of viral quasispecies and a method to infer its parameters from next-generation sequencing data. The model introduces position-specific probability tables over the sequence alphabet to explain the diversity that can be found in the population at each site. Recombination events are indicated by a change of state, allowing a single observed read to originate from multiple sequences. We present a specific implementation of the expectation maximization (EM) algorithm to find maximum a posteriori estimates of the model parameters and a method to estimate the distribution of viral strains in the quasispecies. The model is validated on simulated data, showing the advantage of explicitly taking the recombination process into account, and applied to reads obtained from a clinical HIV sample.
evolution; HMM; statistical models; viruses
An important component in the analysis of genome-wide association studies involves the imputation of genotypes that have not been measured directly in the studied samples. The imputation procedure uses the linkage disequilibrium (LD) structure in the population to infer the genotype of an unobserved single nucleotide polymorphism. The LD structure is normally learned from a dense genotype map of a reference population that matches the studied population. In many instances there is no reference population that exactly matches the studied population, and a natural question arises as to how to choose the reference population for the imputation. Here we present a Coalescent-based method that addresses this issue. In contrast to the current paradigm of imputation methods, our method assigns a different reference dataset for each sample in the studied population, and for each region in the genome. This allows the flexibility to account for the diversity within populations, as well as across populations. Furthermore, because our approach treats each region in the genome separately, our method is suitable for the imputation of recently admixed populations. We evaluated our method across a large set of populations and found that our choice of reference data set considerably improves the accuracy of imputation, especially for regions with low LD and for populations without a reference population available as well as for admixed populations such as the Hispanic population. Our method is generic and can potentially be incorporated in any of the available imputation methods as an add-on.
genotype imputation; coalescent; GWAS; linkage disequilibrium; weighted panel
Characterizing genetic diversity within and between populations has broad applications in studies of human disease and evolution. We propose a new approach, spatial ancestry analysis, for the modeling of genotypes in two- or three-dimensional space. In spatial ancestry analysis (SPA), we explicitly model the spatial distribution of each SNP by assigning an allele frequency as a continuous function in geographic space. We show that the explicit modeling of the allele frequency allows individuals to be localized on the map on the basis of their genetic information alone. We apply our SPA method to a European and a worldwide population genetic variation data set and identify SNPs showing large gradients in allele frequency, and we suggest these as candidate regions under selection. These regions include SNPs in the well-characterized LCT region, as well as at loci including FOXP2, OCA2 and LRP1B.
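To make the idea of a continuous allele-frequency surface concrete, here is a minimal sketch. The abstract only says "continuous function", so the logistic form, the function names and the toy coefficients are all assumptions made for illustration. The sketch places an individual on a two-dimensional map by a grid search over the genotype log-likelihood.

```python
import math

def allele_freq(coef, pos):
    """Assumed logistic allele-frequency surface evaluated at 2-D position pos."""
    a1, a2, b = coef
    return 1.0 / (1.0 + math.exp(-(a1 * pos[0] + a2 * pos[1] + b)))

def log_likelihood(genotypes, coefs, pos):
    """Binomial log-likelihood of genotypes (0/1/2 allele copies) at pos.

    The binomial coefficient is dropped because it does not depend on pos.
    """
    total = 0.0
    for g, coef in zip(genotypes, coefs):
        f = allele_freq(coef, pos)
        total += g * math.log(f) + (2 - g) * math.log(1.0 - f)
    return total

def localize(genotypes, coefs, grid_steps=21, lo=-1.0, hi=1.0):
    """Grid search for the map position that maximizes the log-likelihood."""
    best_pos, best_ll = None, -math.inf
    for i in range(grid_steps):
        for j in range(grid_steps):
            pos = (lo + (hi - lo) * i / (grid_steps - 1),
                   lo + (hi - lo) * j / (grid_steps - 1))
            ll = log_likelihood(genotypes, coefs, pos)
            if ll > best_ll:
                best_pos, best_ll = pos, ll
    return best_pos

# Hypothetical SNPs: the first varies along x, the second along y.
toy_coefs = [(4.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
print(localize([2, 0], toy_coefs))
```

With real data the per-SNP coefficients would have to be estimated first, and a gradient-based optimizer would likely replace the grid search.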
Motivation: It is becoming increasingly evident that the analysis of genotype data from recently admixed populations is providing important insights into medical genetics and population history. Such analyses have been used to identify novel disease loci, to understand recombination rate variation and to detect recent selection events. The utility of such studies crucially depends on accurate and unbiased estimation of the ancestry at every genomic locus in recently admixed populations. Although various methods have been proposed and shown to be extremely accurate in two-way admixtures (e.g. African Americans), only a few approaches have been proposed and thoroughly benchmarked on multi-way admixtures (e.g. Latino populations of the Americas).
Results: To address these challenges we introduce here methods for local ancestry inference which leverage the structure of linkage disequilibrium in the ancestral population (LAMP-LD), and incorporate the constraint of Mendelian segregation when inferring local ancestry in nuclear family trios (LAMP-HAP). Our algorithms uniquely combine hidden Markov models (HMMs) of haplotype diversity within a novel window-based framework to achieve superior accuracy as compared with published methods. Further, unlike previous methods, the structure of our HMM does not depend on the number of reference haplotypes but on a fixed constant, and it is thereby capable of utilizing large datasets while remaining highly efficient and robust to over-fitting. Through simulations and analysis of real data from 489 nuclear trio families from the mainland US, Puerto Rico and Mexico, we demonstrate that our methods achieve superior accuracy compared with published methods for local ancestry inference in Latinos.
Supplementary data are available at Bioinformatics online.
Polymorphisms in chemokine genes have been associated with human immunodeficiency virus (HIV)-related non-Hodgkin lymphoma (NHL) but are understudied in non-HIV-related NHL. Associations of NHL and NHL subtypes with polymorphisms and haplotypes in CCR5, CCR2, CCL5, CXCL12 and CX3CR1 were explored in a pooled analysis of three case-control studies (San Francisco Bay Area, California; United Kingdom; total: cases N=1610, controls N=1992). Adjusted unconditional logistic regression was used to estimate relative risks among HIV-negative non-Hispanic Caucasians. The CCR5 Δ32 deletion reduced the risk of NHL (odds ratio=0.56, 95% confidence interval=0.38-0.83) in men but not women, with similar effects observed for diffuse large-cell and follicular lymphoma (FL). NHL risk also was reduced in men with the CCR2/CCR5 haplotype characterized by the CCR5 Δ32 deletion. The CCL5 −403A allele conferred reduced risks of FL and chronic lymphocytic leukemia/small lymphocytic lymphoma. Results should be interpreted conservatively. Continued investigation is warranted to confirm these findings.
Lymphoma non-Hodgkin; Chemokines; Polymorphism, genetic; Case-Control
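For readers unfamiliar with how an odds ratio and its 95% confidence interval are obtained, the following sketch computes the unadjusted version from a 2×2 table with a Wald interval. This is a simplification: the study above used adjusted unconditional logistic regression, and the counts below are hypothetical, chosen only to illustrate the arithmetic.

```python
import math

def odds_ratio_ci(exposed_cases, unexposed_cases,
                  exposed_controls, unexposed_controls, z=1.96):
    """Unadjusted odds ratio with a Wald confidence interval (z=1.96 for 95%)."""
    odds_ratio = (exposed_cases * unexposed_controls) / (
        unexposed_cases * exposed_controls)
    # Standard error of log(OR) from the four cell counts.
    se_log_or = math.sqrt(1.0 / exposed_cases + 1.0 / unexposed_cases
                          + 1.0 / exposed_controls + 1.0 / unexposed_controls)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Hypothetical counts: carriers vs non-carriers among cases and controls.
print(odds_ratio_ci(10, 90, 20, 80))
```

An interval that excludes 1, as in the reported 0.38-0.83, is what indicates a nominally significant association.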
Cellular aging is linked to deficiencies in efficient repair of DNA double strand breaks and authentic genome maintenance at the chromatin level. Aging poses a significant threat to adult stem cell function by triggering persistent DNA damage and ultimately cellular senescence. Senescence is often considered to be an irreversible process. Moreover, critical genomic regions engaged in persistent DNA damage accumulation are unknown. Here we report that 65% of naturally occurring repairable DNA damage in self-renewing adult stem cells occurs within transposable elements. Upregulation of Alu retrotransposon transcription upon ex vivo aging causes nuclear cytotoxicity associated with the formation of persistent DNA damage foci and loss of efficient DNA repair in pericentric chromatin. This occurs due to a failure to recruit condensin I and cohesin complexes. Our results demonstrate that the cytotoxicity of induced Alu repeats is functionally relevant for human adult stem cell aging. Stable suppression of Alu transcription can reverse the senescent phenotype, reinstating the cells' self-renewing properties and increasing their plasticity by altering so-called “master” pluripotency regulators.
adult stem cells; senescence; SINE/Alu transposons; DNA damage; H2AX; ChIP-seq; cohesin; condensin; PML body; induced pluripotency
Non-Hodgkin lymphoma (NHL) is a hematological malignancy of the immune system, and, as with autoimmune and inflammatory diseases (ADs), is influenced by genetic variation in the major histocompatibility complex (MHC). Persons with a history of specific ADs also have increased risk of NHL. As the coexistence of ADs and NHL could be caused by factors common to both diseases, here we examined whether some of the associated genetic signals are shared. Overlapping risk loci for NHL subtypes and several ADs were explored using data from genome-wide association studies. Several common genomic regions and susceptibility loci were identified, suggesting a potential shared genetic background. Two independent MHC regions showed the main overlap, with several alleles in the human leukocyte antigen (HLA) Class II region exhibiting an opposite risk effect for follicular lymphoma and type I diabetes. These results support continued investigation to further elucidate the relationship between lymphoma and autoimmune diseases.
Non-Hodgkin lymphoma; Autoimmune diseases; Genome-wide Association Studies; Human Leukocyte Antigen
Haplotype phasing is a well-studied problem in the context of genotype data. With the recent developments in high-throughput sequencing, new algorithms are needed for haplotype phasing when the number of samples sequenced is low and when the sequencing coverage is low. High-throughput sequencing technologies enable new possibilities for the inference of haplotypes. Since each read originates from a single chromosome, all the variant sites it covers must derive from the same haplotype. Moreover, the sequencing process yields much higher SNP density than previous methods, resulting in a higher correlation between neighboring SNPs. We offer a new approach for haplotype phasing, which leverages these two properties. Our suggested algorithm, called Perfect Phylogeny Haplotypes from Sequencing (PPHS), uses a perfect phylogeny model and models the sequencing errors explicitly. We evaluated our method on real and simulated data, and we demonstrate that the algorithm outperforms previous methods when the sequencing error rate is high or when coverage is low.
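The read-based constraint can be illustrated with a small sketch. It is not the PPHS algorithm and does not use a perfect phylogeny; it only scores a candidate haplotype pair by assigning each read to its closer haplotype and counting the remaining disagreements, which would then have to be explained as sequencing errors. The data layout (reads as site-to-allele dictionaries) and the function names are assumptions.

```python
def read_mismatches(read, haplotype):
    """Count covered sites where the read disagrees with the haplotype."""
    return sum(1 for site, allele in read.items() if haplotype[site] != allele)

def phasing_cost(reads, hap_a, hap_b):
    """Total mismatches when every read is assigned to its closer haplotype."""
    return sum(min(read_mismatches(r, hap_a), read_mismatches(r, hap_b))
               for r in reads)

# Two heterozygous sites, three toy reads that each cover both sites.
reads = [{0: 0, 1: 0}, {0: 1, 1: 1}, {0: 0, 1: 0}]
print(phasing_cost(reads, [0, 0], [1, 1]))  # phasing 00/11
print(phasing_cost(reads, [0, 1], [1, 0]))  # phasing 01/10
```

The phasing with the lower cost is the one better supported by the reads; a full method would weigh such costs against an explicit error model.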
RNA-Seq is a technique that uses Next Generation Sequencing to identify transcripts and estimate transcription levels. When applying this technique for quantification, one must contend with reads that align to multiple positions in the genome (multireads). Previous efforts to resolve multireads have shown that RNA-Seq expression estimation can be improved using probabilistic allocation of reads to genes. These methods use a probabilistic generative model for data generation and resolve ambiguity using likelihood-based approaches. In many instances, RNA-seq experiments are performed in the context of a population. The generative models of current methods do not take into account such population information, and it is an open question whether this information can improve quantification of the individual samples.
In order to explore the contribution of population level information in RNA-seq quantification, we apply a hierarchical probabilistic generative model, which assumes that expression levels of different individuals are sampled from a Dirichlet distribution with parameters specific to the population, and reads are sampled from the distribution of expression levels. We introduce an optimization procedure for the estimation of the model parameters, and use HapMap data and simulated data to demonstrate that the model yields a significant improvement in the accuracy of expression levels of paralogous genes.
We provide a proof of principle of the benefit of drawing on population commonalities to estimate expression. The results of our experiments demonstrate that this approach can be beneficial, primarily for estimation at the gene level.
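The hierarchical generative model described above can be sketched as a simulation: each individual's expression vector is drawn from a population-level Dirichlet distribution, and reads are then drawn from that vector. The sketch ignores multireads and the parameter-estimation procedure entirely; the function names and the toy Dirichlet parameters are assumptions.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def simulate_population(alpha, n_individuals, n_reads, seed=0):
    """Simulate (expression, read counts) per individual under the model."""
    rng = random.Random(seed)
    genes = list(range(len(alpha)))
    samples = []
    for _ in range(n_individuals):
        # Individual expression levels share the population parameters alpha.
        expression = sample_dirichlet(alpha, rng)
        # Each read picks its gene of origin according to the expression levels.
        reads = rng.choices(genes, weights=expression, k=n_reads)
        counts = [reads.count(g) for g in genes]
        samples.append((expression, counts))
    return samples

for expression, counts in simulate_population([1.0, 2.0, 3.0], 3, 100):
    print([round(e, 3) for e in expression], counts)
```

Inference runs in the opposite direction: given read counts (and ambiguous alignments), one estimates both the individual expression levels and the shared population parameters.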
genome-wide association study; genetic epidemiology; genetics; subclinical atherosclerosis; carotid intima media thickness; cardiovascular disease; cohort study; meta-analysis; risk
The availability of metagenomic sequencing data, generated by sequencing DNA pooled from multiple microbes living jointly, has increased sharply in the last few years with developments in sequencing technology. Characterizing the contents of metagenomic samples is a challenging task, which has been extensively attempted by both supervised and unsupervised techniques, each with its own limitations. Common to practically all the methods is the processing of single samples only; when multiple samples are sequenced, each is analyzed separately and the results are combined. In this paper we propose to perform a combined analysis of a set of samples in order to obtain a better characterization of each of the samples, and provide two applications of this principle. First, we use an unsupervised probabilistic mixture model to infer hidden components shared across metagenomic samples. We incorporate the model in a novel framework for studying association of microbial sequence elements with phenotypes, analogous to the genome-wide association studies performed on human genomes: we demonstrate that stratification may result in false discoveries of such associations, and that the components inferred by the model can be used to correct for this stratification. Second, we propose a novel read clustering (also termed “binning”) algorithm which operates on multiple samples simultaneously, leveraging the assumption that the different samples contain the same microbial species, possibly in different proportions. We show that integrating information across multiple samples yields more precise binning on each of the samples. Moreover, for both applications we demonstrate that given a fixed depth of coverage, the average per-sample performance generally increases with the number of sequenced samples as long as the per-sample coverage is high enough.
Microorganisms are extremely abundant and diverse, and occupy almost every habitat on earth. Most of these habitats contain a complex mixture of many different microorganisms, and the characterization of these metagenomic mixtures, in terms of both taxonomy and function, is of great interest to science and medicine. Current sequencing technologies produce large numbers of short DNA reads copied from the genomes of a metagenomic sample, which can be used to obtain a high resolution characterization of such samples. However, the analysis of such data is complicated by the fact that one cannot tell which sequencing reads originated from the same genome. We show that the joint analysis of multiple metagenomic samples, which takes advantage of the fact that the samples share common microbial types, achieves better single-sample characterization compared to the current analysis methods that operate on single samples only. We demonstrate how this approach can be used to infer microbial components without the use of external sequence data, and to cluster sequencing reads according to their species of origin. In both cases we show that the joint analysis enhances the average single-sample performance, thus providing better sample characterization.
Does exposure to terrorism lead to hostility toward minorities? Drawing on theories from clinical and social psychology, we propose a stress-based model of political extremism in which psychological distress—which is largely overlooked in political scholarship—and threat perceptions mediate the relationship between exposure to terrorism and attitudes toward minorities. To test the model, a representative sample of 469 Israeli Jewish respondents was interviewed on three occasions at six-month intervals. Structural Equation Modeling indicated that exposure to terrorism predicted psychological distress (t1), which predicted perceived threat from Palestinian citizens of Israel (t2), which, in turn, predicted exclusionist attitudes toward Palestinian citizens of Israel (t3). These findings provide solid evidence and a mechanism for the hypothesis that terrorism introduces nondemocratic attitudes threatening minority rights. It suggests that psychological distress plays an important role in political decision making and should be incorporated in models drawing upon political psychology.
terrorism; stress; psychological distress; threat perceptions; minority rights; political attitudes; extremism
Major political events such as terrorist attacks and forced relocation of citizens may have an immediate effect on attitudes towards ethnic minorities associated with these events. The psychological process that leads to political exclusionism of minority groups was examined using a field study among Israeli settlers in Gaza days prior to the Disengagement Plan adopted by the Israeli government on June 6, 2004 and enacted in August 2005. Lending credence to integrated threat theory and to theory on authoritarianism, our analyses show that the positive effect of religiosity on political exclusionism results from the two-staged mediation of authoritarianism and perceived threat. We conclude that religiosity fosters authoritarianism, which in turn tends to move people towards exclusionism both directly and through the mediation of perceived threat.
Exclusionism; Authoritarianism; Perceived threat; Terrorist attacks
This study analyses the antecedents of exclusionist political attitudes towards Palestinian citizens of Israel among Israeli immigrants from the former Soviet Union in comparison to Old Jewish Israelis (OJI). A large-scale study of exclusionist political attitudes was conducted in the face of ongoing terrorism in Israel through telephone surveys carried out in September 2003 with 641 OJI and 131 immigrants. The main goal of the survey was to estimate the influence of perceived loss and gain of resources—as a consequence of terror—on attitudes towards Palestinian Israelis, while controlling for other relevant predictors of exclusionism—i.e. authoritarianism or threat perception. Findings obtained via interaction analyses and structural equation modelling show that a) immigrants display higher levels of exclusionist political attitudes towards Palestinian citizens of Israel than OJI; b) loss of resources, authoritarianism, and hawkish (rightist) worldviews predict exclusionist political attitudes among both immigrants and non-immigrants; c) failure to undergo post-traumatic growth (resource gain) in response to terrorism (e.g. finding meaning in life, becoming closer to others) is a significant predictor of exclusionist political attitudes only among immigrants.
Ethnic Relations; Intolerance; Israel; Arabs; Immigration; Terror
Matrin 3 (MATR3) is a highly conserved, inner nuclear matrix protein with two zinc finger domains and two RNA recognition motifs (RRM), whose function is largely unknown. Recently we found MATR3 to be phosphorylated by the protein kinase ATM, which activates the cellular response to double strand breaks in the DNA. Here, we show that MATR3 interacts in an RNA-dependent manner with several proteins with established roles in RNA processing, and maintains its interaction with RNA via its RRM2 domain. Deep sequencing of the bound RNA (RIP-seq) identified several small noncoding RNA species. Using microarray analysis to explore MATR3's role in transcription, we identified 77 transcripts whose amounts depended on the presence of MATR3. We validated this finding with nine transcripts which were also bound to the MATR3 complex. Finally, we demonstrated the importance of MATR3 for maintaining the stability of several of these mRNA species and conclude that it has a role in mRNA stabilization. The data suggest that the cellular level of MATR3, known to be highly regulated, modulates the stability of a group of gene transcripts.
Recent advances in sequencing technologies set the stage for large, population-based studies, in which the DNA or RNA of thousands of individuals will be sequenced. Currently, however, such studies are still infeasible using a straightforward sequencing approach; as a result, a few multiplexing schemes have recently been suggested, in which a small number of DNA pools are sequenced, and the results are then deconvoluted using compressed sensing or similar approaches. These methods, however, are limited to the detection of rare variants.
In this paper we provide a new algorithm for the deconvolution of DNA pools multiplexing schemes. The presented algorithm utilizes a likelihood model and linear programming. The approach allows for the addition of external data, particularly imputation data, resulting in a flexible environment that is suitable for different applications.
Particularly, we demonstrate that both low and high allele frequency SNPs can be accurately genotyped when the DNA pooling scheme is performed in conjunction with microarray genotyping and imputation. Additionally, we demonstrate the use of our framework for the detection of cancer fusion genes from RNA sequences.
Recent genome-wide association studies (GWAS) of myocardial infarction (MI) and other forms of coronary artery disease (CAD) have led to the discovery of at least 13 genetic loci. In addition to the effect size, power to detect associations is largely driven by sample size. Therefore, to maximize the chance of finding novel susceptibility loci for CAD and MI, the Coronary ARtery DIsease Genome-wide Replication And Meta-analysis (CARDIoGRAM) consortium was formed.
Methods and Results
CARDIoGRAM combines data from all published and several unpublished GWAS in individuals with European ancestry; includes >22 000 cases with CAD, MI, or both and >60 000 controls; and unifies samples from the Atherosclerotic Disease VAscular functioN and genetiC Epidemiology study, CADomics, Cohorts for Heart and Aging Research in Genomic Epidemiology, deCODE, the German Myocardial Infarction Family Studies I, II, and III, Ludwigshafen Risk and Cardiovascular Health Study/AtheroRemo, MedStar, Myocardial Infarction Genetics Consortium, Ottawa Heart Genomics Study, PennCath, and the Wellcome Trust Case Control Consortium. Genotyping was carried out on Affymetrix or Illumina platforms followed by imputation of genotypes in most studies. On average, 2.2 million single nucleotide polymorphisms were generated per study. The results from each study are combined using meta-analysis. As proof of principle, we meta-analyzed risk variants at 9p21 and found that rs1333049 confers a 29% increase in risk for MI per copy (P=2×10^−20).
CARDIoGRAM is poised to contribute to our understanding of the role of common genetic variation on risk for CAD and MI.
coronary artery disease; myocardial infarction; meta-analysis; genetics
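As an illustration of how per-study results can be combined, here is a minimal sketch of a fixed-effect, inverse-variance meta-analysis for a single SNP. The abstract does not state which meta-analysis model the consortium used, so the fixed-effect choice is an assumption, and the effect sizes below are hypothetical.

```python
import math

def fixed_effect_meta(betas, ses):
    """Inverse-variance weighted pooled effect, its standard error and z-score.

    betas: per-study log odds ratios; ses: their standard errors.
    """
    weights = [1.0 / (se * se) for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    return beta, se, beta / se

# Hypothetical per-study estimates for one SNP.
beta, se, z = fixed_effect_meta([0.20, 0.30, 0.25], [0.10, 0.10, 0.05])
print(round(math.exp(beta), 3), round(z, 2))  # pooled odds ratio and z-score
```

Studies with smaller standard errors, typically the larger ones, receive more weight, which is why pooling increases power beyond that of any single study.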
To identify susceptibility loci for non-Hodgkin lymphoma (NHL) subtypes, we conducted a three-stage genome-wide association study. We identified two variants associated with follicular lymphoma (FL) in 1,465 FL cases/6,958 controls at 6p21.32 (rs10484561, rs7755224, r^2=1.0; combined p-values=1.12×10^-29, 2.00×10^-19), providing further support that MHC genetic variation influences FL susceptibility. Confirmatory evidence of a previously reported association was also found between chronic lymphocytic leukemia/small lymphocytic lymphoma and rs735665 (combined p-value=4.24×10^-9).
The prevalence of common chronic non-communicable diseases (CNCDs) far overshadows the prevalence of both monogenic and infectious diseases combined. All CNCDs, also called complex genetic diseases, have a heritable genetic component that can be used for pre-symptomatic risk assessment. Common single nucleotide polymorphisms (SNPs) that tag risk haplotypes across the genome currently account for a non-trivial portion of the germ-line genetic risk and we will likely continue to identify the remaining missing heritability in the form of rare variants, copy number variants and epigenetic modifications. Here, we describe a novel measure for calculating the lifetime risk of a disease, called the genetic composite index (GCI), and demonstrate its predictive value as a clinical classifier. The GCI only considers summary statistics of the effects of genetic variation and hence does not require the results of large-scale studies simultaneously assessing multiple risk factors. Combining GCI scores with environmental risk information provides an additional tool for clinical decision-making. The GCI can be populated with heritable risk information of any type, and thus represents a framework for CNCD pre-symptomatic risk assessment that can be populated as additional risk information is identified through next-generation technologies.
We conducted genome-wide association studies of non-Hodgkin lymphoma using Illumina HumanHap550 BeadChips to identify subtype-specific associations in follicular, diffuse large B-cell and chronic lymphocytic leukemia/small lymphocytic lymphomas. We found that rs6457327 on 6p21.33 was associated with susceptibility to follicular lymphoma (FL, N=189 cases/592 controls) with validation in an additional 456 FL cases and 2,785 controls (combined allelic p-value=4.7×10^−11). The region of strongest association overlaps C6orf15 (STG), located near psoriasis susceptibility region 1 (PSORS1).
Recent clinical evidence suggests an important role of lipid and amino acid metabolism in early pre-autoimmune stages of type 1 diabetes pathogenesis. We study the molecular paths associated with the incidence of insulitis and type 1 diabetes in the Non-Obese Diabetic (NOD) mouse model using available gene expression data from the pancreatic tissue of young pre-diabetic mice. We apply a graph-theoretic approach, using a modified color coding algorithm to detect optimal molecular paths associated with specific phenotypes in an integrated biological network encompassing heterogeneous interaction data types. In agreement with our recent clinical findings, we identified a path downregulated in early insulitis involving dihydroxyacetone phosphate acyltransferase (DHAPAT), a key regulator of ether phospholipid synthesis. The pathway involving serine/threonine-protein phosphatase (PP2A), an upstream regulator of lipid metabolism and insulin secretion, was found to be upregulated in early insulitis. Our findings provide further evidence for an important role of lipid metabolism in early stages of type 1 diabetes pathogenesis, and suggest that such dysregulation of lipids and related increased oxidative stress can be tracked to beta cells.
A characterization of the genetic variation of recently admixed populations may reveal historical population events, and is useful for the detection of single nucleotide polymorphisms (SNPs) associated with diseases through association studies and admixture mapping. Inference of locus-specific ancestry is key to our understanding of the genetic variation of such populations. While a number of methods for the inference of locus-specific ancestry are accurate when the ancestral populations are quite distant (e.g. African–Americans), current methods incur a large error rate when inferring the locus-specific ancestry in admixed populations where the ancestral populations are closely related (e.g. Americans of European descent).
Results: In this work, we extend previous methods for the inference of locus-specific ancestry by the incorporation of a refined model of recombination events. We present an efficient dynamic programming algorithm to infer the locus-specific ancestries in this model, resulting in a method that attains improved accuracies; the improvement is most significant when the ancestral populations are closely related. An evaluation on a wide range of scenarios, including admixtures of the 52 population groups from the Human Genome Diversity Project demonstrates that locus-specific ancestry can indeed be accurately inferred in these admixtures using our method. Finally, we demonstrate that imputation methods can be improved by the incorporation of locus-specific ancestry, when applied to admixed populations.
Availability: The implementation of the WINPOP model is available as part of the LAMP package at http://lamp.icsi.berkeley.edu/lamp
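To show what dynamic programming over locus-specific ancestry looks like in its simplest form, the sketch below runs the Viterbi algorithm on a plain hidden Markov model for a single haplotype. It is not the WINPOP model and does not include its refined recombination model; the per-population allele frequencies, the constant switch probability and the function name are assumptions for illustration.

```python
import math

def viterbi_ancestry(alleles, freqs, switch_prob):
    """Most likely ancestry path for one haplotype under a simple HMM.

    alleles: 0/1 allele per site; freqs[k][i]: frequency of allele 1 at
    site i in population k (strictly between 0 and 1); switch_prob: chance
    of an ancestry change between adjacent sites (at least 2 populations).
    """
    n_pops = len(freqs)
    stay = math.log(1.0 - switch_prob)
    switch = math.log(switch_prob / (n_pops - 1))

    def emit(k, i):
        f = freqs[k][i]
        return math.log(f if alleles[i] == 1 else 1.0 - f)

    score = [math.log(1.0 / n_pops) + emit(k, 0) for k in range(n_pops)]
    back = []  # back[i-1][k]: best predecessor of state k at site i
    for i in range(1, len(alleles)):
        new_score, pointers = [], []
        for k in range(n_pops):
            best_prev = max(range(n_pops),
                            key=lambda j: score[j] + (stay if j == k else switch))
            step = stay if best_prev == k else switch
            pointers.append(best_prev)
            new_score.append(score[best_prev] + step + emit(k, i))
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_pops), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Toy example: two well-separated populations and one ancestry switch.
toy_freqs = [[0.9] * 6, [0.1] * 6]
print(viterbi_ancestry([1, 1, 1, 0, 0, 0], toy_freqs, 0.05))
```

When the ancestral populations are closely related, the per-site frequencies are nearly equal and single sites carry little information, which is the regime the abstract identifies as hardest.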
Non-Hodgkin lymphoma (NHL) is the fifth most common cancer in the U.S. and few causes have been identified. Genetic association studies may help identify environmental risk factors and enhance our understanding of disease mechanisms.
768 coding and haplotype tagging SNPs in 146 genes were examined using Illumina GoldenGate technology in a large population-based case-control study of NHL in the San Francisco Bay Area (1,292 cases and 1,375 controls are included here). Statistical analyses were restricted to HIV-negative participants of white non-Hispanic origin. Genes involved in steroidogenesis, immune function, cell signaling, sunlight exposure, xenobiotic metabolism/oxidative stress, energy balance, and uptake and metabolism of cholesterol, folate and vitamin C were investigated. Sixteen SNPs in eight pathways and nine haplotypes were associated with NHL after correction for multiple testing at the adjusted q<0.10 level. Eight SNPs were tested in an independent case-control study of lymphoma in Germany (494 NHL cases and 494 matched controls). Novel associations with common variants in estrogen receptor 1 (ESR1) and in the vitamin C receptor and matrix metalloproteinase gene families were observed. Four ESR1 SNPs were associated with follicular lymphoma (FL) in the U.S. study, with rs3020314 remaining associated with reduced risk of FL after multiple testing adjustments [odds ratio (OR) = 0.42, 95% confidence interval (CI) = 0.23–0.77] and replication in the German study (OR = 0.24, 95% CI = 0.06–0.94). Several SNPs and haplotypes in the matrix metalloproteinase-3 (MMP3) and MMP9 genes and in the vitamin C receptor genes, solute carrier family 23 member 1 (SLC23A1) and SLC23A2, showed associations with NHL risk.
Our findings suggest a role for estrogen, vitamin C and matrix metalloproteinases in the pathogenesis of NHL that will require further validation.
The genotyping of mother–father–child trios is a very useful tool in disease association studies, as trios eliminate population stratification effects and increase the accuracy of haplotype inference. Unfortunately, the use of trios for association studies may reduce power, since it requires the genotyping of three individuals where only four independent haplotypes are involved. We describe here a method for genotyping a trio using two DNA pools, thus reducing the cost of genotyping trios to that of genotyping two individuals. Furthermore, we present extensions to the method that exploit the linkage disequilibrium structure to compensate for missing data and genotyping errors. We evaluated our method on trios from CEPH pedigree 66 of the Coriell Institute. We demonstrate that the error rates in the genotype calls of the proposed protocol are comparable to those of standard genotyping techniques, although the cost is reduced considerably. The approach described is generic and it can be applied to any genotyping platform that achieves a reasonable precision of allele frequency estimates from pools of two individuals. Using this approach, future trio-based association studies may be able to increase the sample size by 50% for the same cost and thereby increase the power to detect associations.
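The 50% figure follows from simple arithmetic, sketched below under the assumption that cost is proportional to the number of genotyping assays and that pooling itself adds no cost. The budget value and the function name are hypothetical.

```python
def trios_for_budget(budget_assays, assays_per_trio):
    """Number of complete trios that a fixed assay budget can cover."""
    return budget_assays // assays_per_trio

budget = 600  # hypothetical number of genotyping assays that can be afforded
standard = trios_for_budget(budget, 3)  # one assay per family member
pooled = trios_for_budget(budget, 2)    # two DNA pools per trio
print(standard, pooled, pooled / standard)
```

Going from three assays per trio to two means the same budget covers 3/2 as many trios, that is, a 50% larger sample.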