Cochin Jews form a small and unique community on the Malabar coast in southwest India. While the arrival of any putative Jewish ancestors of the community has been speculated to date as far back as biblical times (King Solomon’s era), a Jewish community on the Malabar coast has been documented only since the 9th century CE. Here, we explore the genetic history of Cochin Jews by collecting and genotyping 21 community members and combining the data with that of 707 individuals from 72 other Indian, Jewish, and Pakistani populations, together with additional individuals from worldwide populations. We applied comprehensive genome-wide analyses based on principal component analysis, FST, ADMIXTURE, identity-by-descent sharing, admixture linkage disequilibrium decay, haplotype sharing, allele sharing autocorrelation decay and contrasting the X chromosome with the autosomes. We find that, as reported by several previous studies, the genetics of Cochin Jews resembles that of local Indian populations. However, we also identify considerable Jewish genetic ancestry that is not present in any other Indian or Pakistani populations (with the exception of the Jewish Bene Israel, which we characterized previously). Taken together, these results indicate that Cochin Jews have both Jewish and Indian ancestry. Specifically, we detect a significant recent Jewish gene flow into this community 13–22 generations (~470–730 years) ago, with contributions from Yemenite, Sephardi, and Middle-Eastern Jews, in accordance with historical records. Genetic analyses also point to high endogamy and a recent population bottleneck in this population, which might explain the increased prevalence of some recessive diseases in Cochin Jews.
Electronic supplementary material
The online version of this article (doi:10.1007/s00439-016-1698-y) contains supplementary material, which is available to authorized users.
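The 13–22-generation estimate above rests on the decay of admixture linkage disequilibrium: under a single admixture pulse t generations ago, ancestry LD between loci separated by d Morgans decays roughly as exp(−t·d). The sketch below is a minimal, hypothetical illustration of that idea on noise-free synthetic data; the published analysis is far more elaborate, and all values here are made up.

```python
import math

def fit_admixture_time(distances_morgans, ld_values):
    # Single-pulse model: LD(d) = A * exp(-t*d), so log LD is linear in d
    # and the least-squares slope of the log-linear fit equals -t.
    n = len(distances_morgans)
    xs, ys = distances_morgans, [math.log(v) for v in ld_values]
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope

# synthetic decay curve generated with t = 17 generations
t_true = 17.0
ds = [0.01 * i for i in range(1, 21)]          # 0.01 to 0.20 Morgans
lds = [0.5 * math.exp(-t_true * d) for d in ds]
t_hat = fit_admixture_time(ds, lds)             # recovers t on clean data
```

With real data the LD curve is noisy and the fit is weighted accordingly, but the exponential-decay relationship is what converts an LD curve into a date in generations.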
The Bene Israel Jewish community from West India is a unique population whose history before the 18th century remains largely unknown. Bene Israel members consider themselves as descendants of Jews, yet the identity of Jewish ancestors and their arrival time to India are unknown, with speculations on arrival time varying between the 8th century BCE and the 6th century CE. Here, we characterize the genetic history of Bene Israel by collecting and genotyping 18 Bene Israel individuals. Combining these with 486 individuals from 41 other Jewish, Indian and Pakistani populations, and additional individuals from worldwide populations, we conducted comprehensive genome-wide analyses based on FST, principal component analysis, ADMIXTURE, identity-by-descent sharing, admixture linkage disequilibrium decay, haplotype sharing and allele sharing autocorrelation decay, as well as contrasted patterns between the X chromosome and the autosomes. The genetics of Bene Israel individuals resembles that of local Indian populations, while at the same time constituting a clearly separated and unique population in India. They are unique among Indian and Pakistani populations we analyzed in sharing considerable genetic ancestry with other Jewish populations. Taken together, the results from all analyses point to Bene Israel being an admixed population with both Jewish and Indian ancestry, with the genetic contribution of each of these ancestral populations being substantial. The admixture took place in the last millennium, about 19–33 generations ago. It involved Middle-Eastern Jews and was sex-biased, with more male Jewish and local female contribution. It was followed by a population bottleneck and high endogamy, which can lead to increased prevalence of recessive diseases in this population. This study provides an example of how genetic analysis advances our knowledge of human history in cases where other disciplines lack the relevant data to do so.
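The sex-biased-admixture inference above comes from contrasting the X chromosome with the autosomes: females carry two of every three X chromosomes in a population, so male-biased gene flow depresses X-linked ancestry relative to autosomal ancestry. A toy sketch of the expectation, with hypothetical contribution fractions m (male founders) and f (female founders) that are not taken from the paper:

```python
def autosomal_ancestry(m, f):
    # expected source ancestry on the autosomes: parents contribute equally
    return (m + f) / 2

def x_ancestry(m, f):
    # expected source ancestry on X: females carry 2 of every 3 X copies
    return (m + 2 * f) / 3

# male-biased gene flow (m > f) lowers X ancestry below autosomal ancestry,
# which is the signature the abstract describes
m, f = 0.6, 0.2
auto, x = autosomal_ancestry(m, f), x_ancestry(m, f)
```

Observing X ancestry below autosomal ancestry for the Jewish component is therefore consistent with more male Jewish and more local female contribution.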
The importance of hope in promoting conciliatory attitudes has been asserted in the field of conflict resolution. However, little is known about conditions inducing hope, especially in intractable conflicts, where reference to the outgroup may backfire. In the current research, five studies yielded convergent support for the hypothesis that hope for peace stems from a general perception of the world as changing. In Study 1, coders observed associations between belief in a changing world, hope regarding peace, and support for concessions. Study 2 revealed the hypothesized relations using self-reported measures. Studies 3 and 4 established causality by instilling a perception of the world as changing (vs. unchanging) using narrative and drawing manipulations. Study 5 compared the changing world message with a control condition during conflict escalation. Across studies, although the specific context was not referred to, the belief in a changing world increased support for concessions through hope for peace.
hope; emotions in conflict; belief in a changing world; intractable conflict
Disease risk and incidence differ between males and females, and sex is an important component of any investigation of the determinants of phenotypes or disease etiology. Further striking differences between men and women are known, for instance, at the metabolic level. The extent to which men and women vary at the level of the epigenome, however, is not well documented. DNA methylation is the best known epigenetic mechanism to date.
In order to shed light on epigenetic differences, we compared autosomal DNA methylation levels between men and women in blood in a large prospective European cohort of 1799 subjects,
and replicated our findings in three independent European cohorts. We identified and validated 1184 CpG sites to be differentially methylated between men and women and observed that these CpG sites were distributed across all autosomes. We showed that some of the differentially methylated loci also exhibit differential gene expression between men and women. Finally, we found that the differentially methylated loci are enriched among imprinted genes, and that their location in the genome is concentrated in CpG island shores.
Our epigenome-wide association study indicates that differences between men and women are so substantial that they should be considered in design and analyses of future studies.
Electronic supplementary material
The online version of this article (doi:10.1186/s13072-015-0035-3) contains supplementary material, which is available to authorized users.
DNA methylation; EWAS; Sex; Enrichment analysis; CpG; Imprinting; KORA; ALSPAC; EPICOR
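The core computation in an EWAS of this kind is a per-CpG test for a mean methylation difference between the sexes, followed by correction for the very large number of sites tested. A minimal sketch, using a Welch t statistic and a Bonferroni cutoff as stand-ins for whatever test and correction the study actually applied:

```python
import math
import statistics

def welch_t(a, b):
    # Welch two-sample t statistic for one CpG site's methylation levels
    # in group a (e.g. men) vs group b (e.g. women)
    va, vb = statistics.variance(a), statistics.variance(b)
    return ((statistics.mean(a) - statistics.mean(b))
            / math.sqrt(va / len(a) + vb / len(b)))

def bonferroni_alpha(alpha, n_sites):
    # genome-wide significance cutoff when n_sites CpGs are tested
    return alpha / n_sites
```

In practice each site's test is run within a regression that adjusts for covariates (age, cell composition, batch), but the multiple-testing logic is the same.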
Motivation: A basic problem of broad public and scientific interest is to use the DNA of an individual to infer the genomic ancestries of the parents. In particular, we are often interested in the fraction of each parent’s genome that comes from specific ancestries (e.g. European, African, Native American, etc). This has many applications ranging from understanding the inheritance of ancestry-related risks and traits to quantifying human assortative mating patterns.
Results: We model the problem of parental genomic ancestry inference as a pooled semi-Markov process. We develop a general mathematical framework for pooled semi-Markov processes and construct efficient inference algorithms for these models. Applying our inference algorithm to genotype data from 231 Mexican trios and 258 Puerto Rican trios where we have the true genomic ancestry of each parent, we demonstrate that our method accurately infers parameters of the semi-Markov processes and parents’ genomic ancestries. We additionally validated the method on simulations. Our model of pooled semi-Markov process and inference algorithms may be of independent interest in other settings in genomics and machine learning.
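The "pooled" aspect of the model above can be illustrated with a toy simulation: each parental haplotype is a semi-Markov sequence of ancestry tracts (exponential tract lengths), and the child's unphased genotype pools the two parents' tracts, hiding which parent contributed which ancestry. This is a hypothetical sketch with made-up ancestry labels and rates, not the paper's model.

```python
import random

random.seed(1)

def simulate_tracts(length_morgans, switch_rate, ancestries):
    # semi-Markov sketch: tract lengths are exponential with the given
    # switch rate; each tract draws an ancestry label independently
    tracts, pos = [], 0.0
    while pos < length_morgans:
        end = min(pos + random.expovariate(switch_rate), length_morgans)
        tracts.append((pos, end, random.choice(ancestries)))
        pos = end
    return tracts

def pooled_ancestry(tracts_a, tracts_b, x):
    # the child's unphased local ancestry at position x pools both
    # parental haplotypes, which is what makes parental inference hard
    def at(tracts):
        for start, end, anc in tracts:
            if start <= x < end:
                return anc
    return sorted((at(tracts_a), at(tracts_b)))

chrom_a = simulate_tracts(2.0, 5.0, ["EUR", "NAT"])
chrom_b = simulate_tracts(2.0, 5.0, ["EUR", "NAT"])
```

The inference problem the paper solves is the reverse direction: given only the pooled sequence, recover the per-parent semi-Markov parameters and ancestries.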
The recent advances in high-throughput sequencing technologies bring the potential of a better characterization of the genetic variation in humans and other organisms. On many occasions, either by design or by necessity, the sequencing procedure is performed on a pool of DNA samples with different abundances, where the abundance of each sample is unknown. Such a scenario is naturally occurring in the case of metagenomics analysis where a pool of bacteria is sequenced, or in the case of population studies involving DNA pools by design. Particularly, various pooling designs were recently suggested that can identify carriers of rare alleles in large cohorts, dramatically reducing the cost of such large-scale sequencing projects. A fundamental problem with such approaches for population studies is that the uncertainty of DNA proportions from different individuals in the pools might lead to spurious associations. Fortunately, it is often the case that the genotype data of at least some of the individuals in the pool is known. Here, we propose a method (eALPS) that uses the genotype data in conjunction with the pooled sequence data in order to accurately estimate the proportions of the samples in the pool, even in cases where not all individuals in the pool were genotyped (eALPS-LD). Using real data from a sequencing pooling study of non-Hodgkin's lymphoma, we demonstrate that the estimation of the proportions is crucial, since otherwise there is a risk of false discoveries. Additionally, we demonstrate that our approach is also applicable to the problem of quantification of species in metagenomics samples (eALPS-BCR) and is particularly suitable for metagenomic quantification of closely related species.
algorithms; alignment; cancer genomics; NP-completeness
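The estimation problem eALPS addresses can be reduced, in its simplest two-sample form, to least squares: the pooled allele frequency at each SNP is a mixture of the known per-sample allele frequencies, weighted by the unknown pool proportions. A hedged sketch of that closed-form special case (the actual method handles many samples, sequencing noise, and ungenotyped individuals):

```python
def estimate_pool_weight(g1, g2, pooled):
    # g1, g2: per-SNP allele frequencies of two genotyped samples
    # (0, 0.5, or 1 for a diploid genotype); pooled: observed pooled
    # allele frequencies. Solve pooled ~ w*g1 + (1-w)*g2 by least squares.
    num = sum((y - b) * (a - b) for a, b, y in zip(g1, g2, pooled))
    den = sum((a - b) ** 2 for a, b in zip(g1, g2))
    return num / den

g1 = [0.0, 0.5, 1.0, 0.5]
g2 = [1.0, 0.0, 0.5, 0.5]
pooled = [0.7, 0.15, 0.65, 0.5]   # generated with w = 0.3
w_hat = estimate_pool_weight(g1, g2, pooled)
```

With K samples this becomes a constrained least-squares problem (weights non-negative and summing to one), but the underlying linear-mixture relationship is the same.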
Data from large Next Generation Sequencing (NGS) experiments present challenges both in terms of costs associated with storage and in time required for file transfer. It is sometimes possible to store only a summary relevant to particular applications, but generally it is desirable to keep all information needed to revisit experimental results in the future. Thus, the need for efficient lossless compression methods for NGS reads arises. It has been shown that NGS-specific compression schemes can improve results over generic compression methods, such as the Lempel-Ziv algorithm, Burrows-Wheeler transform, or Arithmetic Coding. When a reference genome is available, effective compression can be achieved by first aligning the reads to the reference genome, and then encoding each read using the alignment position combined with the differences in the read relative to the reference. These reference-based methods have been shown to compress better than reference-free schemes, but the alignment step they require demands several hours of CPU time on a typical dataset, whereas reference-free methods can usually compress in minutes.
We present a new approach that achieves highly efficient compression by using a reference genome, but completely circumvents the need for alignment, affording a great reduction in the time needed to compress. In contrast to reference-based methods that first align reads to the genome, we hash all reads into Bloom filters to encode, and decode by querying the same Bloom filters using read-length subsequences of the reference genome. Further compression is achieved by using a cascade of such filters.
Our method, called BARCODE, runs an order of magnitude faster than reference-based methods, while compressing an order of magnitude better than reference-free methods, over a broad range of sequencing coverage. In high coverage (50-100 fold), compared to the best tested compressors, BARCODE saves 80-90% of the running time while only increasing space slightly.
Lossless Compression; Bloom Filter; Storage; Sequencing; NGS; Alignment-free
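The encode/decode asymmetry described above is the interesting part: reads are hashed into a Bloom filter (no alignment), and decoding slides a read-length window along the reference and asks the filter which windows were present. A minimal, hypothetical sketch with toy reads (BARCODE itself adds a cascade of filters and handles non-reference reads separately):

```python
import hashlib

class BloomFilter:
    def __init__(self, size, n_hashes):
        self.size, self.n_hashes = size, n_hashes
        self.bits = bytearray(size)

    def _positions(self, item):
        # derive n_hashes bit positions from salted SHA-256 digests
        for i in range(self.n_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        return all(self.bits[p] for p in self._positions(item))

# encode: hash every read into the filter (no alignment step)
reads = ["ACGTACGT", "GTACGTAA"]
bf = BloomFilter(size=1 << 12, n_hashes=3)
for r in reads:
    bf.add(r)

# decode: query the filter with every read-length window of the reference
reference = "TTACGTACGTAAGG"
L = 8
recovered = {reference[i:i + L] for i in range(len(reference) - L + 1)
             if reference[i:i + L] in bf}
```

Bloom filters never produce false negatives, so every stored read that occurs in the reference is recovered; the false positives a single filter admits are what the cascade of filters is there to eliminate.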
Recent technological improvements in the field of genetic data extraction give rise to the possibility of reconstructing the historical pedigrees of entire populations from the genotypes of individuals living today. Current methods are still not practical for real data scenarios, as they have limited accuracy and rely on unrealistic assumptions of monogamy and synchronized generations. To address these issues, we develop a new method for pedigree reconstruction based on formulations of the pedigree reconstruction problem as variants of graph coloring. The new formulation allows us to consider features that were overlooked by previous methods, resulting in a reconstruction of up to 5 generations back in time, with an order-of-magnitude improvement in false-negative rates over the state of the art, while keeping a lower level of false-positive rates. We demonstrate the accuracy of our method compared to previous approaches using simulation studies over a range of population sizes, including inbred and outbred populations, monogamous and polygamous mating patterns, as well as synchronous and asynchronous mating.
Learning the correct relationships between individuals from genetic data is a basic theoretical problem in the field of genetics, and has many practical consequences. A wide variety of statistical methods for genetic analysis assume the relationships between individuals are known, and can exploit relatedness information to improve inference. The current state-of-the-art methods for relationship inference consider pair-wise genetic similarity, and use it to infer the relationship between each pair of individuals. Reconstructing the pedigrees of an entire population directly has the potential to use more elaborate relationship information, and thus to obtain a better prediction of the familial relationships in the population. In contrast to the full set of pair-wise relationships in a population, genetic pedigrees provide a lossless and conflict-free structure for depicting the relationships between individuals. In an effort to make pedigree reconstruction practical, we developed a new method that is an order of magnitude more accurate than previous methods, and is the first method that has the ability to reconstruct polygamous pedigrees.
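The graph-coloring framing can be illustrated with a toy version: build a conflict graph whose vertices are individuals and whose edges join pairs that, based on genetic evidence, cannot belong to the same sibship; any proper coloring then partitions the generation into candidate sibships. This greedy sketch is only an illustration of the framing, not the paper's algorithm, and the conflict set below is made up.

```python
def greedy_coloring(n, conflicts):
    # conflicts: set of pairs (i, j) that cannot share a color, i.e.
    # individuals whose genotypes rule out a shared pair of parents
    color = {}
    for v in range(n):
        used = {color[u] for u in range(v)
                if (u, v) in conflicts or (v, u) in conflicts}
        c = 0
        while c in used:
            c += 1
        color[v] = c          # each color class is a candidate sibship
    return color

# toy conflict graph: 0-1 and 1-2 conflict, so 0 and 2 may be siblings
assignment = greedy_coloring(3, {(0, 1), (1, 2)})
```

Real instances add many further constraints (IBD sharing levels, generation assignment, polygamy), which is where the "variants of graph coloring" come in.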
Motivation: Gene–gene interactions are of potential biological and medical interest, as they can shed light on both the inheritance mechanism of a trait and on the underlying biological mechanisms. Evidence of epistatic interactions has been reported in both humans and other organisms. Unlike single-locus genome-wide association studies (GWAS), which proved efficient in detecting numerous genetic loci related with various traits, interaction-based GWAS have so far produced very few reproducible discoveries. Such studies introduce a great computational and statistical burden by necessitating a large number of hypotheses to be tested including all pairs of single nucleotide polymorphisms (SNPs). Thus, many software tools have been developed for interaction-based case–control studies, some leading to reliable discoveries. For quantitative data, on the other hand, only a handful of tools exist, and the computational burden is still substantial.
Results: We present an efficient algorithm for detecting epistasis in quantitative GWAS, achieving a substantial runtime speedup by avoiding the need to exhaustively test all SNP pairs using metric embedding and random projections. Unlike previous metric embedding methods for case–control studies, we introduce a new embedding, where each SNP is mapped to two Euclidean spaces. We implemented our method in a tool named EPIQ (EPIstasis detection for Quantitative GWAS), and we show by simulations that EPIQ requires hours of processing time where other methods require days and sometimes weeks. Applying our method to a dataset from the Ludwigshafen risk and cardiovascular health study, we discovered a pair of SNPs with a near-significant interaction (P = 2.2 × 10⁻¹³), in only 1.5 h on 10 processors.
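The speedup hinges on an embedding trick that can be sketched in a few lines: a pairwise interaction statistic of the form Σₖ yₖ·xᵢₖ·xⱼₖ is exactly an inner product between the vectors u = y∘xᵢ and v = xⱼ, and inner products can be approximated after random projection to low dimension, so promising pairs can be found by near-neighbor search instead of an exhaustive scan. This is a hedged sketch of the idea with a stand-in statistic, not EPIQ's actual test.

```python
import random

random.seed(0)

def pair_stat(x_i, x_j, y):
    # stand-in pairwise statistic: covariance-like score between the
    # SNP product and the phenotype
    return sum(a * b * c for a, b, c in zip(x_i, x_j, y))

n, d = 100, 50          # samples; projected dimension
y = [random.gauss(0, 1) for _ in range(n)]
snp_i = [random.choice([0, 1, 2]) for _ in range(n)]
snp_j = [random.choice([0, 1, 2]) for _ in range(n)]

# the embedding identity: map SNP i to u = y*x_i and SNP j to v = x_j,
# then the pair statistic is exactly the inner product <u, v>
u = [a * b for a, b in zip(y, snp_i)]
v = [float(a) for a in snp_j]
inner = sum(a * b for a, b in zip(u, v))

# random projection approximates <u, v> in d dimensions, enabling
# sub-quadratic search for large-statistic pairs (accuracy not asserted)
R = [[random.gauss(0, 1 / d ** 0.5) for _ in range(n)] for _ in range(d)]
u_p = [sum(r * a for r, a in zip(row, u)) for row in R]
v_p = [sum(r * a for r, a in zip(row, v)) for row in R]
approx = sum(a * b for a, b in zip(u_p, v_p))
```

Mapping each SNP into two spaces (once as u, once as v) is what lets every ordered pair be scored as a cross-space inner product.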
Motivation: Local ancestry analysis of genotype data from recently admixed populations (e.g. Latinos, African Americans) provides key insights into population history and disease genetics. Although methods for local ancestry inference have been extensively validated in simulations (under many unrealistic assumptions), no empirical study of local ancestry accuracy in Latinos exists to date. Hence, interpreting findings that rely on local ancestry in Latinos is challenging.
Results: Here, we use 489 nuclear families from the mainland USA, Puerto Rico and Mexico in conjunction with 3204 unrelated Latinos from the Multiethnic Cohort study to provide the first empirical characterization of local ancestry inference accuracy in Latinos. Our approach for identifying errors does not rely on simulations but on the observation that local ancestry in families follows Mendelian inheritance. We measure the rate of local ancestry assignments that lead to Mendelian inconsistencies in local ancestry in trios (MILANC), which provides a lower bound on errors in the local ancestry estimates. We show that MILANC rates observed in simulations underestimate the rate observed in real data, and that MILANC varies substantially across the genome. Second, across a wide range of methods, we observe that loci with large deviations in local ancestry also show enrichment in MILANC rates. Therefore, local ancestry estimates at such loci should be interpreted with caution. Finally, we reconstruct ancestral haplotype panels to be used as reference panels in local ancestry inference and show that ancestry inference is significantly improved by incorporating these reference panels.
Availability and implementation: We provide the reconstructed reference panels together with the maps of MILANC rates as a public resource for researchers analyzing local ancestry in Latinos at http://bogdanlab.pathology.ucla.edu.
Supplementary data are available at Bioinformatics online.
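The MILANC criterion itself is simple to state: at any locus, the child's two ancestry labels must be obtainable by drawing one label from the mother's pair and one from the father's pair; assignments violating this are Mendelian-inconsistent. A minimal sketch of that check (ancestry labels below are illustrative):

```python
from itertools import product

def milanc_consistent(mother, father, child):
    # each argument is the pair of local-ancestry labels carried at one
    # locus, e.g. ("EUR", "NAT"); the child must combine one label
    # inherited from each parent
    return any(sorted((a, b)) == sorted(child)
               for a, b in product(mother, father))
```

Counting loci where this check fails across trios gives the MILANC rate, a lower bound on the error rate since compensating errors in all three individuals can remain consistent.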
There is tremendous potential for genome sequencing to improve clinical diagnosis and care once it becomes routinely accessible, but this will require formalizing research methods into clinical best practices in the areas of sequence data generation, analysis, interpretation and reporting. The CLARITY Challenge was designed to spur convergence in methods for diagnosing genetic disease starting from clinical case history and genome sequencing data. DNA samples were obtained from three families with heritable genetic disorders and genomic sequence data were donated by sequencing platform vendors. The challenge was to analyze and interpret these data with the goals of identifying disease-causing variants and reporting the findings in a clinically useful format. Participating contestant groups were solicited broadly, and an independent panel of judges evaluated their performance.
A total of 30 international groups were engaged. The entries reveal a general convergence of practices on most elements of the analysis and interpretation process. However, even given this commonality of approach, only two groups identified the consensus candidate variants in all disease cases, demonstrating a need for consistent fine-tuning of the generally accepted methods. There was greater diversity of the final clinical report content and in the patient consenting process, demonstrating that these areas require additional exploration and standardization.
The CLARITY Challenge provides a comprehensive assessment of current practices for using genome sequencing to diagnose and report genetic diseases. There is remarkable convergence in bioinformatic techniques, but medical interpretation and reporting are areas that require further development by many groups.
Copy number variations (CNVs) are widely known to be an important mediator for diseases and traits. The development of high-throughput sequencing (HTS) technologies has provided great opportunities to identify CNV regions in mammalian genomes. In a typical experiment, millions of short reads obtained from a genome of interest are mapped to a reference genome. The mapping information can be used to identify CNV regions. One important challenge in analyzing the mapping information is the large fraction of reads that can be mapped to multiple positions. Most existing methods either only consider reads that can be uniquely mapped to the reference genome or randomly place a read to one of its mapping positions. Therefore, these methods have low power to detect CNVs located within repeated sequences. In this study, we propose a probabilistic model, CNVeM, which utilizes the inherent uncertainty of read mapping. We use maximum likelihood to estimate locations and copy numbers of copied regions and implement an expectation-maximization (EM) algorithm. One important contribution of our model is that we can distinguish between regions in the reference genome that differ from each other by as little as 0.1%. As our model aims to predict the copy number of each nucleotide, we can predict the CNV boundaries with high resolution. We apply our method to simulated datasets and achieve higher accuracy compared to CNVnator. Moreover, we apply our method to real data, in which we detected known CNVs. To our knowledge, this is the first attempt to predict CNVs at nucleotide resolution and to utilize uncertainty of read mapping.
algorithms; next generation sequencing; statistical models; structural genomics
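The EM treatment of multi-mapping reads can be boiled down to a few lines: in the E-step each ambiguous read is assigned fractionally to its candidate regions in proportion to the current abundance estimates, and in the M-step abundances are re-estimated from the fractional counts. A stripped-down sketch (not CNVeM itself, which additionally models per-nucleotide copy number and mapping errors):

```python
def em_multiread(read_maps, n_regions, n_iter=50):
    # read_maps: for each read, the list of regions it maps to equally well
    theta = [1.0 / n_regions] * n_regions      # region abundances
    for _ in range(n_iter):
        counts = [0.0] * n_regions
        for regions in read_maps:
            z = sum(theta[r] for r in regions)
            for r in regions:
                counts[r] += theta[r] / z      # E-step: fractional assignment
        total = sum(counts)
        theta = [c / total for c in counts]    # M-step: re-estimate abundances
    return theta

# 6 reads unique to region 0, 2 unique to region 1, 4 ambiguous:
# the ambiguous mass is split in proportion to the unique evidence
reads = [[0]] * 6 + [[1]] * 2 + [[0, 1]] * 4
theta = em_multiread(reads, 2)
```

The fixed point here allocates the ambiguous reads 3:1 toward region 0, matching the unique-read evidence; discarding or randomly placing multireads would distort exactly this allocation.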
RNA viruses exist in their hosts as populations of different but related strains. The virus population, often called quasispecies, is shaped by a combination of genetic change and natural selection. Genetic change is due to both point mutations and recombination events. We present a jumping hidden Markov model that describes the generation of viral quasispecies and a method to infer its parameters from next-generation sequencing data. The model introduces position-specific probability tables over the sequence alphabet to explain the diversity that can be found in the population at each site. Recombination events are indicated by a change of state, allowing a single observed read to originate from multiple sequences. We present a specific implementation of the expectation maximization (EM) algorithm to find maximum a posteriori estimates of the model parameters and a method to estimate the distribution of viral strains in the quasispecies. The model is validated on simulated data, showing the advantage of explicitly taking the recombination process into account, and applied to reads obtained from a clinical HIV sample.
evolution; HMM; statistical models; viruses
An important component in the analysis of genome-wide association studies involves the imputation of genotypes that have not been measured directly in the studied samples. The imputation procedure uses the linkage disequilibrium (LD) structure in the population to infer the genotype of an unobserved single nucleotide polymorphism. The LD structure is normally learned from a dense genotype map of a reference population that matches the studied population. In many instances there is no reference population that exactly matches the studied population, and a natural question arises as to how to choose the reference population for the imputation. Here we present a coalescent-based method that addresses this issue. In contrast to the current paradigm of imputation methods, our method assigns a different reference dataset for each sample in the studied population, and for each region in the genome. This allows the flexibility to account for the diversity within populations, as well as across populations. Furthermore, because our approach treats each region in the genome separately, our method is suitable for the imputation of recently admixed populations. We evaluated our method across a large set of populations and found that our choice of reference data set considerably improves the accuracy of imputation, especially for regions with low LD, for populations without an available reference population, and for admixed populations such as the Hispanic population. Our method is generic and can potentially be incorporated in any of the available imputation methods as an add-on.
genotype imputation; coalescent; GWAS; linkage disequilibrium; weighted panel
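The per-sample, per-region panel choice can be caricatured with a simple similarity rule: for a given genomic region, score each candidate reference panel by how closely its haplotypes match the target sample's observed alleles in that region, and pick the best-scoring panel. This toy Hamming-distance version is only an illustration of the idea; the paper's method is coalescent-based rather than a nearest-haplotype heuristic.

```python
def choose_panel(target_region, panels):
    # target_region: the sample's observed alleles (0/1) in one region;
    # panels: candidate reference panels, each a list of haplotypes.
    # Return the index of the panel with the closest haplotype.
    def best_match(panel):
        return min(sum(a != b for a, b in zip(hap, target_region))
                   for hap in panel)
    return min(range(len(panels)), key=lambda k: best_match(panels[k]))

target = [0, 1, 1, 0]
panels = [
    [[0, 0, 0, 0], [1, 1, 1, 1]],   # panel 0: distant haplotypes
    [[0, 1, 1, 1], [0, 1, 0, 0]],   # panel 1: contains a close haplotype
]
best = choose_panel(target, panels)
```

Because the choice is made region by region, an admixed sample can draw on different panels along the genome, which is what makes the approach suitable for recently admixed populations.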
Characterizing genetic diversity within and between populations has broad applications in studies of human disease and evolution. We propose a new approach, spatial ancestry analysis, for the modeling of genotypes in two- or three-dimensional space. In spatial ancestry analysis (SPA), we explicitly model the spatial distribution of each SNP by assigning an allele frequency as a continuous function in geographic space. We show that the explicit modeling of the allele frequency allows individuals to be localized on the map on the basis of their genetic information alone. We apply our SPA method to a European and a worldwide population genetic variation data set and identify SNPs showing large gradients in allele frequency, and we suggest these as candidate regions under selection. These regions include SNPs in the well-characterized LCT region, as well as at loci including FOXP2, OCA2 and LRP1B.
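The modeling idea in SPA can be sketched concretely: each SNP's allele frequency is a smooth (e.g. logistic) function of geographic coordinates, and an individual is localized at the position maximizing the likelihood of their genotypes under those surfaces. The code below is a hypothetical grid-search miniature with invented coefficients, not the published estimator.

```python
import math

def allele_freq(loc, coeffs, intercept):
    # SPA-style sketch: allele frequency as a logistic function of the
    # geographic coordinates loc = (x, y)
    s = coeffs[0] * loc[0] + coeffs[1] * loc[1] + intercept
    return 1.0 / (1.0 + math.exp(-s))

def locate(genotypes, snps, grid):
    # place an individual at the candidate location maximizing the
    # binomial likelihood of their 0/1/2 genotypes under each SNP surface
    def loglik(loc):
        total = 0.0
        for g, (coeffs, intercept) in zip(genotypes, snps):
            f = allele_freq(loc, coeffs, intercept)
            total += g * math.log(f) + (2 - g) * math.log(1 - f)
        return total
    return max(grid, key=loglik)

# two SNPs: frequency rising eastward and northward, respectively
snps = [((4.0, 0.0), 0.0), ((0.0, 4.0), 0.0)]
grid = [(-1.0, 0.0), (0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
location = locate([2, 1], snps, grid)   # genotype typical of the east
```

SNPs with steep fitted gradients are exactly the "large gradient" loci the abstract flags as selection candidates, since their frequencies change fastest across space.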
Motivation: It is becoming increasingly evident that the analysis of genotype data from recently admixed populations is providing important insights into medical genetics and population history. Such analyses have been used to identify novel disease loci, to understand recombination rate variation and to detect recent selection events. The utility of such studies crucially depends on accurate and unbiased estimation of the ancestry at every genomic locus in recently admixed populations. Although various methods have been proposed and shown to be extremely accurate in two-way admixtures (e.g. African Americans), only a few approaches have been proposed and thoroughly benchmarked on multi-way admixtures (e.g. Latino populations of the Americas).
Results: To address these challenges we introduce here methods for local ancestry inference which leverage the structure of linkage disequilibrium in the ancestral population (LAMP-LD), and incorporate the constraint of Mendelian segregation when inferring local ancestry in nuclear family trios (LAMP-HAP). Our algorithms uniquely combine hidden Markov models (HMMs) of haplotype diversity within a novel window-based framework to achieve superior accuracy as compared with published methods. Further, unlike previous methods, the structure of our HMM does not depend on the number of reference haplotypes but on a fixed constant, and it is thereby capable of utilizing large datasets while remaining highly efficient and robust to over-fitting. Through simulations and analysis of real data from 489 nuclear trio families from the mainland US, Puerto Rico and Mexico, we demonstrate that our methods achieve superior accuracy compared with published methods for local ancestry inference in Latinos.
Supplementary data are available at Bioinformatics online.
Polymorphisms in chemokine genes have been associated with human immunodeficiency virus (HIV)-related non-Hodgkin lymphoma (NHL) but are understudied in non-HIV-related NHL. Associations of NHL and NHL subtypes with polymorphisms and haplotypes in CCR5, CCR2, CCL5, CXCL12 and CX3CR1 were explored in a pooled analysis of three case-control studies (San Francisco Bay Area, California; United Kingdom; total: cases N=1610, controls N=1992). Adjusted unconditional logistic regression was used to estimate relative risks among HIV-negative non-Hispanic Caucasians. The CCR5 Δ32 deletion reduced the risk of NHL (odds ratio=0.56, 95% confidence interval=0.38-0.83) in men but not women with similar effects observed for diffuse large-cell and follicular lymphoma (FL). NHL risk also was reduced in men with the CCR2/CCR5 haplotype characterized by the CCR5 Δ32 deletion. The CCL5 −403A allele conferred reduced risks of FL and chronic lymphocytic leukemia/small lymphocytic lymphoma. Results should be interpreted conservatively. Continued investigation is warranted to confirm these findings.
Lymphoma non-Hodgkin; Chemokines; Polymorphism, genetic; Case-Control
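Odds ratios with confidence intervals like the 0.56 (0.38–0.83) reported above come from a 2×2 table (here via logistic regression; the unadjusted case reduces to the classic Woolf formula). A minimal sketch of the unadjusted computation on made-up counts:

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    # a, b: exposed cases/controls; c, d: unexposed cases/controls
    ratio = (a * d) / (b * c)
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)   # Woolf SE of log odds ratio
    lo = math.exp(math.log(ratio) - z * se)
    hi = math.exp(math.log(ratio) + z * se)
    return ratio, lo, hi

# illustrative counts only (not from the study)
ratio, lo, hi = odds_ratio_ci(10, 20, 30, 60)
```

An interval that excludes 1 (as 0.38–0.83 does) is what licenses calling the protective association statistically significant at the 5% level.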
Cellular aging is linked to deficiencies in efficient repair of DNA double strand breaks and authentic genome maintenance at the chromatin level. Aging poses a significant threat to adult stem cell function by triggering persistent DNA damage and ultimately cellular senescence. Senescence is often considered to be an irreversible process. Moreover, critical genomic regions engaged in persistent DNA damage accumulation are unknown. Here we report that 65% of naturally occurring repairable DNA damage in self-renewing adult stem cells occurs within transposable elements. Upregulation of Alu retrotransposon transcription upon ex vivo aging causes nuclear cytotoxicity associated with the formation of persistent DNA damage foci and loss of efficient DNA repair in pericentric chromatin. This occurs due to a failure to recruit condensin I and cohesin complexes. Our results demonstrate that the cytotoxicity of induced Alu repeats is functionally relevant for human adult stem cell aging. Stable suppression of Alu transcription can reverse the senescent phenotype, reinstating the cells' self-renewing properties and increasing their plasticity by altering so-called “master” pluripotency regulators.
adult stem cells; senescence; SINE/Alu transposons; DNA damage; H2AX; ChIP-seq; cohesin; condensin; PML body; induced pluripotency
Non-Hodgkin lymphoma (NHL) is a hematological malignancy of the immune system, and, as with autoimmune and inflammatory diseases (ADs), is influenced by genetic variation in the major histocompatibility complex (MHC). Persons with a history of specific ADs also have increased risk of NHL. As the coexistence of ADs and NHL could be caused by factors common to both diseases, here we examined whether some of the associated genetic signals are shared. Overlapping risk loci for NHL subtypes and several ADs were explored using data from genome-wide association studies. Several common genomic regions and susceptibility loci were identified suggesting a potential shared genetic background. Two independent MHC regions showed the main overlap, with several alleles in the human leukocyte antigen (HLA) Class II region exhibiting an opposite risk effect for follicular lymphoma and type I diabetes. These results support continued investigation to further elucidate the relationship between lymphoma and autoimmune diseases.
Non-Hodgkin lymphoma; Autoimmune diseases; Genome-wide Association Studies; Human Leukocyte Antigen
Haplotype phasing is a well-studied problem in the context of genotype data. With the recent developments in high-throughput sequencing, new algorithms are needed for haplotype phasing when the number of samples sequenced is low and when the sequencing coverage is low. High-throughput sequencing technologies enable new possibilities for the inference of haplotypes. Since each read originates from a single chromosome, all the variant sites it covers must derive from the same haplotype. Moreover, the sequencing process yields much higher SNP density than previous methods, resulting in a higher correlation between neighboring SNPs. We offer a new approach for haplotype phasing, which leverages these two properties. Our suggested algorithm, called Perfect Phylogeny Haplotypes from Sequencing (PPHS), uses a perfect phylogeny model and models the sequencing errors explicitly. We evaluated our method on real and simulated data, and we demonstrate that the algorithm outperforms previous methods when the sequencing error rate is high or when coverage is low.
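The read property exploited above — every variant covered by one read lies on the same haplotype — directly yields relative phase between the sites a read spans, and overlapping reads chain these constraints together. A toy error-free sketch using union-find with parity (PPHS itself additionally imposes the perfect phylogeny model and handles sequencing errors):

```python
def phase_from_reads(n_sites, reads):
    # reads: list of {site: allele} dicts with 0/1 alleles at het sites.
    # Each read comes from a single chromosome, so all its alleles lie on
    # one haplotype; union-find with parity assembles relative phases.
    parent = list(range(n_sites))
    rel = [0] * n_sites   # phase of each site relative to its parent

    def find(x):
        if parent[x] == x:
            return x, 0
        root, r = find(parent[x])
        parent[x] = root
        rel[x] ^= r       # path compression: phase relative to root
        return root, rel[x]

    for read in reads:
        sites = sorted(read)
        s0 = sites[0]
        for s in sites[1:]:
            r0, p0 = find(s0)
            r1, p1 = find(s)
            if r0 != r1:
                # two alleles seen on one read fix the sites' relative phase
                parent[r1] = r0
                rel[r1] = p0 ^ p1 ^ read[s0] ^ read[s]
    return [find(s)[1] for s in range(n_sites)]
```

For example, reads {0:0, 1:1} and {1:0, 2:0} (the second from the other chromosome) phase three sites into haplotypes 011 and 100. Sites not linked by any read remain in separate components, which is exactly where statistical models must take over.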
RNA-Seq is a technique that uses Next Generation Sequencing to identify transcripts and estimate transcription levels. When applying this technique for quantification, one must contend with reads that align to multiple positions in the genome (multireads). Previous efforts to resolve multireads have shown that RNA-Seq expression estimation can be improved using probabilistic allocation of reads to genes. These methods use a probabilistic generative model for data generation and resolve ambiguity using likelihood-based approaches. In many instances, RNA-seq experiments are performed in the context of a population. The generative models of current methods do not take into account such population information, and it is an open question whether this information can improve quantification of the individual samples.
In order to explore the contribution of population-level information in RNA-Seq quantification, we apply a hierarchical probabilistic generative model, which assumes that expression levels of different individuals are sampled from a Dirichlet distribution with parameters specific to the population, and reads are sampled from the distribution of expression levels. We introduce an optimization procedure for the estimation of the model parameters, and use HapMap data and simulated data to demonstrate that the model yields a significant improvement in the accuracy of expression-level estimates for paralogous genes.
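The two-level generative process described above can be sketched as a minimal simulation: per-individual expression levels are drawn from a population-level Dirichlet, and reads are drawn from those expression levels. The alpha values, read count, and gene count below are illustrative assumptions, not fitted parameters from the study:

```python
import random

# Minimal sketch of the hierarchical generative model: theta ~ Dirichlet(alpha)
# gives an individual's expression levels over genes, and each read picks a
# gene with probability proportional to theta. Numbers are illustrative.

def sample_dirichlet(alpha, rng):
    # A Dirichlet draw via normalized Gamma variates (standard construction).
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def simulate_individual(alpha, n_reads, rng):
    theta = sample_dirichlet(alpha, rng)                  # expression levels
    genes = rng.choices(range(len(alpha)), weights=theta, k=n_reads)
    counts = [genes.count(g) for g in range(len(alpha))]  # observed reads
    return theta, counts

rng = random.Random(0)
alpha = [8.0, 4.0, 2.0]   # population-level Dirichlet parameters (assumed)
theta, counts = simulate_individual(alpha, n_reads=1000, rng=rng)
```

Inference then runs in the opposite direction: given read counts from many individuals, estimate the shared alpha and each individual's theta, which is where the population information constrains the ambiguous (e.g., paralogous) gene assignments.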
We provide a proof of principle of the benefit of drawing on population commonalities to estimate expression. The results of our experiments demonstrate that this approach can be beneficial, primarily for estimation at the gene level.
genome-wide association study; genetic epidemiology; genetics; subclinical atherosclerosis; carotid intima media thickness; cardiovascular disease; cohort study; meta-analysis; risk
The availability of metagenomic sequencing data, generated by sequencing DNA pooled from multiple microbes living jointly, has increased sharply in the last few years with developments in sequencing technology. Characterizing the contents of metagenomic samples is a challenging task, which has been extensively addressed by both supervised and unsupervised techniques, each with its own limitations. Common to practically all the methods is the processing of single samples only; when multiple samples are sequenced, each is analyzed separately and the results are combined. In this paper we propose to perform a combined analysis of a set of samples in order to obtain a better characterization of each of the samples, and provide two applications of this principle. First, we use an unsupervised probabilistic mixture model to infer hidden components shared across metagenomic samples. We incorporate the model in a novel framework for studying association of microbial sequence elements with phenotypes, analogous to the genome-wide association studies performed on human genomes: we demonstrate that stratification may result in false discoveries of such associations, and that the components inferred by the model can be used to correct for this stratification. Second, we propose a novel read clustering (also termed “binning”) algorithm which operates on multiple samples simultaneously, leveraging the assumption that the different samples contain the same microbial species, possibly in different proportions. We show that integrating information across multiple samples yields more precise binning on each of the samples. Moreover, for both applications we demonstrate that given a fixed depth of coverage, the average per-sample performance generally increases with the number of sequenced samples as long as the per-sample coverage is high enough.
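The core idea of shared hidden components mixed in sample-specific proportions can be sketched with a toy EM procedure. Here reads are reduced to single categorical "sequence features", and both the component profiles and the per-sample mixture weights are estimated jointly; this is a simplified stand-in for the paper's model, with all data and parameter choices assumed for illustration:

```python
import random

# Toy EM for a mixture shared across samples: k hidden components
# (categorical distributions over sequence features) are common to all
# samples, while each sample has its own mixture weights. Illustrative
# sketch only, not the model from the paper.

def normalize(v):
    t = sum(v)
    return [x / t for x in v]

def em_shared_components(samples, k, n_iter=100, seed=0):
    """samples: list of lists of feature indices (one list per sample)."""
    rng = random.Random(seed)
    n_feat = max(max(s) for s in samples) + 1
    comp = [normalize([rng.random() + 0.1 for _ in range(n_feat)])
            for _ in range(k)]                      # shared emission profiles
    weights = [normalize([rng.random() + 0.1 for _ in range(k)])
               for _ in samples]                    # per-sample proportions
    for _ in range(n_iter):
        new_comp = [[1e-9] * n_feat for _ in range(k)]
        new_weights = []
        for s, reads in enumerate(samples):
            resp_totals = [1e-9] * k
            for f in reads:
                # E-step: posterior over components for this read.
                post = normalize([weights[s][c] * comp[c][f]
                                  for c in range(k)])
                for c in range(k):
                    resp_totals[c] += post[c]
                    new_comp[c][f] += post[c]       # pooled across samples
            new_weights.append(normalize(resp_totals))
        comp = [normalize(row) for row in new_comp]  # M-step
        weights = new_weights
    return comp, weights

# Two samples dominated by different "species" (features 0-1 vs 2-3).
samples = [[0] * 50 + [1] * 5, [2] * 50 + [3] * 5]
comp, weights = em_shared_components(samples, k=2)
```

Because the component profiles are pooled across all samples in the M-step, each sample's reads help refine the components used to characterize every other sample, which is the intuition behind the multi-sample gains reported in the abstract.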
Microorganisms are extremely abundant and diverse, and occupy almost every habitat on earth. Most of these habitats contain a complex mixture of many different microorganisms, and the characterization of these metagenomic mixtures, in terms of both taxonomy and function, is of great interest to science and medicine. Current sequencing technologies produce large numbers of short DNA reads copied from the genomes of a metagenomic sample, which can be used to obtain a high resolution characterization of such samples. However, the analysis of such data is complicated by the fact that one cannot tell which sequencing reads originated from the same genome. We show that the joint analysis of multiple metagenomic samples, which takes advantage of the fact that the samples share common microbial types, achieves better single-sample characterization compared to the current analysis methods that operate on single samples only. We demonstrate how this approach can be used to infer microbial components without the use of external sequence data, and to cluster sequencing reads according to their species of origin. In both cases we show that the joint analysis enhances the average single-sample performance, thus providing better sample characterization.
Does exposure to terrorism lead to hostility toward minorities? Drawing on theories from clinical and social psychology, we propose a stress-based model of political extremism in which psychological distress—which is largely overlooked in political scholarship—and threat perceptions mediate the relationship between exposure to terrorism and attitudes toward minorities. To test the model, a representative sample of 469 Israeli Jewish respondents was interviewed on three occasions at six-month intervals. Structural Equation Modeling indicated that exposure to terrorism predicted psychological distress (t1), which predicted perceived threat from Palestinian citizens of Israel (t2), which, in turn, predicted exclusionist attitudes toward Palestinian citizens of Israel (t3). These findings provide solid evidence and a mechanism for the hypothesis that terrorism introduces nondemocratic attitudes threatening minority rights. They suggest that psychological distress plays an important role in political decision making and should be incorporated into models drawing on political psychology.
terrorism; stress; psychological distress; threat perceptions; minority rights; political attitudes; extremism
Major political events such as terrorist attacks and forced relocation of citizens may have an immediate effect on attitudes towards ethnic minorities associated with these events. The psychological process that leads to political exclusionism of minority groups was examined using a field study among Israeli settlers in Gaza, conducted in the days before the implementation of the Disengagement Plan, which was adopted by the Israeli government on June 6, 2004 and enacted in August 2005. Lending credence to integrated threat theory and to theory on authoritarianism, our analyses show that the positive effect of religiosity on political exclusionism results from the two-staged mediation of authoritarianism and perceived threat. We conclude that religiosity fosters authoritarianism, which in turn tends to move people towards exclusionism both directly and through the mediation of perceived threat.
Exclusionism; Authoritarianism; Perceived threat; Terrorist attacks