1.  eALPS: Estimating Abundance Levels in Pooled Sequencing Using Available Genotyping Data 
Journal of Computational Biology  2013;20(11):861-877.
The recent advances in high-throughput sequencing technologies bring the potential of a better characterization of the genetic variation in humans and other organisms. On many occasions, either by design or by necessity, the sequencing procedure is performed on a pool of DNA samples with different abundances, where the abundance of each sample is unknown. Such a scenario occurs naturally in the case of metagenomics analysis where a pool of bacteria is sequenced, or in the case of population studies involving DNA pools by design. Particularly, various pooling designs were recently suggested that can identify carriers of rare alleles in large cohorts, dramatically reducing the cost of such large-scale sequencing projects. A fundamental problem with such approaches for population studies is that the uncertainty of DNA proportions from different individuals in the pools might lead to spurious associations. Fortunately, it is often the case that the genotype data of at least some of the individuals in the pool is known. Here, we propose a method (eALPS) that uses the genotype data in conjunction with the pooled sequence data in order to accurately estimate the proportions of the samples in the pool, even in cases where not all individuals in the pool were genotyped (eALPS-LD). Using real data from a sequencing pooling study of non-Hodgkin's lymphoma, we demonstrate that the estimation of the proportions is crucial, since otherwise there is a risk for false discoveries. Additionally, we demonstrate that our approach is also applicable to the problem of quantification of species in metagenomics samples (eALPS-BCR) and is particularly suitable for metagenomic quantification of closely related species.
PMCID: PMC4013753  PMID: 24144111
algorithms; alignment; cancer genomics; NP-completeness
2.  Fast lossless compression via cascading Bloom filters 
BMC Bioinformatics  2014;15(Suppl 9):S7.
Data from large Next Generation Sequencing (NGS) experiments present challenges both in terms of costs associated with storage and in time required for file transfer. It is sometimes possible to store only a summary relevant to particular applications, but generally it is desirable to keep all information needed to revisit experimental results in the future. Thus, the need for efficient lossless compression methods for NGS reads arises. It has been shown that NGS-specific compression schemes can improve results over generic compression methods, such as the Lempel-Ziv algorithm, Burrows-Wheeler transform, or Arithmetic Coding. When a reference genome is available, effective compression can be achieved by first aligning the reads to the reference genome, and then encoding each read using the alignment position combined with the differences in the read relative to the reference. These reference-based methods have been shown to compress better than reference-free schemes, but the alignment step they require demands several hours of CPU time on a typical dataset, whereas reference-free methods can usually compress in minutes.
We present a new approach that achieves highly efficient compression by using a reference genome, but completely circumvents the need for alignment, affording a great reduction in the time needed to compress. In contrast to reference-based methods that first align reads to the genome, we hash all reads into Bloom filters to encode, and decode by querying the same Bloom filters using read-length subsequences of the reference genome. Further compression is achieved by using a cascade of such filters.
Our method, called BARCODE, runs an order of magnitude faster than reference-based methods, while compressing an order of magnitude better than reference-free methods, over a broad range of sequencing coverage. In high coverage (50-100 fold), compared to the best tested compressors, BARCODE saves 80-90% of the running time while only increasing space slightly.
PMCID: PMC4168706  PMID: 25252952
Lossless Compression; Bloom Filter; Storage; Sequencing; NGS; Alignment-free
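The encode/decode idea in the BARCODE abstract above (hash all reads into a Bloom filter, then recover them by querying every read-length window of the reference) can be illustrated with a minimal sketch. This is not the authors' implementation: the class `BloomFilter`, the functions `encode` and `decode`, the filter size, and the SHA-256 based hashing are all assumptions chosen for illustration. The sketch uses a single filter and ignores false positives and reads that do not match the reference, which the abstract says are handled by a cascade of filters.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: k hash positions per item in an m-bit array."""

    def __init__(self, m_bits, k_hashes):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from salted SHA-256 digests
        # (an illustrative choice, not taken from the paper).
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))


def encode(reads, m_bits=1 << 16, k_hashes=4):
    """Hash every read into one Bloom filter (hypothetical parameters)."""
    bf = BloomFilter(m_bits, k_hashes)
    for read in reads:
        bf.add(read)
    return bf


def decode(bf, reference, read_len):
    """Return every read-length window of the reference the filter accepts.

    The result always contains each encoded read that occurs in the
    reference; it may also contain false positives.
    """
    windows = (reference[i:i + read_len]
               for i in range(len(reference) - read_len + 1))
    return {w for w in windows if w in bf}
```

A Bloom filter never produces false negatives, so every encoded read that appears in the reference is recovered; the extra windows that a single filter may wrongly accept are what a further filter level would have to correct.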
3.  Historical Pedigree Reconstruction from Extant Populations Using PArtitioning of RElatives (PREPARE) 
PLoS Computational Biology  2014;10(6):e1003610.
Recent technological improvements in the field of genetic data extraction give rise to the possibility of reconstructing the historical pedigrees of entire populations from the genotypes of individuals living today. Current methods are still not practical for real data scenarios as they have limited accuracy and make unrealistic assumptions of monogamy and synchronized generations. In order to address these issues, we develop a new method for pedigree reconstruction, PREPARE, which is based on formulations of the pedigree reconstruction problem as variants of graph coloring. The new formulation allows us to consider features that were overlooked by previous methods, resulting in a reconstruction of up to 5 generations back in time, with an order of magnitude improvement of false-negative rates over the state of the art, while keeping a lower level of false-positive rates. We demonstrate the accuracy of PREPARE compared to previous approaches using simulation studies over a range of population sizes, including inbred and outbred populations, monogamous and polygamous mating patterns, as well as synchronous and asynchronous mating.
Author Summary
Learning the correct relationships between individuals from genetic data is a basic theoretical problem in the field of genetics, and has many practical consequences. A wide variety of statistical methods for genetic analysis assume the relationships between individuals are known, and can manifest relatedness information to improve inference. The current state-of-the-art methods for relationship inference consider pair-wise genetic similarity, and use it to infer the relationship between each pair of individuals. Reconstructing the pedigrees of an entire population directly has the potential to use more elaborate relationship information, and thus obtains a better prediction of the familial relationships in the population. In contrast to the full set of pair-wise relationships in a population, genetic pedigrees provide a lossless and conflict-free structure for depicting the relationships between individuals. In an effort to make pedigree reconstruction practical we developed a new method, which is an order of magnitude more accurate than previous methods, and is the first method that has the ability to reconstruct polygamous pedigrees.
PMCID: PMC4063675  PMID: 24945698
4.  EPIQ—efficient detection of SNP–SNP epistatic interactions for quantitative traits 
Bioinformatics  2014;30(12):i19-i25.
Motivation: Gene–gene interactions are of potential biological and medical interest, as they can shed light on both the inheritance mechanism of a trait and on the underlying biological mechanisms. Evidence of epistatic interactions has been reported in both humans and other organisms. Unlike single-locus genome-wide association studies (GWAS), which proved efficient in detecting numerous genetic loci related with various traits, interaction-based GWAS have so far produced very few reproducible discoveries. Such studies introduce a great computational and statistical burden by necessitating a large number of hypotheses to be tested including all pairs of single nucleotide polymorphisms (SNPs). Thus, many software tools have been developed for interaction-based case–control studies, some leading to reliable discoveries. For quantitative data, on the other hand, only a handful of tools exist, and the computational burden is still substantial.
Results: We present an efficient algorithm for detecting epistasis in quantitative GWAS, achieving a substantial runtime speedup by avoiding the need to exhaustively test all SNP pairs using metric embedding and random projections. Unlike previous metric embedding methods for case–control studies, we introduce a new embedding, where each SNP is mapped to two Euclidean spaces. We implemented our method in a tool named EPIQ (EPIstasis detection for Quantitative GWAS), and we show by simulations that EPIQ requires hours of processing time where other methods require days and sometimes weeks. Applying our method to a dataset from the Ludwigshafen risk and cardiovascular health study, we discovered a pair of SNPs with a near-significant interaction (P = 2.2 × 10^−13), in only 1.5 h on 10 processors.
PMCID: PMC4229902  PMID: 24931983
5.  Analysis of Latino populations from GALA and MEC studies reveals genomic loci with biased local ancestry estimation 
Bioinformatics  2013;29(11):1407-1415.
Motivation: Local ancestry analysis of genotype data from recently admixed populations (e.g. Latinos, African Americans) provides key insights into population history and disease genetics. Although methods for local ancestry inference have been extensively validated in simulations (under many unrealistic assumptions), no empirical study of local ancestry accuracy in Latinos exists to date. Hence, interpreting findings that rely on local ancestry in Latinos is challenging.
Results: Here, we use 489 nuclear families from the mainland USA, Puerto Rico and Mexico in conjunction with 3204 unrelated Latinos from the Multiethnic Cohort study to provide the first empirical characterization of local ancestry inference accuracy in Latinos. Our approach for identifying errors does not rely on simulations but on the observation that local ancestry in families follows Mendelian inheritance. We measure the rate of local ancestry assignments that lead to Mendelian inconsistencies in local ancestry in trios (MILANC), which provides a lower bound on errors in the local ancestry estimates. We show that MILANC rates observed in simulations underestimate the rate observed in real data, and that MILANC varies substantially across the genome. Second, across a wide range of methods, we observe that loci with large deviations in local ancestry also show enrichment in MILANC rates. Therefore, local ancestry estimates at such loci should be interpreted with caution. Finally, we reconstruct ancestral haplotype panels to be used as reference panels in local ancestry inference and show that ancestry inference is significantly improved by incorporating these reference panels.
Availability and implementation: We provide the reconstructed reference panels together with the maps of MILANC rates as a public resource for researchers analyzing local ancestry in Latinos at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3661056  PMID: 23572411
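The Mendelian-consistency idea behind MILANC in the abstract above can be sketched as follows. The representation is an assumption made for illustration: each individual's local ancestry at a locus is an unordered pair of ancestry labels (one per haplotype), and a trio is consistent when the child's pair can be formed from one label of the mother and one of the father. The function names and the per-locus rate definition are hypothetical and may differ from the paper's exact statistic.

```python
from itertools import product


def is_mendelian_consistent(child, mother, father):
    """True if the child's ancestry pair can be formed by taking one
    ancestry label from the mother and one from the father."""
    target = sorted(child)
    return any(sorted((m, f)) == target
               for m, f in product(mother, father))


def inconsistency_rate(trio_calls):
    """Fraction of (child, mother, father) calls that violate
    Mendelian inheritance of local ancestry.

    trio_calls: list of (child, mother, father) tuples, where each
    element is a pair of ancestry labels at one locus.
    """
    if not trio_calls:
        return 0.0
    bad = sum(not is_mendelian_consistent(c, m, f)
              for c, m, f in trio_calls)
    return bad / len(trio_calls)
```

An inconsistent call shows that at least one of the three ancestry assignments is wrong, while a consistent call can still be wrong, which is why the abstract describes the rate as a lower bound on the error.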
6.  An international effort towards developing standards for best practices in analysis, interpretation and reporting of clinical genome sequencing results in the CLARITY Challenge 
Brownstein, Catherine A | Beggs, Alan H | Homer, Nils | Merriman, Barry | Yu, Timothy W | Flannery, Katherine C | DeChene, Elizabeth T | Towne, Meghan C | Savage, Sarah K | Price, Emily N | Holm, Ingrid A | Luquette, Lovelace J | Lyon, Elaine | Majzoub, Joseph | Neupert, Peter | McCallie Jr, David | Szolovits, Peter | Willard, Huntington F | Mendelsohn, Nancy J | Temme, Renee | Finkel, Richard S | Yum, Sabrina W | Medne, Livija | Sunyaev, Shamil R | Adzhubey, Ivan | Cassa, Christopher A | de Bakker, Paul IW | Duzkale, Hatice | Dworzyński, Piotr | Fairbrother, William | Francioli, Laurent | Funke, Birgit H | Giovanni, Monica A | Handsaker, Robert E | Lage, Kasper | Lebo, Matthew S | Lek, Monkol | Leshchiner, Ignaty | MacArthur, Daniel G | McLaughlin, Heather M | Murray, Michael F | Pers, Tune H | Polak, Paz P | Raychaudhuri, Soumya | Rehm, Heidi L | Soemedi, Rachel | Stitziel, Nathan O | Vestecka, Sara | Supper, Jochen | Gugenmus, Claudia | Klocke, Bernward | Hahn, Alexander | Schubach, Max | Menzel, Mortiz | Biskup, Saskia | Freisinger, Peter | Deng, Mario | Braun, Martin | Perner, Sven | Smith, Richard JH | Andorf, Janeen L | Huang, Jian | Ryckman, Kelli | Sheffield, Val C | Stone, Edwin M | Bair, Thomas | Black-Ziegelbein, E Ann | Braun, Terry A | Darbro, Benjamin | DeLuca, Adam P | Kolbe, Diana L | Scheetz, Todd E | Shearer, Aiden E | Sompallae, Rama | Wang, Kai | Bassuk, Alexander G | Edens, Erik | Mathews, Katherine | Moore, Steven A | Shchelochkov, Oleg A | Trapane, Pamela | Bossler, Aaron | Campbell, Colleen A | Heusel, Jonathan W | Kwitek, Anne | Maga, Tara | Panzer, Karin | Wassink, Thomas | Van Daele, Douglas | Azaiez, Hela | Booth, Kevin | Meyer, Nic | Segal, Michael M | Williams, Marc S | Tromp, Gerard | White, Peter | Corsmeier, Donald | Fitzgerald-Butt, Sara | Herman, Gail | Lamb-Thrush, Devon | McBride, Kim L | Newsom, David | Pierson, Christopher R | Rakowsky, Alexander T | Maver, Aleš | Lovrečić, Luca | Palandačić, Anja | Peterlin, Borut | 
Torkamani, Ali | Wedell, Anna | Huss, Mikael | Alexeyenko, Andrey | Lindvall, Jessica M | Magnusson, Måns | Nilsson, Daniel | Stranneheim, Henrik | Taylan, Fulya | Gilissen, Christian | Hoischen, Alexander | van Bon, Bregje | Yntema, Helger | Nelen, Marcel | Zhang, Weidong | Sager, Jason | Zhang, Lu | Blair, Kathryn | Kural, Deniz | Cariaso, Michael | Lennon, Greg G | Javed, Asif | Agrawal, Saloni | Ng, Pauline C | Sandhu, Komal S | Krishna, Shuba | Veeramachaneni, Vamsi | Isakov, Ofer | Halperin, Eran | Friedman, Eitan | Shomron, Noam | Glusman, Gustavo | Roach, Jared C | Caballero, Juan | Cox, Hannah C | Mauldin, Denise | Ament, Seth A | Rowen, Lee | Richards, Daniel R | Lucas, F Anthony San | Gonzalez-Garay, Manuel L | Caskey, C Thomas | Bai, Yu | Huang, Ying | Fang, Fang | Zhang, Yan | Wang, Zhengyuan | Barrera, Jorge | Garcia-Lobo, Juan M | González-Lamuño, Domingo | Llorca, Javier | Rodriguez, Maria C | Varela, Ignacio | Reese, Martin G | De La Vega, Francisco M | Kiruluta, Edward | Cargill, Michele | Hart, Reece K | Sorenson, Jon M | Lyon, Gholson J | Stevenson, David A | Bray, Bruce E | Moore, Barry M | Eilbeck, Karen | Yandell, Mark | Zhao, Hongyu | Hou, Lin | Chen, Xiaowei | Yan, Xiting | Chen, Mengjie | Li, Cong | Yang, Can | Gunel, Murat | Li, Peining | Kong, Yong | Alexander, Austin C | Albertyn, Zayed I | Boycott, Kym M | Bulman, Dennis E | Gordon, Paul MK | Innes, A Micheil | Knoppers, Bartha M | Majewski, Jacek | Marshall, Christian R | Parboosingh, Jillian S | Sawyer, Sarah L | Samuels, Mark E | Schwartzentruber, Jeremy | Kohane, Isaac S | Margulies, David M
Genome Biology  2014;15(3):R53.
There is tremendous potential for genome sequencing to improve clinical diagnosis and care once it becomes routinely accessible, but this will require formalizing research methods into clinical best practices in the areas of sequence data generation, analysis, interpretation and reporting. The CLARITY Challenge was designed to spur convergence in methods for diagnosing genetic disease starting from clinical case history and genome sequencing data. DNA samples were obtained from three families with heritable genetic disorders and genomic sequence data were donated by sequencing platform vendors. The challenge was to analyze and interpret these data with the goals of identifying disease-causing variants and reporting the findings in a clinically useful format. Participating contestant groups were solicited broadly, and an independent panel of judges evaluated their performance.
A total of 30 international groups were engaged. The entries reveal a general convergence of practices on most elements of the analysis and interpretation process. However, even given this commonality of approach, only two groups identified the consensus candidate variants in all disease cases, demonstrating a need for consistent fine-tuning of the generally accepted methods. There was greater diversity of the final clinical report content and in the patient consenting process, demonstrating that these areas require additional exploration and standardization.
The CLARITY Challenge provides a comprehensive assessment of current practices for using genome sequencing to diagnose and report genetic diseases. There is remarkable convergence in bioinformatic techniques, but medical interpretation and reporting are areas that require further development by many groups.
PMCID: PMC4073084  PMID: 24667040
7.  CNVeM: Copy Number Variation Detection Using Uncertainty of Read Mapping 
Journal of Computational Biology  2013;20(3):224-236.
Copy number variations (CNVs) are widely known to be an important mediator for diseases and traits. The development of high-throughput sequencing (HTS) technologies has provided great opportunities to identify CNV regions in mammalian genomes. In a typical experiment, millions of short reads obtained from a genome of interest are mapped to a reference genome. The mapping information can be used to identify CNV regions. One important challenge in analyzing the mapping information is the large fraction of reads that can be mapped to multiple positions. Most existing methods either only consider reads that can be uniquely mapped to the reference genome or randomly place a read to one of its mapping positions. Therefore, these methods have low power to detect CNVs located within repeated sequences. In this study, we propose a probabilistic model, CNVeM, that utilizes the inherent uncertainty of read mapping. We use maximum likelihood to estimate locations and copy numbers of copied regions and implement an expectation-maximization (EM) algorithm. One important contribution of our model is that we can distinguish between regions in the reference genome that differ from each other by as little as 0.1%. As our model aims to predict the copy number of each nucleotide, we can predict the CNV boundaries with high resolution. We apply our method to simulated datasets and achieve higher accuracy compared to CNVnator. Moreover, we apply our method to real data from which we detected known CNVs. To our knowledge, this is the first attempt to predict CNVs at nucleotide resolution and to utilize uncertainty of read mapping.
PMCID: PMC3590897  PMID: 23421794
algorithms; next generation sequencing; statistical models; structural genomics
8.  Probabilistic Inference of Viral Quasispecies Subject to Recombination 
Journal of Computational Biology  2013;20(2):113-123.
RNA viruses exist in their hosts as populations of different but related strains. The virus population, often called quasispecies, is shaped by a combination of genetic change and natural selection. Genetic change is due to both point mutations and recombination events. We present a jumping hidden Markov model that describes the generation of viral quasispecies and a method to infer its parameters from next-generation sequencing data. The model introduces position-specific probability tables over the sequence alphabet to explain the diversity that can be found in the population at each site. Recombination events are indicated by a change of state, allowing a single observed read to originate from multiple sequences. We present a specific implementation of the expectation maximization (EM) algorithm to find maximum a posteriori estimates of the model parameters and a method to estimate the distribution of viral strains in the quasispecies. The model is validated on simulated data, showing the advantage of explicitly taking the recombination process into account, and applied to reads obtained from a clinical HIV sample.
PMCID: PMC3576916  PMID: 23383997
evolution; HMM; statistical models; viruses
9.  A Generic Coalescent-based Framework for the Selection of a Reference Panel for Imputation 
Genetic epidemiology  2010;34(8):10.1002/gepi.20505.
An important component in the analysis of genome-wide association studies involves the imputation of genotypes that have not been measured directly in the studied samples. The imputation procedure uses the linkage disequilibrium (LD) structure in the population to infer the genotype of an unobserved single nucleotide polymorphism. The LD structure is normally learned from a dense genotype map of a reference population that matches the studied population. In many instances there is no reference population that exactly matches the studied population, and a natural question arises as to how to choose the reference population for the imputation. Here we present a Coalescent-based method that addresses this issue. In contrast to the current paradigm of imputation methods, our method assigns a different reference dataset for each sample in the studied population, and for each region in the genome. This allows the flexibility to account for the diversity within populations, as well as across populations. Furthermore, because our approach treats each region in the genome separately, our method is suitable for the imputation of recently admixed populations. We evaluated our method across a large set of populations and found that our choice of reference data set considerably improves the accuracy of imputation, especially for regions with low LD and for populations without a reference population available as well as for admixed populations such as the Hispanic population. Our method is generic and can potentially be incorporated in any of the available imputation methods as an add-on.
PMCID: PMC3876740  PMID: 21058333
genotype imputation; coalescent; GWAS; linkage disequilibrium; weighted panel
10.  A model-based approach for analysis of spatial structure in genetic data 
Nature genetics  2012;44(6):725-731.
Characterizing genetic diversity within and between populations has broad applications in studies of human disease and evolution. We propose a new approach, spatial ancestry analysis, for the modeling of genotypes in two- or three-dimensional space. In spatial ancestry analysis (SPA), we explicitly model the spatial distribution of each SNP by assigning an allele frequency as a continuous function in geographic space. We show that the explicit modeling of the allele frequency allows individuals to be localized on the map on the basis of their genetic information alone. We apply our SPA method to a European and a worldwide population genetic variation data set and identify SNPs showing large gradients in allele frequency, and we suggest these as candidate regions under selection. These regions include SNPs in the well-characterized LCT region, as well as at loci including FOXP2, OCA2 and LRP1B.
PMCID: PMC3592563  PMID: 22610118
11.  Fast and accurate inference of local ancestry in Latino populations 
Bioinformatics  2012;28(10):1359-1367.
Motivation: It is becoming increasingly evident that the analysis of genotype data from recently admixed populations is providing important insights into medical genetics and population history. Such analyses have been used to identify novel disease loci, to understand recombination rate variation and to detect recent selection events. The utility of such studies crucially depends on accurate and unbiased estimation of the ancestry at every genomic locus in recently admixed populations. Although various methods have been proposed and shown to be extremely accurate in two-way admixtures (e.g. African Americans), only a few approaches have been proposed and thoroughly benchmarked on multi-way admixtures (e.g. Latino populations of the Americas).
Results: To address these challenges we introduce here methods for local ancestry inference which leverage the structure of linkage disequilibrium in the ancestral population (LAMP-LD), and incorporate the constraint of Mendelian segregation when inferring local ancestry in nuclear family trios (LAMP-HAP). Our algorithms uniquely combine hidden Markov models (HMMs) of haplotype diversity within a novel window-based framework to achieve superior accuracy as compared with published methods. Further, unlike previous methods, the structure of our HMM does not depend on the number of reference haplotypes but on a fixed constant, and it is thereby capable of utilizing large datasets while remaining highly efficient and robust to over-fitting. Through simulations and analysis of real data from 489 nuclear trio families from the mainland US, Puerto Rico and Mexico, we demonstrate that our methods achieve superior accuracy compared with published methods for local ancestry inference in Latinos.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3348558  PMID: 22495753
12.  Chemokine polymorphisms and lymphoma: a pooled analysis 
Leukemia & lymphoma  2010;51(3):497-506.
Polymorphisms in chemokine genes have been associated with human immunodeficiency virus (HIV)-related non-Hodgkin lymphoma (NHL) but are understudied in non-HIV-related NHL. Associations of NHL and NHL subtypes with polymorphisms and haplotypes in CCR5, CCR2, CCL5, CXCL12 and CX3CR1 were explored in a pooled analysis of three case-control studies (San Francisco Bay Area, California; United Kingdom; total: cases N=1610, controls N=1992). Adjusted unconditional logistic regression was used to estimate relative risks among HIV-negative non-Hispanic Caucasians. The CCR5 Δ32 deletion reduced the risk of NHL (odds ratio=0.56, 95% confidence interval=0.38-0.83) in men but not women with similar effects observed for diffuse large-cell and follicular lymphoma (FL). NHL risk also was reduced in men with the CCR2/CCR5 haplotype characterized by the CCR5 Δ32 deletion. The CCL5 −403A allele conferred reduced risks of FL and chronic lymphocytic leukemia/small lymphocytic lymphoma. Results should be interpreted conservatively. Continued investigation is warranted to confirm these findings.
PMCID: PMC3443685  PMID: 20038229
Lymphoma non-Hodgkin; Chemokines; Polymorphism, genetic; Case-Control
13.  Inhibition of activated pericentromeric SINE/Alu repeat transcription in senescent human adult stem cells reinstates self-renewal 
Cell Cycle  2011;10(17):3016-3030.
Cellular aging is linked to deficiencies in efficient repair of DNA double strand breaks and authentic genome maintenance at the chromatin level. Aging poses a significant threat to adult stem cell function by triggering persistent DNA damage and ultimately cellular senescence. Senescence is often considered to be an irreversible process. Moreover, critical genomic regions engaged in persistent DNA damage accumulation are unknown. Here we report that 65% of naturally occurring repairable DNA damage in self-renewing adult stem cells occurs within transposable elements. Upregulation of Alu retrotransposon transcription upon ex vivo aging causes nuclear cytotoxicity associated with the formation of persistent DNA damage foci and loss of efficient DNA repair in pericentric chromatin. This occurs due to a failure to recruit condensin I and cohesin complexes. Our results demonstrate that the cytotoxicity of induced Alu repeats is functionally relevant for human adult stem cell aging. Stable suppression of Alu transcription can reverse the senescent phenotype, reinstating the cells' self-renewing properties and increasing their plasticity by altering so-called “master” pluripotency regulators.
PMCID: PMC3218602  PMID: 21862875
adult stem cells; senescence; SINE/Alu transposons; DNA damage; H2AX; ChIP-seq; cohesin; condensin; PML body; induced pluripotency
14.  A search for overlapping susceptibility loci between non-Hodgkin lymphoma and autoimmune diseases 
Genomics  2011;98(1):9-14.
Non-Hodgkin lymphoma (NHL) is a hematological malignancy of the immune system, and, as with autoimmune and inflammatory diseases (ADs), is influenced by genetic variation in the major histocompatibility complex (MHC). Persons with a history of specific ADs also have increased risk of NHL. As the coexistence of ADs and NHL could be caused by factors common to both diseases, here we examined whether some of the associated genetic signals are shared. Overlapping risk loci for NHL subtypes and several ADs were explored using data from genome-wide association studies. Several common genomic regions and susceptibility loci were identified suggesting a potential shared genetic background. Two independent MHC regions showed the main overlap, with several alleles in the human leukocyte antigen (HLA) Class II region exhibiting an opposite risk effect for follicular lymphoma and type I diabetes. These results support continued investigation to further elucidate the relationship between lymphoma and autoimmune diseases.
PMCID: PMC3129413  PMID: 21439368
Non-Hodgkin lymphoma; Autoimmune diseases; Genome-wide Association Studies; Human Leukocyte Antigen
15.  Haplotype reconstruction using perfect phylogeny and sequence data 
BMC Bioinformatics  2012;13(Suppl 6):S3.
Haplotype phasing is a well-studied problem in the context of genotype data. With the recent developments in high-throughput sequencing, new algorithms are needed for haplotype phasing when the number of samples sequenced is low and when the sequencing coverage is low. High-throughput sequencing technologies enable new possibilities for the inference of haplotypes. Since each read originates from a single chromosome, all the variant sites it covers must derive from the same haplotype. Moreover, the sequencing process yields much higher SNP density than previous methods, resulting in a higher correlation between neighboring SNPs. We offer a new approach for haplotype phasing, which leverages these two properties. Our suggested algorithm, called Perfect Phylogeny Haplotypes from Sequencing (PPHS), uses a perfect phylogeny model and models the sequencing errors explicitly. We evaluated our method on real and simulated data, and we demonstrate that the algorithm outperforms previous methods when the sequencing error rate is high or when coverage is low.
PMCID: PMC3330028  PMID: 22537042
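For readers unfamiliar with the perfect phylogeny model mentioned in the PPHS abstract above, the standard four-gamete test gives a compact way to check whether a set of binary haplotypes is compatible with it. The sketch below is a generic illustration, not the PPHS algorithm: it assumes the haplotypes are already known and error-free, whereas PPHS infers them from reads and models sequencing errors. The function name and the '0'/'1' string encoding are assumptions.

```python
from itertools import combinations


def admits_perfect_phylogeny(haplotypes):
    """Four-gamete test on binary haplotypes.

    haplotypes: list of equal-length strings over '0'/'1', one per
    haplotype. Returns False if some pair of sites shows all four
    allele combinations (00, 01, 10, 11), which no perfect phylogeny
    can explain; otherwise returns True.
    """
    if not haplotypes:
        return True
    n_sites = len(haplotypes[0])
    for i, j in combinations(range(n_sites), 2):
        gametes = {(h[i], h[j]) for h in haplotypes}
        if len(gametes) == 4:
            return False
    return True
```

In a phasing setting, a check of this kind restricts the candidate haplotype sets to those in which every pair of sites can be explained by one mutation per site on a tree.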
16.  MGMR: leveraging RNA-Seq population data to optimize expression estimation 
BMC Bioinformatics  2012;13(Suppl 6):S2.
RNA-Seq is a technique that uses Next Generation Sequencing to identify transcripts and estimate transcription levels. When applying this technique for quantification, one must contend with reads that align to multiple positions in the genome (multireads). Previous efforts to resolve multireads have shown that RNA-Seq expression estimation can be improved using probabilistic allocation of reads to genes. These methods use a probabilistic generative model for data generation and resolve ambiguity using likelihood-based approaches. In many instances, RNA-seq experiments are performed in the context of a population. The generative models of current methods do not take into account such population information, and it is an open question whether this information can improve quantification of the individual samples.
In order to explore the contribution of population level information in RNA-seq quantification, we apply a hierarchical probabilistic generative model, which assumes that expression levels of different individuals are sampled from a Dirichlet distribution with parameters specific to the population, and reads are sampled from the distribution of expression levels. We introduce an optimization procedure for the estimation of the model parameters, and use HapMap data and simulated data to demonstrate that the model yields a significant improvement in the accuracy of expression levels of paralogous genes.
We provide a proof of principle of the benefit of drawing on population commonalities to estimate expression. The results of our experiments demonstrate this approach can be beneficial, primarily for estimation at the gene level.
PMCID: PMC3358656  PMID: 22537041
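The "probabilistic allocation of reads to genes" that the MGMR abstract above attributes to earlier single-sample methods can be illustrated with a plain expectation-maximization (EM) loop. This is a generic sketch under simplifying assumptions, not MGMR itself: it treats one sample, ignores gene lengths and sequencing errors, and omits the population-level Dirichlet prior described in the abstract. The function name and input layout are hypothetical.

```python
def em_multiread_abundance(read_to_genes, genes, n_iter=50):
    """Estimate relative gene abundances from ambiguous read mappings.

    read_to_genes: list with one entry per read, each a list of the
        genes that read aligns to (more than one gene for a multiread).
    genes: list of all gene names.
    Returns a dict gene -> estimated fraction of reads.
    """
    theta = {g: 1.0 / len(genes) for g in genes}
    n_reads = len(read_to_genes)
    for _ in range(n_iter):
        # E-step: split each read among its candidate genes in
        # proportion to the current abundance estimates.
        counts = {g: 0.0 for g in genes}
        for candidates in read_to_genes:
            total = sum(theta[g] for g in candidates)
            for g in candidates:
                counts[g] += theta[g] / total
        # M-step: abundances are the normalized expected counts.
        theta = {g: counts[g] / n_reads for g in genes}
    return theta
```

With three reads unique to gene A, one unique to gene B and two reads mapping to both, the estimate for A converges to 0.75, so the shared reads end up split three to one rather than evenly.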
17.  Meta-analysis of genome-wide association studies from the CHARGE consortium identifies common variants associated with carotid intima media thickness and plaque 
Bis, Joshua C. | Kavousi, Maryam | Franceschini, Nora | Isaacs, Aaron | Abecasis, Gonçalo R | Schminke, Ulf | Post, Wendy | Smith, Albert V. | Cupples, L. Adrienne | Markus, Hugh S | Schmidt, Reinhold | Huffman, Jennifer E. | Lehtimäki, Terho | Baumert, Jens | Münzel, Thomas | Heckbert, Susan R. | Dehghan, Abbas | North, Kari | Oostra, Ben | Bevan, Steve | Stoegerer, Eva-Maria | Hayward, Caroline | Raitakari, Olli | Meisinger, Christa | Schillert, Arne | Sanna, Serena | Völzke, Henry | Cheng, Yu-Ching | Thorsson, Bolli | Fox, Caroline S. | Rice, Kenneth | Rivadeneira, Fernando | Nambi, Vijay | Halperin, Eran | Petrovic, Katja E. | Peltonen, Leena | Wichmann, H. Erich | Schnabel, Renate B. | Dörr, Marcus | Parsa, Afshin | Aspelund, Thor | Demissie, Serkalem | Kathiresan, Sekar | Reilly, Muredach P. | Uitterlinden, Andre | Couper, David J. | Sitzer, Matthias | Kähönen, Mika | Illig, Thomas | Wild, Philipp S. | Orru, Marco | Lüdemann, Jan | Shuldiner, Alan R. | Eiriksdottir, Gudny | White, Charles C. | Rotter, Jerome I. | Hofman, Albert | Seissler, Jochen | Zeller, Tanja | Usala, Gianluca | Ernst, Florian | Launer, Lenore J. | D'Agostino, Ralph B. | O'Leary, Daniel H. | Ballantyne, Christie | Thiery, Joachim | Ziegler, Andreas | Lakatta, Edward G. | Chilukoti, Ravi Kumar | Harris, Tamara B. | Wolf, Philip A. | Psaty, Bruce M. | Polak, Joseph F | Li, Xia | Rathmann, Wolfgang | Uda, Manuela | Boerwinkle, Eric | Klopp, Norman | Schmidt, Helena | Wilson, James F | Viikari, Jorma | Koenig, Wolfgang | Blankenberg, Stefan | Newman, Anne B. | Witteman, Jacqueline | Heiss, Gerardo | van Duijn, Cornelia | Scuteri, Angelo | Homuth, Georg | Mitchell, Braxton D. | Gudnason, Vilmundur | O’Donnell, Christopher J.
Nature Genetics  2011;43(10):940-947.
PMCID: PMC3257519  PMID: 21909108
genome-wide association study; genetic epidemiology; genetics; subclinical atherosclerosis; carotid intima media thickness; cardiovascular disease; cohort study; meta-analysis; risk
18.  Joint Analysis of Multiple Metagenomic Samples 
PLoS Computational Biology  2012;8(2):e1002373.
The availability of metagenomic sequencing data, generated by sequencing DNA pooled from multiple microbes living jointly, has increased sharply in the last few years with developments in sequencing technology. Characterizing the contents of metagenomic samples is a challenging task, which has been extensively attempted by both supervised and unsupervised techniques, each with its own limitations. Common to practically all the methods is the processing of single samples only; when multiple samples are sequenced, each is analyzed separately and the results are combined. In this paper we propose to perform a combined analysis of a set of samples in order to obtain a better characterization of each of the samples, and provide two applications of this principle. First, we use an unsupervised probabilistic mixture model to infer hidden components shared across metagenomic samples. We incorporate the model in a novel framework for studying association of microbial sequence elements with phenotypes, analogous to the genome-wide association studies performed on human genomes: we demonstrate that stratification may result in false discoveries of such associations, and that the components inferred by the model can be used to correct for this stratification. Second, we propose a novel read clustering (also termed “binning”) algorithm which operates on multiple samples simultaneously, leveraging the assumption that the different samples contain the same microbial species, possibly in different proportions. We show that integrating information across multiple samples yields more precise binning on each of the samples. Moreover, for both applications we demonstrate that given a fixed depth of coverage, the average per-sample performance generally increases with the number of sequenced samples as long as the per-sample coverage is high enough.
Author Summary
Microorganisms are extremely abundant and diverse, and occupy almost every habitat on earth. Most of these habitats contain a complex mixture of many different microorganisms, and the characterization of these metagenomic mixtures, in terms of both taxonomy and function, is of great interest to science and medicine. Current sequencing technologies produce large numbers of short DNA reads copied from the genomes of a metagenomic sample, which can be used to obtain a high resolution characterization of such samples. However, the analysis of such data is complicated by the fact that one cannot tell which sequencing reads originated from the same genome. We show that the joint analysis of multiple metagenomic samples, which takes advantage of the fact that the samples share common microbial types, achieves better single-sample characterization compared to the current analysis methods that operate on single samples only. We demonstrate how this approach can be used to infer microbial components without the use of external sequence data, and to cluster sequencing reads according to their species of origin. In both cases we show that the joint analysis enhances the average single-sample performance, thus providing better sample characterization.
PMCID: PMC3280959  PMID: 22359490
19.  A New Stress-Based Model of Political Extremism 
Does exposure to terrorism lead to hostility toward minorities? Drawing on theories from clinical and social psychology, we propose a stress-based model of political extremism in which psychological distress—which is largely overlooked in political scholarship—and threat perceptions mediate the relationship between exposure to terrorism and attitudes toward minorities. To test the model, a representative sample of 469 Israeli Jewish respondents was interviewed on three occasions at six-month intervals. Structural Equation Modeling indicated that exposure to terrorism predicted psychological distress (t1), which predicted perceived threat from Palestinian citizens of Israel (t2), which, in turn, predicted exclusionist attitudes toward Palestinian citizens of Israel (t3). These findings provide solid evidence and a mechanism for the hypothesis that terrorism introduces nondemocratic attitudes threatening minority rights. It suggests that psychological distress plays an important role in political decision making and should be incorporated in models drawing upon political psychology.
PMCID: PMC3229259  PMID: 22140275
terrorism; stress; psychological distress; threat perceptions; minority rights; political attitudes; extremism
20.  Authoritarianism, perceived threat and exclusionism on the eve of the Disengagement: Evidence from Gaza 
Major political events such as terrorist attacks and forced relocation of citizens may have an immediate effect on attitudes towards ethnic minorities associated with these events. The psychological process that leads to political exclusionism of minority groups was examined using a field study among Israeli settlers in Gaza days prior to the Disengagement Plan adopted by the Israeli government on June 6, 2004 and enacted in August 2005. Lending credence to integrated threat theory and to theory on authoritarianism, our analyses show that the positive effect of religiosity on political exclusionism results from the two-staged mediation of authoritarianism and perceived threat. We conclude that religiosity fosters authoritarianism, which in turn tends to move people towards exclusionism both directly and through the mediation of perceived threat.
PMCID: PMC3229268  PMID: 22140286
Exclusionism; Authoritarianism; Perceived threat; Terrorist attacks
21.  Terror, Resource Gains and Exclusionist Political Attitudes among New Immigrants and Veteran Israelis 
This study analyses the antecedents of exclusionist political attitudes towards Palestinian citizens of Israel among Israeli immigrants from the former Soviet Union in comparison to Old Jewish Israelis (OJI). A large-scale study of exclusionist political attitudes was conducted in the face of ongoing terrorism in Israel through telephone surveys carried out in September 2003 with 641 OJI and 131 immigrants. The main goal of the survey was to estimate the influence of perceived loss and gain of resources—as a consequence of terror—on attitudes towards Palestinian Israelis, while controlling for other relevant predictors of exclusionism—i.e. authoritarianism or threat perception. Findings obtained via interaction analyses and structural equation modelling show that a) immigrants display higher levels of exclusionist political attitudes towards Palestinian citizens of Israel than OJI; b) loss of resources, authoritarianism, and hawkish (rightist) worldviews predict exclusionist political attitudes among both immigrants and non-immigrants; c) failure to undergo post-traumatic growth (resource gain) in response to terrorism (e.g. finding meaning in life, becoming closer to others) is a significant predictor of exclusionist political attitudes only among immigrants.
PMCID: PMC3226700  PMID: 22140351
Ethnic Relations; Intolerance; Israel; Arabs; Immigration; Terror
22.  Large-scale association analysis identifies 13 new susceptibility loci for coronary artery disease 
Schunkert, Heribert | König, Inke R. | Kathiresan, Sekar | Reilly, Muredach P. | Assimes, Themistocles L. | Holm, Hilma | Preuss, Michael | Stewart, Alexandre F. R. | Barbalic, Maja | Gieger, Christian | Absher, Devin | Aherrahrou, Zouhair | Allayee, Hooman | Altshuler, David | Anand, Sonia S. | Andersen, Karl | Anderson, Jeffrey L. | Ardissino, Diego | Ball, Stephen G. | Balmforth, Anthony J. | Barnes, Timothy A. | Becker, Diane M. | Becker, Lewis C. | Berger, Klaus | Bis, Joshua C. | Boekholdt, S. Matthijs | Boerwinkle, Eric | Braund, Peter S. | Brown, Morris J. | Burnett, Mary Susan | Buysschaert, Ian | Carlquist, Cardiogenics, John F. | Chen, Li | Cichon, Sven | Codd, Veryan | Davies, Robert W. | Dedoussis, George | Dehghan, Abbas | Demissie, Serkalem | Devaney, Joseph M. | Do, Ron | Doering, Angela | Eifert, Sandra | El Mokhtari, Nour Eddine | Ellis, Stephen G. | Elosua, Roberto | Engert, James C. | Epstein, Stephen E. | Faire, Ulf de | Fischer, Marcus | Folsom, Aaron R. | Freyer, Jennifer | Gigante, Bruna | Girelli, Domenico | Gretarsdottir, Solveig | Gudnason, Vilmundur | Gulcher, Jeffrey R. | Halperin, Eran | Hammond, Naomi | Hazen, Stanley L. | Hofman, Albert | Horne, Benjamin D. | Illig, Thomas | Iribarren, Carlos | Jones, Gregory T. | Jukema, J.Wouter | Kaiser, Michael A. | Kaplan, Lee M. | Kastelein, John J.P. | Khaw, Kay-Tee | Knowles, Joshua W. | Kolovou, Genovefa | Kong, Augustine | Laaksonen, Reijo | Lambrechts, Diether | Leander, Karin | Lettre, Guillaume | Li, Mingyao | Lieb, Wolfgang | Linsel-Nitschke, Patrick | Loley, Christina | Lotery, Andrew J. | Mannucci, Pier M. | Maouche, Seraya | Martinelli, Nicola | McKeown, Pascal P. | Meisinger, Christa | Meitinger, Thomas | Melander, Olle | Merlini, Pier Angelica | Mooser, Vincent | Morgan, Thomas | Mühleisen, Thomas W. | Muhlestein, Joseph B. | Münzel, Thomas | Musunuru, Kiran | Nahrstaedt, Janja | Nelson, Christopher P. | Nöthen, Markus M. | Olivieri, Oliviero | Patel, Riyaz S. | Patterson, Chris C. 
| Peters, Annette | Peyvandi, Flora | Qu, Liming | Quyyumi, Arshed A. | Rader, Daniel J. | Rallidis, Loukianos S. | Rice, Catherine | Rosendaal, Frits R. | Rubin, Diana | Salomaa, Veikko | Sampietro, M. Lourdes | Sandhu, Manj S. | Schadt, Eric | Schäfer, Arne | Schillert, Arne | Schreiber, Stefan | Schrezenmeir, Jürgen | Schwartz, Stephen M. | Siscovick, David S. | Sivananthan, Mohan | Sivapalaratnam, Suthesh | Smith, Albert | Smith, Tamara B. | Snoep, Jaapjan D. | Soranzo, Nicole | Spertus, John A. | Stark, Klaus | Stirrups, Kathy | Stoll, Monika | Tang, W. H. Wilson | Tennstedt, Stephanie | Thorgeirsson, Gudmundur | Thorleifsson, Gudmar | Tomaszewski, Maciej | Uitterlinden, Andre G. | van Rij, Andre M. | Voight, Benjamin F. | Wareham, Nick J. | Wells, George A. | Wichmann, H.-Erich | Wild, Philipp S. | Willenborg, Christina | Witteman, Jaqueline C. M. | Wright, Benjamin J. | Ye, Shu | Zeller, Tanja | Ziegler, Andreas | Cambien, Francois | Goodall, Alison H. | Cupples, L. Adrienne | Quertermous, Thomas | März, Winfried | Hengstenberg, Christian | Blankenberg, Stefan | Ouwehand, Willem H. | Hall, Alistair S. | Deloukas, Panos | Thompson, John R. | Stefansson, Kari | Roberts, Robert | Thorsteinsdottir, Unnur | O’Donnell, Christopher J. | McPherson, Ruth | Erdmann, Jeanette | Samani, Nilesh J.
Nature Genetics  2011;43(4):333-338.
We performed a meta-analysis of 14 genome-wide association studies of coronary artery disease (CAD) comprising 22,233 cases and 64,762 controls of European descent, followed by genotyping of top association signals in 60,738 additional individuals. This genomic analysis identified 13 novel loci harboring one or more SNPs that were associated with CAD at P < 5×10⁻⁸ and confirmed the association of 10 of 12 previously reported CAD loci. The 13 novel loci displayed risk allele frequencies ranging from 0.13 to 0.91 and were associated with a 6 to 17 percent increase in the risk of CAD per allele. Notably, only three of the novel loci displayed significant association with traditional CAD risk factors, while the majority lie in gene regions not previously implicated in the pathogenesis of CAD. Finally, five of the novel CAD risk loci appear to have pleiotropic effects, showing strong association with various other human diseases or traits.
PMCID: PMC3119261  PMID: 21378990
23.  Matrin 3 Binds and Stabilizes mRNA 
PLoS ONE  2011;6(8):e23882.
Matrin 3 (MATR3) is a highly conserved, inner nuclear matrix protein with two zinc finger domains and two RNA recognition motifs (RRM), whose function is largely unknown. Recently we found MATR3 to be phosphorylated by the protein kinase ATM, which activates the cellular response to double strand breaks in the DNA. Here, we show that MATR3 interacts in an RNA-dependent manner with several proteins with established roles in RNA processing, and maintains its interaction with RNA via its RRM2 domain. Deep sequencing of the bound RNA (RIP-seq) identified several small noncoding RNA species. Using microarray analysis to explore MATR3's role in transcription, we identified 77 transcripts whose amounts depended on the presence of MATR3. We validated this finding with nine transcripts which were also bound to the MATR3 complex. Finally, we demonstrated the importance of MATR3 for maintaining the stability of several of these mRNA species and conclude that it has a role in mRNA stabilization. The data suggest that the cellular level of MATR3, known to be highly regulated, modulates the stability of a group of gene transcripts.
PMCID: PMC3157474  PMID: 21858232
24.  Genotyping common and rare variation using overlapping pool sequencing 
BMC Bioinformatics  2011;12(Suppl 6):S2.
Recent advances in sequencing technologies set the stage for large, population-based studies, in which the DNA or RNA of thousands of individuals will be sequenced. Currently, however, such studies are still infeasible using a straightforward sequencing approach; as a result, a few multiplexing schemes have recently been suggested, in which a small number of DNA pools are sequenced, and the results are then deconvoluted using compressed sensing or similar approaches. These methods, however, are limited to the detection of rare variants.
In this paper we provide a new algorithm for the deconvolution of DNA pools multiplexing schemes. The presented algorithm utilizes a likelihood model and linear programming. The approach allows for the addition of external data, particularly imputation data, resulting in a flexible environment that is suitable for different applications.
Particularly, we demonstrate that both low and high allele frequency SNPs can be accurately genotyped when the DNA pooling scheme is performed in conjunction with microarray genotyping and imputation. Additionally, we demonstrate the use of our framework for the detection of cancer fusion genes from RNA sequences.
PMCID: PMC3194190  PMID: 21989232
25.  Design of the Coronary ARtery DIsease Genome-Wide Replication And Meta-Analysis (CARDIoGRAM) Study 
Recent genome-wide association studies (GWAS) of myocardial infarction (MI) and other forms of coronary artery disease (CAD) have led to the discovery of at least 13 genetic loci. In addition to the effect size, power to detect associations is largely driven by sample size. Therefore, to maximize the chance of finding novel susceptibility loci for CAD and MI, the Coronary ARtery DIsease Genome-wide Replication And Meta-analysis (CARDIoGRAM) consortium was formed.
Methods and Results
CARDIoGRAM combines data from all published and several unpublished GWAS in individuals with European ancestry; includes >22 000 cases with CAD, MI, or both and >60 000 controls; and unifies samples from the Atherosclerotic Disease VAscular functioN and genetiC Epidemiology study, CADomics, Cohorts for Heart and Aging Research in Genomic Epidemiology, deCODE, the German Myocardial Infarction Family Studies I, II, and III, Ludwigshafen Risk and Cardiovascular Health Study/AtheroRemo, MedStar, Myocardial Infarction Genetics Consortium, Ottawa Heart Genomics Study, PennCath, and the Wellcome Trust Case Control Consortium. Genotyping was carried out on Affymetrix or Illumina platforms followed by imputation of genotypes in most studies. On average, 2.2 million single nucleotide polymorphisms were generated per study. The results from each study are combined using meta-analysis. As proof of principle, we meta-analyzed risk variants at 9p21 and found that rs1333049 confers a 29% increase in risk for MI per copy (P = 2×10⁻²⁰).
CARDIoGRAM is poised to contribute to our understanding of the role of common genetic variation on risk for CAD and MI.
PMCID: PMC3070269  PMID: 20923989
coronary artery disease; myocardial infarction; meta-analysis; genetics
