Motivation: Recent advances in genotyping technology has made data acquisition for whole-genome association study cost effective, and a current active area of research is developing efficient methods to analyze such large-scale datasets. Most sophisticated association mapping methods that are currently available take phased haplotype data as input. However, phase information is not readily available from sequencing methods and inferring the phase via computational approaches is time-consuming, taking days to phase a single chromosome.
Results: In this article, we devise an efficient method for scanning unphased whole-genome data for association. Our approach combines a recently found linear-time algorithm for phasing genotypes on trees with a recently proposed tree-based method for association mapping. From unphased genotype data, our algorithm builds local phylogenies along the genome, and scores each tree according to the clustering of cases and controls. We assess the performance of our new method on both simulated and real biological datasets.
Availability The software described in this article is available at http://www.daimi.au.dk/~mailund/Blossoc and distributed under the GNU General Public License.
Maximum parsimony phylogenetic tree reconstruction from genetic variation data is a fundamental problem in computational genetics with many practical applications in population genetics, whole genome analysis, and the search for genetic predictors of disease. Efficient methods are available for reconstruction of maximum parsimony trees from haplotype data, but such data are difficult to determine directly for autosomal DNA. Data more commonly is available in the form of genotypes, which consist of conflated combinations of pairs of haplotypes from homologous chromosomes. Currently, there are no general algorithms for the direct reconstruction of maximum parsimony phylogenies from genotype data. Hence phylogenetic applications for autosomal data must therefore rely on other methods for first computationally inferring haplotypes from genotypes.
In this work, we develop the first practical method for computing maximum parsimony phylogenies directly from genotype data. We show that the standard practice of first inferring haplotypes from genotypes and then reconstructing a phylogeny on the haplotypes often substantially overestimates phylogeny size. As an immediate application, our method can be used to determine the minimum number of mutations required to explain a given set of observed genotypes.
Phylogeny reconstruction directly from unphased data is computationally feasible for moderate-sized problem instances and can lead to substantially more accurate tree size inferences than the standard practice of treating phasing and phylogeny construction as two separate analysis stages. The difference between the approaches is particularly important for downstream applications that require a lower-bound on the number of mutations that the genetic region has undergone.
With current technology, vast amounts of data can be cheaply and efficiently produced in association studies, and to prevent data analysis to become the bottleneck of studies, fast and efficient analysis methods that scale to such data set sizes must be developed.
We present a fast method for accurate localisation of disease causing variants in high density case-control association mapping experiments with large numbers of cases and controls. The method searches for significant clustering of case chromosomes in the "perfect" phylogenetic tree defined by the largest region around each marker that is compatible with a single phylogenetic tree. This perfect phylogenetic tree is treated as a decision tree for determining disease status, and scored by its accuracy as a decision tree. The rationale for this is that the perfect phylogeny near a disease affecting mutation should provide more information about the affected/unaffected classification than random trees. If regions of compatibility contain few markers, due to e.g. large marker spacing, the algorithm can allow the inclusion of incompatibility markers in order to enlarge the regions prior to estimating their phylogeny. Haplotype data and phased genotype data can be analysed. The power and efficiency of the method is investigated on 1) simulated genotype data under different models of disease determination 2) artificial data sets created from the HapMap ressource, and 3) data sets used for testing of other methods in order to compare with these. Our method has the same accuracy as single marker association (SMA) in the simplest case of a single disease causing mutation and a constant recombination rate. However, when it comes to more complex scenarios of mutation heterogeneity and more complex haplotype structure such as found in the HapMap data our method outperforms SMA as well as other fast, data mining approaches such as HapMiner and Haplotype Pattern Mining (HPM) despite being significantly faster. For unphased genotype data, an initial step of estimating the phase only slightly decreases the power of the method. The method was also found to accurately localise the known susceptibility variants in an empirical data set – the ΔF508 mutation for cystic fibrosis – where the susceptibility variant is already known – and to find significant signals for association between the CYP2D6 gene and poor drug metabolism, although for this dataset the highest association score is about 60 kb from the CYP2D6 gene.
Our method has been implemented in the Blossoc (BLOck aSSOCiation) software. Using Blossoc, genome wide chip-based surveys of 3 million SNPs in 1000 cases and 1000 controls can be analysed in less than two CPU hours.
Haplotype inference based on unphased SNP markers is an important task in population genetics. Although there are different approaches to the inference of haplotypes in diploid species, the existing software is not suitable for inferring haplotypes from unphased SNP data in polyploid species, such as the cultivated potato (Solanum tuberosum). Potato species are tetraploid and highly heterozygous.
Here we present the software SATlotyper which is able to handle polyploid and polyallelic data. SATlo-typer uses the Boolean satisfiability problem to formulate Haplotype Inference by Pure Parsimony. The software excludes existing haplotype inferences, thus allowing for calculation of alternative inferences. As it is not known which of the multiple haplotype inferences are best supported by the given unphased data set, we use a bootstrapping procedure that allows for scoring of alternative inferences. Finally, by means of the bootstrapping scores, it is possible to optimise the phased genotypes belonging to a given haplotype inference. The program is evaluated with simulated and experimental SNP data generated for heterozygous tetraploid populations of potato. We show that, instead of taking the first haplotype inference reported by the program, we can significantly improve the quality of the final result by applying additional methods that include scoring of the alternative haplotype inferences and genotype optimisation. For a sub-population of nineteen individuals, the predicted results computed by SATlotyper were directly compared with results obtained by experimental haplotype inference via sequencing of cloned amplicons. Prediction and experiment gave similar results regarding the inferred haplotypes and phased genotypes.
Our results suggest that Haplotype Inference by Pure Parsimony can be solved efficiently by the SAT approach, even for data sets of unphased SNP from heterozygous polyploids. SATlotyper is freeware and is distributed as a Java JAR file. The software can be downloaded from the webpage of the GABI Primary Database at . The application of SATlotyper will provide haplotype information, which can be used in haplotype association mapping studies of polyploid plants.
We recently described a method for linkage disequilibrium (LD) mapping, using cladistic analysis of phased single-nucleotide polymorphism (SNP) haplotypes in a logistic regression framework. However, haplotypes are often not available and cannot be deduced with certainty from the unphased genotypes. One possible two-stage approach is to infer the phase of multilocus genotype data and analyze the resulting haplotypes as if known. Here, haplotypes are inferred using the expectation-maximization (EM) algorithm and the best-guess phase assignment for each individual analyzed. However, inferring haplotypes from phase-unknown data is prone to error and this should be taken into account in the subsequent analysis. An alternative approach is to analyze the phase-unknown multilocus genotypes themselves. Here we present a generalization of the method for phase-known haplotype data to the case of unphased SNP genotypes. Our approach is designed for high-density SNP data, so we opted to analyze the simulated dataset. The marker spacing in the initial screen was too large for our method to be effective, so we used the answers provided to request further data in regions around the disease loci and in null regions. Power to detect the disease loci, accuracy in localizing the true site of the locus, and false-positive error rates are reported for the inferred-haplotype and unphased genotype methods. For this data, analyzing inferred haplotypes outperforms analysis of genotypes. As expected, our results suggest that when there is little or no LD between a disease locus and the flanking region, there will be no chance of detecting it unless the disease variant itself is genotyped.
Haplotypes extracted from human DNA can be used for gene mapping and other analysis of genetic patterns within and across populations. A fundamental problem is, however, that current practical laboratory methods do not give haplotype information. Estimation of phased haplotypes of unrelated individuals given their unphased genotypes is known as the haplotype reconstruction or phasing problem.
We define three novel statistical models and give an efficient algorithm for haplotype reconstruction, jointly called HaploRec. HaploRec is based on exploiting local regularities conserved in haplotypes: it reconstructs haplotypes so that they have maximal local coherence. This approach – not assuming statistical dependence for remotely located markers – has two useful properties: it is well-suited for sparse marker maps, such as those used in gene mapping, and it can actually take advantage of long maps.
Our experimental results with simulated and real data show that HaploRec is a powerful method for the large scale haplotyping needed in association studies. With sample sizes large enough for gene mapping it appeared to be the best compared to all other tested methods (Phase, fastPhase, PL-EM, Snphap, Gerbil; simulated data), with small samples it was competitive with the best available methods (real data). HaploRec is several orders of magnitude faster than Phase and comparable to the other methods; the running times are roughly linear in the number of subjects and the number of markers. HaploRec is publicly available at .
In many contexts, pedigrees for individuals are known even though not all individuals have been fully genotyped. In one extreme case, the genotypes for a set of full siblings are known, with no knowledge of parental genotypes. We propose a method for inferring phased haplotypes and genotypes for all individuals, even those with missing data, in such pedigrees, allowing a multitude of classic and recent methods for linkage and genome analysis to be used more efficiently.
By artificially removing the founder generation genotype data from a well-studied simulated dataset, the quality of reconstructed genotypes in that generation can be verified. For the full structure of repeated matings with 15 offspring per mating, 10 dams per sire, 99.89%
of all founder markers were phased correctly, given only the unphased genotypes for offspring. The accuracy was reduced only slightly, to 99.51%, when introducing a 2% error rate in offspring genotypes. When reduced to only 5 full-sib offspring in a single sire-dam mating, the corresponding percentage is 92.62%, which compares favorably with 89.28%
from the leading Merlin package. Furthermore, Merlin is unable to handle more than approximately 10 sibs, as the number of states tracked rises exponentially with family size, while our approach has no such limit and handles 150 half-sibs with ease in our experiments.
Our method is able to reconstruct genotypes for parents when genotype data is only available for offspring individuals, as well as haplotypes for all individuals. Compared to the Merlin package, we can handle larger pedigrees and produce superior results, mainly due to the fact that Merlin uses the Viterbi algorithm on the state space to infer the genotype sequence. Tracking of haplotype and allele origin can be used in any application where the marker set does not directly influence genotype variation influencing traits. Inference of genotypes can also reduce the effects of genotyping errors and missing data. The cnF2freq codebase implementing our approach is available under a BSD-style license.
Haplotyping; Phasing; Genotype inference; Nuclear family data; Hidden Markov models
Despite the significant advances made over the last few years in mapping inversions with the advent of paired-end sequencing approaches, our understanding of the prevalence and spectrum of inversions in the human genome has lagged behind other types of structural variants, mainly due to the lack of a cost-efficient method applicable to large-scale samples. We propose a novel method based on principal components analysis (PCA) to characterize inversion polymorphisms using high-density SNP genotype data. Our method applies to non-recurrent inversions for which recombination between the inverted and non-inverted segments in inversion heterozygotes is suppressed due to the loss of unbalanced gametes. Inside such an inversion region, an effect similar to population substructure is thus created: two distinct “populations” of inversion homozygotes of different orientations and their 1∶1 admixture, namely the inversion heterozygotes. This kind of substructure can be readily detected by performing PCA locally in the inversion regions. Using simulations, we demonstrated that the proposed method can be used to detect and genotype inversion polymorphisms using unphased genotype data. We applied our method to the phase III HapMap data and inferred the inversion genotypes of known inversion polymorphisms at 8p23.1 and 17q21.31. These inversion genotypes were validated by comparing with literature results and by checking Mendelian consistency using the family data whenever available. Based on the PCA-approach, we also performed a preliminary genome-wide scan for inversions using the HapMap data, which resulted in 2040 candidate inversions, 169 of which overlapped with previously reported inversions. Our method can be readily applied to the abundant SNP data, and is expected to play an important role in developing human genome maps of inversions and exploring associations between inversions and susceptibility of diseases.
Typically locus specific genotype data do not contain information regarding the gametic phase of haplotypes, especially when an individual is heterozygous at more than one locus among a large number of linked polymorphic loci. Thus, studying disease-haplotype association using unphased genotype data is essentially a problem of handling a missing covariate in a case-control design. There are several methods for estimating a disease-haplotype association parameter in a matched case-control study. Here we propose a conditional likelihood approach for inference regarding the disease-haplotype association using unphased genotype data arising from a matched case-control study design. The proposed method relies on a logistic disease risk model and a Hardy-Weinberg equilibrium (HWE) among the control population only. We develop an expectation and conditional maximization (ECM) algorithm for jointly estimating the haplotype frequency and the disease-haplotype association parameter(s). We apply the proposed method to analyze the data from the Alpha-Tocopherol, Beta-Carotene Cancer prevention study, and a matched case-control study of breast cancer patients conducted in Israel. The performance of the proposed method is evaluated via simulation studies.
Inferring the ancestral dynamics of effective population size is a long-standing question in population genetics, which can now be tackled much more accurately thanks to the massive genomic data available in many species. Several promising methods that take advantage of whole-genome sequences have been recently developed in this context. However, they can only be applied to rather small samples, which limits their ability to estimate recent population size history. Besides, they can be very sensitive to sequencing or phasing errors. Here we introduce a new approximate Bayesian computation approach named PopSizeABC that allows estimating the evolution of the effective population size through time, using a large sample of complete genomes. This sample is summarized using the folded allele frequency spectrum and the average zygotic linkage disequilibrium at different bins of physical distance, two classes of statistics that are widely used in population genetics and can be easily computed from unphased and unpolarized SNP data. Our approach provides accurate estimations of past population sizes, from the very first generations before present back to the expected time to the most recent common ancestor of the sample, as shown by simulations under a wide range of demographic scenarios. When applied to samples of 15 or 25 complete genomes in four cattle breeds (Angus, Fleckvieh, Holstein and Jersey), PopSizeABC revealed a series of population declines, related to historical events such as domestication or modern breed creation. We further highlight that our approach is robust to sequencing errors, provided summary statistics are computed from SNPs with common alleles.
Molecular data sampled from extant individuals contains considerable information about their demographic history. In particular, one classical question in population genetics is to reconstruct past population size changes from such data. Relating these changes to various climatic, geological or anthropogenic events allows characterizing the main factors driving genetic diversity and can have major outcomes for conservation. Until recently, mostly very simple histories, including one or two population size changes, could be estimated from genetic data. This has changed with the sequencing of entire genomes in many species, and several methods allow now inferring complex histories consisting of several tens of population size changes. However, analyzing entire genomes, while accounting for recombination, remains a statistical and numerical challenge. These methods, therefore, can only be applied to small samples with a few diploid genomes. We overcome this limitation by using an approximate estimation approach, where observed genomes are summarized using a small number of statistics related to allele frequencies and linkage disequilibrium. In contrast to previous approaches, we show that our method allows us to reconstruct also the most recent part (the last 100 generations) of the population size history. As an illustration, we apply it to large samples of whole-genome sequences in four cattle breeds.
Clonal expansion is a process in which a single organism reproduces asexually, giving rise to a diversifying population. It is pervasive in nature, from within-host pathogen evolution to emergent infectious disease outbreaks. Standard phylogenetic tools rely on full-length genomes of individual pathogens or population consensus sequences (phased genotypes).
Although high-throughput sequencing technologies are able to sample population diversity, the short sequence reads inherent to them preclude assessing whether two reads originate from the same clone (unphased genotypes). This obstacle severely limits the application of phylogenetic methods and investigation of within-host dynamics of acute infections using this rich data source.
We introduce two measures of diversity to study the evolution of clonal populations using unphased genomic data, which eliminate the need to construct full-length genomes. Our method follows a maximum likelihood approach to estimate evolutionary rates and times to the most recent common ancestor, based on a relaxed molecular clock model; independent of a growth model. Deviations from neutral evolution indicate the presence of selection and bottleneck events.
We evaluated our methods in silico and then compared it against existing approaches with the well-characterized 2009 H1N1 influenza pandemic. We then applied our method to high-throughput genomic data from marburgvirus-infected non-human primates and inferred the time of infection and the intra-host evolutionary rate, and identified purifying selection in viral populations.
Our method has the power to make use of minor variants present in less than 1% of the population and capture genomic diversification within days of infection, making it an ideal tool for the study of acute RNA viral infection dynamics.
Clonal evolution; Evolutionary dynamics; Viral genomic diversity; Marburgvirus
To understand individual genomes it is necessary to look at the variations that lead to changes in phenotype and possibly to disease. However, genotype information alone is often not sufficient and additional knowledge regarding the phase of the variation is needed to make correct interpretations. Interactive visualizations, that allow the user to explore the data in various ways, can be of great assistance in the process of making well informed decisions. But, currently there is a lack for visualizations that are able to deal with phased haplotype data.
We present inPHAP, an interactive visualization tool for genotype and phased haplotype data. inPHAP features a variety of interaction possibilities such as zooming, sorting, filtering and aggregation of rows in order to explore patterns hidden in large genetic data sets. As a proof of concept, we apply inPHAP to the phased haplotype data set of Phase 1 of the 1000 Genomes Project. Thereby, inPHAP’s ability to show genetic variations on the population as well as on the individuals level is demonstrated for several disease related loci.
As of today, inPHAP is the only visual analytical tool that allows the user to explore unphased and phased haplotype data interactively. Due to its highly scalable design, inPHAP can be applied to large datasets with up to 100 GB of data, enabling users to visualize even large scale input data. inPHAP closes the gap between common visualization tools for unphased genotype data and introduces several new features, such as the visualization of phased data. inPHAP is available for download at http://bit.ly/1iJgKmX.
Genotype data; Phased haplotype data; Interactive visualization; 1000 genomes project
Motivation: Haplotypes, defined as the sequence of alleles on one chromosome, are crucial for many genetic analyses. As experimental determination of haplotypes is extremely expensive, haplotypes are traditionally inferred using computational approaches from genotype data, i.e. the mixture of the genetic information from both haplotypes. Best performing approaches for haplotype inference rely on Hidden Markov Models, with the underlying assumption that the haplotypes of a given individual can be represented as a mosaic of segments from other haplotypes in the same population. Such algorithms use this model to predict the most likely haplotypes that explain the observed genotype data conditional on reference panel of haplotypes. With rapid advances in short read sequencing technologies, sequencing is quickly establishing as a powerful approach for collecting genetic variation information. As opposed to traditional genotyping-array technologies that independently call genotypes at polymorphic sites, short read sequencing often collects haplotypic information; a read spanning more than one polymorphic locus (multi-single nucleotide polymorphic read) contains information on the haplotype from which the read originates. However, this information is generally ignored in existing approaches for haplotype phasing and genotype-calling from short read data.
Results: In this article, we propose a novel framework for haplotype inference from short read sequencing that leverages multi-single nucleotide polymorphic reads together with a reference panel of haplotypes. The basis of our approach is a new probabilistic model that finds the most likely haplotype segments from the reference panel to explain the short read sequencing data for a given individual. We devised an efficient sampling method within a probabilistic model to achieve superior performance than existing methods. Using simulated sequencing reads from real individual genotypes in the HapMap data and the 1000 Genomes projects, we show that our method is highly accurate and computationally efficient. Our haplotype predictions improve accuracy over the basic haplotype copying model by ∼20% with comparable computational time, and over another recently proposed approach Hap-SeqX by ∼10% with significantly reduced computational time and memory usage.
Availability: Publicly available software is available at http://genetics.cs.ucla.edu/harsh
firstname.lastname@example.org or email@example.com
The use of genetic data to reconstruct the transmission tree of infectious disease epidemics and outbreaks has been the subject of an increasing number of studies, but previous approaches have usually either made assumptions that are not fully compatible with phylogenetic inference, or, where they have based inference on a phylogeny, have employed a procedure that requires this tree to be fixed. At the same time, the coalescent-based models of the pathogen population that are employed in the methods usually used for time-resolved phylogeny reconstruction are a considerable simplification of epidemic process, as they assume that pathogen lineages mix freely. Here, we contribute a new method that is simultaneously a phylogeny reconstruction method for isolates taken from an epidemic, and a procedure for transmission tree reconstruction. We observe that, if one or more samples is taken from each host in an epidemic or outbreak and these are used to build a phylogeny, a transmission tree is equivalent to a partition of the set of nodes of this phylogeny, such that each partition element is a set of nodes that is connected in the full tree and contains all the tips corresponding to samples taken from one and only one host. We then implement a Monte Carlo Markov Chain (MCMC) procedure for simultaneous sampling from the spaces of both trees, utilising a newly-designed set of phylogenetic tree proposals that also respect node partitions. We calculate the posterior probability of these partitioned trees based on a model that acknowledges the population structure of an epidemic by employing an individual-based disease transmission model and a coalescent process taking place within each host. We demonstrate our method, first using simulated data, and then with sequences taken from the H7N7 avian influenza outbreak that occurred in the Netherlands in 2003. We show that it is superior to established coalescent methods for reconstructing the topology and node heights of the phylogeny and performs well for transmission tree reconstruction when the phylogeny is well-resolved by the genetic data, but caution that this will often not be the case in practice and that existing genetic and epidemiological data should be used to configure such analyses whenever possible. This method is available for use by the research community as part of BEAST, one of the most widely-used packages for reconstruction of dated phylogenies.
With sequence data becoming available in increasing high volumes and at decreasing costs, there has been substantial recent interest in the possibility of using pathogen genome sequences as a means to retrace the spread of disease amongst the infected hosts in an epidemic. While several such methods exist, many of them are not fully compatible with phylogenetic inference, which is the most commonly-used methodology for exploring the ancestry of the isolates represented by a set of sequences. Procedures using phylogenetics as a basis have either taken a single, fixed phylogenetic tree as input, or have been quite narrow in scope and not available in any current package for general use. For their part, standard phylogenetic methods usually assume a model of the pathogen population that is overly simplistic for the situation in an epidemic. Here, we bridge the gap by introducing a new, highly flexible method, implemented in the publicly-available BEAST package, which simultaneously reconstructs the transmission history of an epidemic and the phylogeny for samples taken from it. We apply the procedure to simulated data and to sequences from the 2003 H7N7 avian influenza outbreak in the Netherlands.
Whole genome association studies (WGAS) have surged in popularity in recent years as technological advances have made large-scale genotyping more feasible and as new exciting results offer tremendous hope and optimism. The logic of WGAS rests upon the common disease/common variant (CD/CV) hypothesis. Detection of association under the common disease/rare variant (CD/RV) scenario is much harder, and the current practices of WGAS may be under-power without large enough sample sizes. In this paper, we propose a generalized linear model with regularization (rGLM) approach for detecting disease-haplotype association using unphased single nucleotide polymorphisms data that is applicable to both CD/CV and CD/RV scenarios. We borrow a dimension-reduction method from the data mining and statistical learning literature, but use it for the purpose of weeding out haplotypes that are not associated with the disease so that the associated haplotypes, especially those that are rare, can stand out and be accounted for more precisely. By using high-dimensional data analysis techniques, which are frequently employed in microarray analyses, interacting effects among haplotypes in different blocks can be investigated without much concern about the sample size being overwhelmed by the number of haplotype combinations. Our simulation study demonstrates the gain in power for detecting associations with moderate sample sizes. For detecting association under CD/RV, regression type methods such as that implemented in hapassoc may fail to provide coefficient estimates for rare associated haplotypes, resulting in a loss of power compared to rGLM. Furthermore, our results indicate that rGLM can uncover the associated variants much more frequently than can hapassoc.
whole genome association study; interacting effects between haplotype blocks; dimension reduction; regularization/LASSO; case-control design
The completion of the HapMap project has stimulated further development of haplotype-based methodologies for disease associations. A key aspect of such development is the statistical inference of individual diplotypes from unphased genotypes. Several methodologies for inferring haplotypes have been developed, but they have not been evaluated extensively to determine which method not only performs well, but also can be easily incorporated in downstream haplotype-based association analyses. In this paper, we attempt to do so. Our evaluation was carried out by comparing the two leading Bayesian methods, implemented in PHASE and HAPLOTYPER, and the two leading empirical methods, implemented in PL-EM and HPlus. We used these methods to analyze real data, namely the dense genotypes on X-chromosome of 30 European and 30 African trios provided by the International HapMap Project, and simulated genotype data. Our conclusions are based on these analyses.
All programs performed very well on X-chromosome data, with an average similarity index of 0.99 and an average prediction rate of 0.99 for both European and African trios. On simulated data with approximation of coalescence, PHASE implementing the Bayesian method based on the coalescence approximation outperformed other programs on small sample sizes. When the sample size increased, other programs performed as well as PHASE. PL-EM and HPlus implementing empirical methods required much less running time than the programs implementing the Bayesian methods. They required only one hundredth or thousandth of the running time required by PHASE, particularly when analyzing large sample sizes and large umber of SNPs.
For large sample sizes (hundreds or more), which most association studies require, the two empirical methods might be used since they infer the haplotypes as accurately as any Bayesian methods and can be incorporated easily into downstream haplotype-based analyses such as haplotype-association analyses.
Recently, there have been many case-control studies proposed to test for association between haplotypes and disease, which require the Hardy-Weinberg equilibrium (HWE) assumption of haplotype frequencies. As such, haplotype inference of unphased genotypes and development of haplotype-based HWE tests are crucial prior to fine mapping. The goodness-of-fit test is a frequently-used method to test for HWE for multiple tightly-linked loci. However, its degrees of freedom dramatically increase with the increase of the number of loci, which may lack the test power. Therefore, in this paper, to improve the test power for haplotype-based HWE, we first write out two likelihood functions of the observed data based on the Niu's model (NM) and inbreeding model (IM), respectively, which can cause the departure from HWE. Then, we use two expectation-maximization algorithms and one expectation-conditional-maximization algorithm to estimate the model parameters under the HWE, IM and NM models, respectively. Finally, we propose the likelihood ratio tests LRT and LRT for haplotype-based HWE under the NM and IM models, respectively. We simulate the HWE, Niu's, inbreeding and population stratification models to assess the validity and compare the performance of these two LRT tests. The simulation results show that both of the tests control the type I error rates well in testing for haplotype-based HWE. If the NM model is true, then LRT is more powerful. While, if the true model is the IM model, then LRT has better performance in power. Under the population stratification model, LRT is still more powerful. To this end, LRT is generally recommended. Application of the proposed methods to a rheumatoid arthritis data set further illustrates their utility for real data analysis.
The genetic association analysis using haplotypes as basic genetic units is anticipated to be a powerful strategy towards the discovery of genes predisposing human complex diseases. In particular, the increasing availability of high-resolution genetic markers such as the single-nucleotide polymorphisms (SNPs) has made haplotype-based association analysis an attractive alternative to single marker analysis.
We consider haplotype association analysis under the population-based case-control study design. A multinomial logistic model is proposed for haplotype analysis with unphased genotype data, which can be decomposed into a prospective logistic model for disease risk as well as a model for the haplotype-pair distribution in the control population. Environmental factors can be readily incorporated and hence the haplotype-environment interaction can be assessed in the proposed model. The maximum likelihood estimation with unphased genotype data can be conveniently implemented in the proposed model by applying the EM algorithm to a prospective multinomial logistic regression model and ignoring the case-control design. We apply the proposed method to the hypertriglyceridemia study and identifies 3 haplotypes in the apolipoprotein A5 gene that are associated with increased risk for hypertriglyceridemia. A haplotype-age interaction effect is also identified. Simulation studies show that the proposed estimator has satisfactory finite-sample performances.
Our results suggest that the proposed method can serve as a useful alternative to existing methods and a reliable tool for the case-control haplotype-based association analysis.
Segments of genome inherited from a common ancestor by related individuals are said to be identical by descent (IBD). Modern genetic marker data provide information to infer such segments among multiple related members of a population, even when pedigree relationships are unknown. Previous methods have been proposed for detection of pairwise IBD, but computation of probabilities of trait data under many trait models requires an IBD estimate jointly consistent among individuals and slowly varying across genome locations; we refer to such an estimate as an IBD graph. In this paper, we develop a novel method that builds IBD graphs sequentially among related individuals from a population sample using either phased or unphased genetic marker data. We show how IBD graphs realized conditionally on marker data provide a form of linkage mapping score, analogous to a LOD score, and propose a permutation approach to normalize this mapping score. Using a simulated quantitative trait dependent on the (unobserved) genotype at a major locus, we apply the approach to two samples containing both closely and remotely related individuals, among whom there are complex patterns of IBD. We compare the results of our approach with an alternate approach based on estimation of local kinship. We show that pairwise estimates derived from a joint IBD graph give significant improvements in LOD score estimation over estimates derived from an intrinsically pairwise approach.
identity by descent (IBD); detecting IBD segments; joint IBD constraints; sampling joint IBD graphs; likelihood; linkage mapping
Many common inference problems in computational genetics depend on inferring aspects of the evolutionary history of a data set given a set of observed modern sequences. Detailed predictions of the full phylogenies are therefore of value in improving our ability to make further inferences about population history and sources of genetic variation. Making phylogenetic predictions on the scale needed for whole-genome analysis is, however, extremely computationally demanding.
In order to facilitate phylogeny-based predictions on a genomic scale, we develop a library of maximum parsimony phylogenies within local regions spanning all autosomal human chromosomes based on Haplotype Map variation data. We demonstrate the utility of this library for population genetic inferences by examining a tree statistic we call 'imperfection,' which measures the reuse of variant sites within a phylogeny. This statistic is significantly predictive of recombination rate, shows additional regional and population-specific conservation, and allows us to identify outlier genes likely to have experienced unusual amounts of variation in recent human history.
Recent theoretical advances in algorithms for phylogenetic tree reconstruction have made it possible to perform large-scale inferences of local maximum parsimony phylogenies from single nucleotide polymorphism (SNP) data. As results from the imperfection statistic demonstrate, phylogeny predictions encode substantial information useful for detecting genomic features and population history. This data set should serve as a platform for many kinds of inferences one may wish to make about human population history and genetic variation.
With the advances in high-throughput genotyping technology, the study of quantitative trait loci (QTL) has emerged as a promising tool to understand the genetic basis of complex traits. Methodology development for the study of QTL recently has attracted significant research attention. Local phylogeny-based methods have been demonstrated to be powerful tools for uncovering significant associations between phenotypes and single-nucleotide polymorphism markers. However, most existing methods are designed for homozygous genotypes, and a separate haplotype reconstruction step is often needed to resolve heterozygous genotypes. This approach has limited power to detect nonadditive genetic effects and imposes an extensive computational burden. In this article, we propose a new method, HTreeQA, that uses a tristate semi-perfect phylogeny tree to approximate the perfect phylogeny used in existing methods. The semi-perfect phylogeny trees are used as high-level markers for association study. HTreeQA uses the genotype data as direct input without phasing. HTreeQA can handle complex local population structures. It is suitable for QTL mapping on any mouse populations, including the incipient Collaborative Cross lines. Applied HTreeQA, significant QTLs are found for two phenotypes of the PreCC lines, white head spot and running distance at day 5/6. These findings are consistent with known genes and QTL discovered in independent studies. Simulation studies under three different genetic models show that HTreeQA can detect a wider range of genetic effects and is more efficient than existing phylogeny-based approaches. We also provide rigorous theoretical analysis to show that HTreeQA has a lower error rate than alternative methods.
phylogeny; quantitative trait loci (QTL); Mouse Collaborative Cross; Mouse Genetic Resource
Because current molecular haplotyping methods are expensive and not amenable to automation, many researchers rely on statistical methods to infer haplotype pairs from multilocus genotypes, and subsequently treat these inferred haplotype pairs as observations. These procedures are prone to haplotype misclassification. We examine the effect of these misclassification errors on the false-positive rate and power for two association tests. These tests include the standard likelihood ratio test (LRTstd) and a likelihood ratio test that employs a double-sampling approach to allow for the misclassification inherent in the haplotype inference procedure (LRTae). We aim to determine the cost–benefit relationship of increasing the proportion of individuals with molecular haplotype measurements in addition to genotypes to raise the power gain of the LRTae over the LRTstd. This analysis should provide a guideline for determining the minimum number of molecular haplotypes required for desired power. Our simulations under the null hypothesis of equal haplotype frequencies in cases and controls indicate that (1) for each statistic, permutation methods maintain the correct type I error; (2) specific multilocus genotypes that are misclassified as the incorrect haplotype pair are consistently misclassified throughout each entire dataset; and (3) our simulations under the alternative hypothesis showed a significant power gain for the LRTae over the LRTstd for a subset of the parameter settings. Permutation methods should be used exclusively to determine significance for each statistic. For fixed cost, the power gain of the LRTae over the LRTstd varied depending on the relative costs of genotyping, molecular haplotyping, and phenotyping. The LRTae showed the greatest benefit over the LRTstd when the cost of phenotyping was very high relative to the cost of genotyping. This situation is likely to occur in a replication study as opposed to a whole-genome association study.
Localizing genes for complex genetic diseases presents a major challenge. Recent technological advances such as genotyping arrays containing hundreds of thousands of genomic “landmarks,” and databases cataloging these “landmarks” and the levels of correlation between them, have aided in these endeavors. To utilize these resources most effectively, many researchers employ a gene-mapping technique called haplotype-based association in order to examine the variation present at multiple genomic sites jointly for a role in and/or an association with the disease state. Although methods that determine haplotype pairs directly by biological assays are currently available, they rarely are used due to their expense and incongruity to automation. Statistical methods provide an inexpensive, relatively accurate means to determine haplotype pairs. However, these statistical methods can provide erroneous results. In this article, the authors compare a standard statistical method for performing a haplotype-based association test with a method that accounts for the misclassification of haplotype pairs as part of the test. Under a number of feasible scenarios, the performance of the new test exceeded that of the standard test.
Multilocus analysis of single nucleotide polymorphism haplotypes is a promising approach to dissecting the genetic basis of complex diseases. We propose a coalescent-based model for association mapping that potentially increases the power to detect disease-susceptibility variants in genetic association studies. The approach uses Bayesian partition modelling to cluster haplotypes with similar disease risks by exploiting evolutionary information. We focus on candidate gene regions with densely spaced markers and model chromosomal segments in high linkage disequilibrium therein assuming a perfect phylogeny. To make this assumption more realistic, we split the chromosomal region of interest into sub-regions or windows of high linkage disequilibrium. The haplotype space is then partitioned into disjoint clusters, within which the phenotype–haplotype association is assumed to be the same. For example, in case-control studies, we expect chromosomal segments bearing the causal variant on a common ancestral background to be more frequent among cases than controls, giving rise to two separate haplotype clusters. The novelty of our approach arises from the fact that the distance used for clustering haplotypes has an evolutionary interpretation, as haplotypes are clustered according to the time to their most recent common ancestor. Our approach is fully Bayesian and we develop a Markov Chain Monte Carlo algorithm to sample efficiently over the space of possible partitions. We compare the proposed approach to both single-marker analyses and recently proposed multi-marker methods and show that the Bayesian partition modelling performs similarly in localizing the causal allele while yielding lower false-positive rates. Also, the method is computationally quicker than other multi-marker approaches. We present an application to real genotype data from the CYP2D6 gene region, which has a confirmed role in drug metabolism, where we succeed in mapping the location of the susceptibility variant within a small error.
Genetic association studies offer great promise in dissecting the genetic contribution to complex diseases. The underlying idea of such studies is to search for genetic variants along the genome that appear to be associated with a trait of interest, e.g., disease status for a binary trait. One then proceeds by genotyping unrelated individuals at several marker sites, searching for positions where single markers or combinations of multiple markers on the paternally and maternally inherited chromosomes (or haplotypes) appear to discriminate among affected and unaffected individuals, flagging genomic regions that may harbour disease susceptibility variants. The statistical analysis of such studies, however, poses several challenges, such as multiplicity and false-positives issue, due to the large number of markers considered. Focusing on case-control studies, we present a novel evolution-based Bayesian partition model that clusters haplotypes with similar disease risks. The novelty of this approach lies in the use of perfect phylogenies, which offers a sensible and computationally efficient approximation of the ancestry of a sample of chromosomes. We show that the incorporation of phylogenetic information leads to low false-positive rates, while our model fitting offers computational advantages over similar recently proposed coalescent-based haplotype clustering methods.
Numerous immune-mediated diseases have been associated with the class I and II HLA genes located within the major histocompatibility complex (MHC) consisting of highly polymorphic alleles encoded by the HLA-A, -B, -C, -DRB1, -DQB1 and -DPB1 loci. Genotyping for HLA alleles is complex and relatively expensive. Recent studies have demonstrated the feasibility of predicting HLA alleles, using MHC SNPs inside and outside of HLA that are typically included in SNP arrays and are commonly available in genome-wide association studies (GWAS). We have recently described a novel method that is complementary to the previous methods, for accurately predicting HLA alleles using unphased flanking SNPs genotypes. In this manuscript, we address several practical issues relevant to the application of this methodology.
Applying this new methodology to three large independent study cohorts, we have evaluated the performance of the predictive models in ethnically diverse populations. Specifically, we have found that utilizing imputed in addition to genotyped SNPs generally yields comparable if not better performance in prediction accuracies. Our evaluation also supports the idea that predictive models trained on one population are transferable to other populations of the same ethnicity. Further, when the training set includes multi-ethnic populations, the resulting models are reliable and perform well for the same subpopulations across all HLA genes. In contrast, the predictive models built from single ethnic populations have superior performance within the same ethnic population, but are not likely to perform well in other ethnic populations.
The empirical explorations reported here provide further evidence in support of the application of this approach for predicting HLA alleles with GWAS-derived SNP data. Utilizing all available samples, we have built "state of the art" predictive models for HLA-A, -B, -C, -DRB1, -DQB1 and -DPB1. The HLA allele predictive models, along with the program used to carry out the prediction, are available on our website.
Using molecular data generated by high throughput next generation sequencing (NGS) platforms to infer phylogeny is becoming common as costs go down and the ability to capture loci from across the genome goes up. While there is a general consensus that greater numbers of independent loci should result in more robust phylogenetic estimates, few studies have compared phylogenies resulting from smaller datasets for commonly used genetic markers with the large datasets captured using NGS. Here, we determine how a 5-locus Sanger dataset compares with a 377-locus anchored genomics dataset for understanding the evolutionary history of the pseudoxyrhophiine snake radiation centered in Madagascar. The Pseudoxyrhophiinae comprise ~86 % of Madagascar’s serpent diversity, yet they are poorly known with respect to ecology, behavior, and systematics. Using the 377-locus NGS dataset and the summary statistics species-tree methods STAR and MP-EST, we estimated a well-supported species tree that provides new insights concerning intergeneric relationships for the pseudoxyrhophiines. We also compared how these and other methods performed with respect to estimating tree topology using datasets with varying numbers of loci.
Using Sanger sequencing and an anchored phylogenomics approach, we sequenced datasets comprised of 5 and 377 loci, respectively, for 23 pseudoxyrhophiine taxa. For each dataset, we estimated phylogenies using both gene-tree (concatenation) and species-tree (STAR, MP-EST) approaches. We determined the similarity of resulting tree topologies from the different datasets using Robinson-Foulds distances. In addition, we examined how subsets of these data performed compared to the complete Sanger and anchored datasets for phylogenetic accuracy using the same tree inference methodologies, as well as the program *BEAST to determine if a full coalescent model for species tree estimation could generate robust results with fewer loci compared to the summary statistics species tree approaches. We also examined the individual gene trees in comparison to the 377-locus species tree using the program MetaTree.
Using the full anchored dataset under a variety of methods gave us the same, well-supported phylogeny for pseudoxyrhophiines. The African pseudoxyrhophiine Duberria is the sister taxon to the Malagasy pseudoxyrhophiines genera, providing evidence for a monophyletic radiation in Madagascar. In addition, within Madagascar, the two major clades inferred correspond largely to the aglyphous and opisthoglyphous genera, suggesting that feeding specializations associated with tooth venom delivery may have played a major role in the early diversification of this radiation. The comparison of tree topologies from the concatenated and species-tree methods using different datasets indicated the 5-locus dataset cannot beused to infer a correct phylogeny for the pseudoxyrhophiines under any method tested here and that summary statistics methods require 50 or more loci to consistently recover the species-tree inferred using the complete anchored dataset. However, as few as 15 loci may infer the correct topology when using the full coalescent species tree method *BEAST. MetaTree analyses of each gene tree from the Sanger and anchored datasets found that none of the individual gene trees matched the 377-locus species tree, and that no gene trees were identical with respect to topology.
Our results suggest that ≥50 loci may be necessary to confidently infer phylogenies when using summaryspecies-tree methods, but that the coalescent-based method *BEAST consistently recovers the same topology using only 15 loci. These results reinforce that datasets with small numbers of markers may result in misleading topologies, and further, that the method of inference used to generate a phylogeny also has a major influence on the number of loci necessary to infer robust species trees.
Electronic supplementary material
The online version of this article (doi:10.1186/s12862-015-0503-1) contains supplementary material, which is available to authorized users.
Madagascar; Anchored phylogenomics; Lamprophiidae; Next-generation sequencing