Chemokine signals and their cell-surface receptors are important modulators of HIV-1 disease and cancer. To aid future case/control association studies, aim to further characterise the haplotype structure of variation in chemokine and chemokine receptor genes. To perform haplotype analysis in a population-based association study, haplotypes must be determined by estimation, in the absence of family information or laboratory methods to establish phase. Here, test the accuracy of estimates of haplotype frequency and linkage disequilibrium by comparing estimated haplotypes generated with the expectation maximisation (EM) algorithm to haplotypes determined from Centre d'Etude Polymorphisme Humain (CEPH) pedigree data. To do this, they have characterised haplotypes comprising alleles at 11 biallelic loci in four chemokine receptor genes (CCR3, CCR2, CCR5 and CCRL2), which span 150 kb on chromosome 3p21, and haplotyes of nine biallelic loci in six chemokine genes [MCP-1(CCL2), Eotaxin(CCL11), RANTES(CCL5), MPIF-1(CCL23), PARC(CCL18) and MIP-1α(CCL3) ] on chromosome 17q11-12. Forty multi-generation CEPH families, totalling 489 individuals, were genotyped by the TaqMan 5'-nuclease assay. Phased haplotypes and haplotypes estimated from unphased genotypes were compared in 103 grandparents who were assumed to have mated at random.
For the 3p21 single nucleotide polymorphism (SNP) data, haplotypes determined by pedigree analysis and haplotypes generated by the EM algorithm were nearly identical. Linkage disequilibrium, measured by the D' statistic, was nearly maximal across the 150 kb region, with complete disequilibrium maintained at the extremes between CCR3-Y17Y and CCRL2-1243V. D'-values calculated from estimated haplotypes on 3p21 had high concordance with pairwise comparisons between pedigree-phased chromosomes. Conversely, there was less agreement between analyses of haplotype frequencies and linkage disequilibrium using estimated haplotypes when compared with pedigree-phased haplotypes of SNPs on chromosome 17q11-12. These results suggest that, while estimations of haplotype frequency and linkage disequilibrium may be relatively simple in the 3p21 chemokine receptor cluster in population samples, the more complex environment on chromosome 17q11-12 will require a higher resolution haplotype analysis.
chemokine; SNP; haplotype estimation; pedigree analysis; linkage disequilibrium
Haplotype inference based on unphased SNP markers is an important task in population genetics. Although there are different approaches to the inference of haplotypes in diploid species, the existing software is not suitable for inferring haplotypes from unphased SNP data in polyploid species, such as the cultivated potato (Solanum tuberosum). Potato species are tetraploid and highly heterozygous.
Here we present the software SATlotyper which is able to handle polyploid and polyallelic data. SATlo-typer uses the Boolean satisfiability problem to formulate Haplotype Inference by Pure Parsimony. The software excludes existing haplotype inferences, thus allowing for calculation of alternative inferences. As it is not known which of the multiple haplotype inferences are best supported by the given unphased data set, we use a bootstrapping procedure that allows for scoring of alternative inferences. Finally, by means of the bootstrapping scores, it is possible to optimise the phased genotypes belonging to a given haplotype inference. The program is evaluated with simulated and experimental SNP data generated for heterozygous tetraploid populations of potato. We show that, instead of taking the first haplotype inference reported by the program, we can significantly improve the quality of the final result by applying additional methods that include scoring of the alternative haplotype inferences and genotype optimisation. For a sub-population of nineteen individuals, the predicted results computed by SATlotyper were directly compared with results obtained by experimental haplotype inference via sequencing of cloned amplicons. Prediction and experiment gave similar results regarding the inferred haplotypes and phased genotypes.
Our results suggest that Haplotype Inference by Pure Parsimony can be solved efficiently by the SAT approach, even for data sets of unphased SNP from heterozygous polyploids. SATlotyper is freeware and is distributed as a Java JAR file. The software can be downloaded from the webpage of the GABI Primary Database at . The application of SATlotyper will provide haplotype information, which can be used in haplotype association mapping studies of polyploid plants.
Analyses of genetic data at the level of haplotypes provide increased accuracy and power to infer genotype-phenotype correlations and evolutionary history of a locus. However, empirical determination of haplotypes is expensive and laborious. Therefore, several methods of inferring haplotypes from unphased genotypic data have been proposed, but it is unclear how accurate each of the methods is or which methods are superior. The accuracy of some of the leading methods of computational haplotype inference (PL-EM, Phase, SNPHAP, Haplotyper) are compared using a large set of 308 empirically determined haplotypes based on 15 SNPs, among which 36 haplotypes were observed to occur. This study presents several advantages over many previous comparisons of haplotype inference methods: a large number of subjects are included, the number of known haplotypes is much smaller than the number of chromosomes surveyed, a range in values of linkage disequilibrium, presence of rare SNP alleles, and considerable dispersion in the frequencies of haplotypes.
In contrast to some previous comparisons of haplotype inference methods, there was very little difference in the accuracy of the various methods in terms of either assignment of haplotypes to individuals or estimation of haplotype frequencies. Although none of the methods inferred all of the known haplotypes, the assignment of haplotypes to subjects was about 90% correct for individuals heterozygous for up to three SNPs and was about 80% correct for up to five heterozygous sites. All of the methods identified every haplotype with a frequency above 1%, and none assigned a frequency above 1% to an incorrect haplotype.
All of the methods of haplotype inference have high accuracy and one can have confidence in inferences made by any one of the methods. The ability to identify even rare (≥ 1%) haplotypes is reassuring for efforts to identify haplotypes that contribute to disease in a significant proportion of a population. Assignment of haplotypes is relatively accurate among subjects heterozygous for up to 5 sites, and this might be the largest number of SNPs for which one should define haplotype blocks or have confidence in haplotype assignments.
Typically locus specific genotype data do not contain information regarding the gametic phase of haplotypes, especially when an individual is heterozygous at more than one locus among a large number of linked polymorphic loci. Thus, studying disease-haplotype association using unphased genotype data is essentially a problem of handling a missing covariate in a case-control design. There are several methods for estimating a disease-haplotype association parameter in a matched case-control study. Here we propose a conditional likelihood approach for inference regarding the disease-haplotype association using unphased genotype data arising from a matched case-control study design. The proposed method relies on a logistic disease risk model and a Hardy-Weinberg equilibrium (HWE) among the control population only. We develop an expectation and conditional maximization (ECM) algorithm for jointly estimating the haplotype frequency and the disease-haplotype association parameter(s). We apply the proposed method to analyze the data from the Alpha-Tocopherol, Beta-Carotene Cancer prevention study, and a matched case-control study of breast cancer patients conducted in Israel. The performance of the proposed method is evaluated via simulation studies.
The haplotype map constructed by the HapMap Project is a valuable resource in the genetic studies of disease genes, population structure, and evolution. In the Project, Caucasian and African haplotypes are fairly accurately inferred, based mainly on the rules of Mendelian inheritance using the genotypes of trios. However, the Asian haplotypes are inferred from the genotypes of unrelated individuals based on population genetics, and are less accurate. Thus, the effects of this inaccuracy on downstream analyses needs to be assessed. We determined true Japanese haplotypes by genotyping 100 complete hydatidiform moles (CHM), each carrying a genome derived from a single sperm, using Affymetrix 500 K Arrays. We then assessed how inferred haplotypes can differ from true haplotypes, by phasing pseudo-individualized true haplotypes using the programs PHASE, fastPHASE, and Beagle. We found that, at various genomic regions, especially the MHC locus, the expansion of extended haplotype homozygosity (EHH), which is a measure of positive selection, is obscured when inferred Asian haplotype data is used to detect the expansion. We then mapped the genome using a new statistic, XDiHH, which directly detects the difference between the true and inferred haplotypes, in the determination of EHH expansion. We also show that the true haplotype data presented here is useful to assess and improve the accuracy of phasing of Asian genotypes.
Precise haplotype maps are preferred for the performance of a variety of genetic studies including identification of disease-associated loci and dissection of evolutionary mechanisms such as selection and recombination. For diploid organisms, the haplotype information appears as the genotypes when we obtain the information using widely used high-throughput techniques. The process of extracting haplotype information from genotypes is called phasing, which can be accurately done if the genotypes are from related individuals, such as parent–child trios, by considering the constraints imposed by the rules of Mendelian inheritance. For the genotype data without family information, phasing is done by one of the methods that are based on haplotype clustering, and the inferred haplotypes are known to be less accurate. Here, we experimentally determined genome-wide definitive haplotypes using a collection of Japanese complete hydatidiform moles (CHM), each of which carries a genome derived from a single sperm. Using these resources, we asked if the definitive haplotype data can detect long-distance information that has been obscured when we rely solely on the haplotypes inferred by clustering. We also show that by introducing definitive haplotypes as references, inference of haplotypes of unrelated individuals is significantly improved.
The haplotypes of the X chromosome are accessible to direct count in males, whereas the diplotypes of the females may be inferred knowing the haplotype of their sons or fathers. Here, we investigated: 1) the possible large-scale haplotypic structure of the X chromosome in a Caucasian population sample, given the single-nucleotide polymorphism (SNP) maps and genotypes provided by Illumina and Affimetrix for Genetic Analysis Workshop 14, and, 2) the performances of widely used programs in reconstructing haplotypes from population genotypic data, given their known distribution in a sample of unrelated individuals.
All possible unrelated mother-son pairs of Caucasian ancestry (N = 104) were selected from the 143 families of the Collaborative Study on the Genetics of Alcoholism pedigree files, and the diplotypes of the mothers were inferred from the X chromosomes of their sons. The marker set included 313 SNPs at an average density of 0.47 Mb. Linkage disequilibrium between pairs of markers was computed by the parameter D', whereas for measuring multilocus disequilibrium, we developed here an index called D*, and applied it to all possible sliding windows of 5 markers each. Results showed a complex pattern of haplotypic structure, with regions of low linkage disequilibrium separated by regions of high values of D*. The following programs were evaluated for their accuracy in inferring population haplotype frequencies: 1) ARLEQUIN 2.001; 2) PHASE 2.1.1; 3) SNPHAP 1.1; 4) HAPLOBLOCK 1.2; 5) HAPLOTYPER 1.0. Performances were evaluated by Pearson correlation (r) coefficient between the true and the inferred distribution of haplotype frequencies.
The SNP haplotypic structure of the X chromosome is complex, with regions of high haplotype conservation interspersed among regions of higher haplotype diversity. All the tested programs were accurate (r = 1) in reconstructing the distribution of haplotype frequencies in case of high D* values. However, only the program PHASE realized a high correlation coefficient (r > 0.7) in conditions of low linkage disequilibrium.
Association methods based on haplotype similarity (HS) can overcome power and stability issues encountered in standard haplotype analyses. Current HS methods can be generally classified into evolutionary and two-sample approaches. We propose a new regression-based HS association method for case-control studies that incorporates covariate information and combines the advantages of the two classes of approaches by using inferred ancestral haplotypes. We first estimate the ancestral haplotypes of case individuals and then, for each individual, an ancestral-haplotype-based similarity score is computed by comparing that individual’s observed genotype with the estimated ancestral haplotypes. Trait values are then regressed on the similarity scores. Covariates can easily be incorporated into this regression framework. To account for the bias in the raw p-values due to the use of case data in constructing ancestral haplotypes, as well as to account for variation in ancestral haplotype estimation, a permutation procedure is adopted to obtain empirical p-values. Compared with the standard haplotype score test and the multilocus T2 test, our method improves power when neither the allele frequency nor linkage disequilibrium between the disease locus and its neighboring SNPs is too low and is comparable in other scenarios. We applied our method to the Genetic Analysis Workshop 15 simulated SNP data and successfully pinpointed a stretch of SNPs that covers the fine-scale region where the causal locus is located.
case-control studies; haplotype similarity; haplotype sharing; haplotype-based association test; covariates; regression-based association analysis
The completion of the HapMap project has stimulated further development of haplotype-based methodologies for disease associations. A key aspect of such development is the statistical inference of individual diplotypes from unphased genotypes. Several methodologies for inferring haplotypes have been developed, but they have not been evaluated extensively to determine which method not only performs well, but also can be easily incorporated in downstream haplotype-based association analyses. In this paper, we attempt to do so. Our evaluation was carried out by comparing the two leading Bayesian methods, implemented in PHASE and HAPLOTYPER, and the two leading empirical methods, implemented in PL-EM and HPlus. We used these methods to analyze real data, namely the dense genotypes on X-chromosome of 30 European and 30 African trios provided by the International HapMap Project, and simulated genotype data. Our conclusions are based on these analyses.
All programs performed very well on X-chromosome data, with an average similarity index of 0.99 and an average prediction rate of 0.99 for both European and African trios. On simulated data with approximation of coalescence, PHASE implementing the Bayesian method based on the coalescence approximation outperformed other programs on small sample sizes. When the sample size increased, other programs performed as well as PHASE. PL-EM and HPlus implementing empirical methods required much less running time than the programs implementing the Bayesian methods. They required only one hundredth or thousandth of the running time required by PHASE, particularly when analyzing large sample sizes and large umber of SNPs.
For large sample sizes (hundreds or more), which most association studies require, the two empirical methods might be used since they infer the haplotypes as accurately as any Bayesian methods and can be incorporated easily into downstream haplotype-based analyses such as haplotype-association analyses.
Identifying and understanding the impact of gene regulatory variation is of considerable importance in evolutionary and medical genetics; such variants are thought to be responsible for human-specific adaptation  and to have an important role in genetic disease. Regulatory variation in cis is readily detected in individuals showing uneven expression of a transcript from its two allelic copies, an observation referred to as allelic imbalance (AI). Identifying individuals exhibiting AI allows mapping of regulatory DNA regions and the potential to identify the underlying causal genetic variant(s). However, existing mapping methods require knowledge of the haplotypes, which make them sensitive to phasing errors. In this study, we introduce a genotype-based mapping test that does not require haplotype-phase inference to locate regulatory regions. The test relies on partitioning genotypes of individuals exhibiting AI and those not expressing AI in a 2×3 contingency table. The performance of this test to detect linkage disequilibrium (LD) between a potential regulatory site and a SNP located in this region was examined by analyzing the simulated and the empirical AI datasets. In simulation experiments, the genotype-based test outperforms the haplotype-based tests with the increasing distance separating the regulatory region from its regulated transcript. The genotype-based test performed equally well with the experimental AI datasets, either from genome–wide cDNA hybridization arrays or from RNA sequencing. By avoiding the need of haplotype inference, the genotype-based test will suit AI analyses in population samples of unknown haplotype structure and will additionally facilitate the identification of cis-regulatory variants that are located far away from the regulated transcript.
In many contexts, pedigrees for individuals are known even though not all individuals have been fully genotyped. In one extreme case, the genotypes for a set of full siblings are known, with no knowledge of parental genotypes. We propose a method for inferring phased haplotypes and genotypes for all individuals, even those with missing data, in such pedigrees, allowing a multitude of classic and recent methods for linkage and genome analysis to be used more efficiently.
By artificially removing the founder generation genotype data from a well-studied simulated dataset, the quality of reconstructed genotypes in that generation can be verified. For the full structure of repeated matings with 15 offspring per mating, 10 dams per sire, 99.89%
of all founder markers were phased correctly, given only the unphased genotypes for offspring. The accuracy was reduced only slightly, to 99.51%, when introducing a 2% error rate in offspring genotypes. When reduced to only 5 full-sib offspring in a single sire-dam mating, the corresponding percentage is 92.62%, which compares favorably with 89.28%
from the leading Merlin package. Furthermore, Merlin is unable to handle more than approximately 10 sibs, as the number of states tracked rises exponentially with family size, while our approach has no such limit and handles 150 half-sibs with ease in our experiments.
Our method is able to reconstruct genotypes for parents when genotype data is only available for offspring individuals, as well as haplotypes for all individuals. Compared to the Merlin package, we can handle larger pedigrees and produce superior results, mainly due to the fact that Merlin uses the Viterbi algorithm on the state space to infer the genotype sequence. Tracking of haplotype and allele origin can be used in any application where the marker set does not directly influence genotype variation influencing traits. Inference of genotypes can also reduce the effects of genotyping errors and missing data. The cnF2freq codebase implementing our approach is available under a BSD-style license.
Haplotyping; Phasing; Genotype inference; Nuclear family data; Hidden Markov models
Recently, there have been many case-control studies proposed to test for association between haplotypes and disease, which require the Hardy-Weinberg equilibrium (HWE) assumption of haplotype frequencies. As such, haplotype inference of unphased genotypes and development of haplotype-based HWE tests are crucial prior to fine mapping. The goodness-of-fit test is a frequently-used method to test for HWE for multiple tightly-linked loci. However, its degrees of freedom dramatically increase with the increase of the number of loci, which may lack the test power. Therefore, in this paper, to improve the test power for haplotype-based HWE, we first write out two likelihood functions of the observed data based on the Niu's model (NM) and inbreeding model (IM), respectively, which can cause the departure from HWE. Then, we use two expectation-maximization algorithms and one expectation-conditional-maximization algorithm to estimate the model parameters under the HWE, IM and NM models, respectively. Finally, we propose the likelihood ratio tests LRT and LRT for haplotype-based HWE under the NM and IM models, respectively. We simulate the HWE, Niu's, inbreeding and population stratification models to assess the validity and compare the performance of these two LRT tests. The simulation results show that both of the tests control the type I error rates well in testing for haplotype-based HWE. If the NM model is true, then LRT is more powerful. While, if the true model is the IM model, then LRT has better performance in power. Under the population stratification model, LRT is still more powerful. To this end, LRT is generally recommended. Application of the proposed methods to a rheumatoid arthritis data set further illustrates their utility for real data analysis.
To dissect the haplotype structure of candidate genes for disease association studies, it is important to understand the nature of genetic variation at these loci in different populations. We present a survey of haplotype structure and linkage disequilibrium of chemokine and chemokine receptor genes in 11 geographically-distinct population samples (n = 728). Chemokine proteins are involved in intercellular signalling and the immune response. These molecules are important modulators of human immunodeficiency virus (HIV)-1 infection and the progression of the acquired immune deficiency syndrome, tumour development and the metastatic process of cancer. To study the extent of genetic variation in this gene family, single nucleotide polymorphisms (SNPs) from 13 chemokine and chemokine receptor genes were genotyped using the 5' nuclease assay (TaqMan).
SNP haplotypes, estimated from unphased genotypes using the Expectation-Maximization-algorithm, are described in a cluster of four CC-chemokine receptor genes (CCR3, CCR2, CCR5 and CCRL2) on chromosome 3p21, and a cluster of three CC-chemokine genes [MPIF-1 (CCL23) PARC (CCL18) and MIP- 1α (CCL3)] on chromosome 17q11-12. The 32 base pair (bp) deletion in exon 4 of CCR5 was also included in the haplotype analysis of 3p21. A total of 87.5 per cent of the variation of 14 biallelic loci scattered over 150 kilobases of 3p21 is explained by 11 haplotypes which have a frequency of at least 1 per cent in the total sample. An analysis of haplotype blocks in this region indicates recombination between CCR2 and CCR5, although long-range pairwise linkage disequilibrium across the region appears to remain intact on two common haplotypes. A reduced-median network demonstrates a clear relationship between 3p21 haplotypes, rooted by the putative ancestral haplotype determined by direct sequencing of four primate species. Analysis of six SNPs on 17q11-12 indicates that 97.5 per cent of the variation is explained by 15 haplotypes, representing at least 1 per cent of the total sample. Additionally, a possible signature of selection at a non-synonymous coding SNP (M106V) in the MPIF-1 (CCL23) gene warrants further study. We anticipate that the results of this study of chemokine and chemokine receptor variation will be applicable to more extensive surveys of long-range haplotype structure in these gene regions and to association studies of HIV-1 disease and cancer.
chemokine; SNP; population genetics; variation; haplotype estimation; linkage disequilibrium
Multilocus analysis of single nucleotide polymorphism haplotypes is a promising approach to dissecting the genetic basis of complex diseases. We propose a coalescent-based model for association mapping that potentially increases the power to detect disease-susceptibility variants in genetic association studies. The approach uses Bayesian partition modelling to cluster haplotypes with similar disease risks by exploiting evolutionary information. We focus on candidate gene regions with densely spaced markers and model chromosomal segments in high linkage disequilibrium therein assuming a perfect phylogeny. To make this assumption more realistic, we split the chromosomal region of interest into sub-regions or windows of high linkage disequilibrium. The haplotype space is then partitioned into disjoint clusters, within which the phenotype–haplotype association is assumed to be the same. For example, in case-control studies, we expect chromosomal segments bearing the causal variant on a common ancestral background to be more frequent among cases than controls, giving rise to two separate haplotype clusters. The novelty of our approach arises from the fact that the distance used for clustering haplotypes has an evolutionary interpretation, as haplotypes are clustered according to the time to their most recent common ancestor. Our approach is fully Bayesian and we develop a Markov Chain Monte Carlo algorithm to sample efficiently over the space of possible partitions. We compare the proposed approach to both single-marker analyses and recently proposed multi-marker methods and show that the Bayesian partition modelling performs similarly in localizing the causal allele while yielding lower false-positive rates. Also, the method is computationally quicker than other multi-marker approaches. We present an application to real genotype data from the CYP2D6 gene region, which has a confirmed role in drug metabolism, where we succeed in mapping the location of the susceptibility variant within a small error.
Genetic association studies offer great promise in dissecting the genetic contribution to complex diseases. The underlying idea of such studies is to search for genetic variants along the genome that appear to be associated with a trait of interest, e.g., disease status for a binary trait. One then proceeds by genotyping unrelated individuals at several marker sites, searching for positions where single markers or combinations of multiple markers on the paternally and maternally inherited chromosomes (or haplotypes) appear to discriminate among affected and unaffected individuals, flagging genomic regions that may harbour disease susceptibility variants. The statistical analysis of such studies, however, poses several challenges, such as multiplicity and false-positives issue, due to the large number of markers considered. Focusing on case-control studies, we present a novel evolution-based Bayesian partition model that clusters haplotypes with similar disease risks. The novelty of this approach lies in the use of perfect phylogenies, which offers a sensible and computationally efficient approximation of the ancestry of a sample of chromosomes. We show that the incorporation of phylogenetic information leads to low false-positive rates, while our model fitting offers computational advantages over similar recently proposed coalescent-based haplotype clustering methods.
In case-control studies, genetic associations for complex diseases may be probed either with single-locus tests or with haplotype-based tests. Although there are different views on the relative merits and preferences of the two test strategies, haplotype-based analyses are generally believed to be more powerful to detect genes with modest effects. However, a main drawback of haplotype-based association tests is the large number of distinct haplotypes, which increases the degrees of freedom for corresponding test statistics and thus reduces the statistical power. To decrease the degrees of freedom and enhance the efficiency and power of haplotype analysis, we propose an improved haplotype clustering method that is based on the haplotype cladistic analysis developed by Durrant et al. In our method, we attempt to combine the strengths of single-locus analysis and haplotype-based analysis into one single test framework. Novel in our method is that we develop a more informative haplotype similarity measurement by using p-values obtained from single-locus association tests to construct a measure of weight, which to some extent incorporates the information of disease outcomes. The weights are then used in computation of similarity measures to construct distance metrics between haplotype pairs in haplotype cladistic analysis. To assess our proposed new method, we performed simulation analyses to compare the relative performances of (1) conventional haplotype-based analysis using original haplotype, (2) single-locus allele-based analysis, (3) original haplotype cladistic analysis (CLADHC) by Durrant et al., and (4) our weighted haplotype cladistic analysis method, under different scenarios. Our weighted cladistic analysis method shows an increased statistical power and robustness, compared with the methods of haplotype cladistic analysis, single-locus test, and the traditional haplotype-based analyses. The real data analyses also show that our proposed method has practical significance in the human genetics field.
Methods of haplotype-based analysis and single-locus analysis are widely used in genetic association studies. There is no consensus as to the best strategy for the performance of the two methods. Although haplotype-based analysis is a powerful tool, the large number of distinct haplotypes may reduce its efficiency. Haplotype clustering analysis is a promising way of decreasing haplotype dimensionality. A potential limitation of many existing clustering methods is that they do not allow the clustering to adapt to the position of the underlying trait locus. In this study, we proposed a weighted haplotype cladistic analysis method by incorporating a single-locus test into haplotype clustering. Under this framework, relationships between single loci and the disease outcomes can be considered when creating the hierarchical tree of haplotypes. The extensive simulations show that our method is robust against varied simulation conditions and is more powerful than either the original unweighted cladistic analysis method or single-locus analysis methods in case-control studies. Our hybrid method combining haplotype-based and single-locus analyses can be readily extended to whole genome association studies.
Genetic isolates such as the Ashkenazi Jews (AJ) potentially offer advantages in mapping novel loci in whole genome disease association studies. To analyze patterns of genetic variation in AJ, genotypes of 101 healthy individuals were determined using the Affymetrix EAv3 500 K SNP array and compared to 60 CEPH-derived HapMap (CEU) individuals. 435,632 SNPs overlapped and met annotation criteria in the two groups.
A small but significant global difference in allele frequencies between AJ and CEU was demonstrated by a mean FST of 0.009 (P < 0.001); large regions that differed were found on chromosomes 2 and 6. Haplotype blocks inferred from pairwise linkage disequilibrium (LD) statistics (Haploview) as well as by expectation-maximization haplotype phase inference (HAP) showed a greater number of haplotype blocks in AJ compared to CEU by Haploview (50,397 vs. 44,169) or by HAP (59,269 vs. 54,457). Average haplotype blocks were smaller in AJ compared to CEU (e.g., 36.8 kb vs. 40.5 kb HAP). Analysis of global patterns of local LD decay for closely-spaced SNPs in CEU demonstrated more LD, while for SNPs further apart, LD was slightly greater in the AJ. A likelihood ratio approach showed that runs of homozygous SNPs were approximately 20% longer in AJ. A principal components analysis was sufficient to completely resolve the CEU from the AJ.
LD in the AJ versus was lower than expected by some measures and higher by others. Any putative advantage in whole genome association mapping using the AJ population will be highly dependent on regional LD structure.
Recombination events tend to occur in hotspots and vary in number among individuals. The presence of recombination influences the accuracy of haplotype phasing and the imputation of missing genotypes. Genes that influence genome-wide recombination rate have been discovered in mammals, yeast, and plants. Our aim was to investigate the influence of recombination on haplotype phasing, locate recombination hotspots, scan the genome for Quantitative Trait Loci (QTL) and identify candidate genes that influence recombination, and quantify the impact of recombination on the accuracy of genotype imputation in beef cattle.
2775 Angus and 1485 Limousin parent-verified sire/offspring pairs were genotyped with the Illumina BovineSNP50 chip. Haplotype phasing was performed with DAGPHASE and BEAGLE using UMD3.1 assembly SNP (single nucleotide polymorphism) coordinates. Recombination events were detected by comparing the two reconstructed chromosomal haplotypes inherited by each offspring with those of their sires. Expected crossover probabilities were estimated assuming no interference and a binomial distribution for the frequency of crossovers. The BayesB approach for genome-wide association analysis implemented in the GenSel software was used to identify genomic regions harboring QTL with large effects on recombination. BEAGLE was used to impute Angus genotypes from a 7K subset to the 50K chip.
DAGPHASE was superior to BEAGLE in haplotype phasing, which indicates that linkage information from relatives can improve its accuracy. The estimated genetic length of the 29 bovine autosomes was 3097 cM, with a genome-wide recombination distance averaging 1.23 cM/Mb. 427 and 348 windows containing recombination hotspots were detected in Angus and Limousin, respectively, of which 166 were in common. Several significant SNPs and candidate genes, which influence genome-wide recombination were localized in QTL regions detected in the two breeds. High-recombination rates hinder the accuracy of haplotype phasing and genotype imputation.
Small population sizes, inadequate half-sib family sizes, recombination, gene conversion, genotyping errors, and map errors reduce the accuracy of haplotype phasing and genotype imputation. Candidate regions associated with recombination were identified in both breeds. Recombination analysis may improve the accuracy of haplotype phasing and genotype imputation from low- to high-density SNP panels.
Accurate information on haplotypes and diplotypes (haplotype pairs) is required for population-genetic analyses; however, microarrays do not provide data on a haplotype or diplotype at a copy number variation (CNV) locus; they only provide data on the total number of copies over a diplotype or an unphased sequence genotype (e.g., AAB, unlike AB of single nucleotide polymorphism). Moreover, such copy numbers or genotypes are often incorrectly determined when microarray signal intensities derived from different copy numbers or genotypes are not clearly separated due to noise. Here we report an algorithm to infer CNV haplotypes and individuals’ diplotypes at multiple loci from noisy microarray data, utilizing the probability that a signal intensity may be derived from different underlying copy numbers or genotypes. Performing simulation studies based on known diplotypes and an error model obtained from real microarray data, we demonstrate that this probabilistic approach succeeds in accurate inference (error rate: 1–2%) from noisy data, whereas previous deterministic approaches failed (error rate: 12–18%). Applying this algorithm to real microarray data, we estimated haplotype frequencies and diplotypes in 1486 CNV regions for 100 individuals. Our algorithm will facilitate accurate population-genetic analyses and powerful disease association studies of CNVs.
copy number variation; EM algorithm; haplotype inference; phasing
Since the completion of the HapMap project, huge numbers of individual genotypes have been generated from many kinds of laboratories. The efforts of finding or interpreting genetic association between disease and SNPs/haplotypes have been on-going widely. So, the necessity of the capability to analyze huge data and diverse interpretation of the results are growing rapidly.
We have developed an advanced tool to perform linkage disequilibrium analysis, and genetic association analysis between disease and SNPs/haplotypes in an integrated web interface. It comprises of four main analysis modules: (i) data import and preprocessing, (ii) haplotype estimation, (iii) LD blocking and (iv) association analysis. Hardy-Weinberg Equilibrium test is implemented for each SNPs in the data preprocessing. Haplotypes are reconstructed from unphased diploid genotype data, and linkage disequilibrium between pairwise SNPs is computed and represented by D', r2 and LOD score. Tagging SNPs are determined by using the square of Pearson's correlation coefficient (r2). If genotypes from two different sample groups are available, diverse genetic association analyses are implemented using additive, codominant, dominant and recessive models. Multiple verified algorithms and statistics are implemented in parallel for the reliability of the analysis.
SNPAnalyzer 2.0 performs linkage disequilibrium analysis and genetic association analysis in an integrated web interface using multiple verified algorithms and statistics. Diverse analysis methods, capability of handling huge data and visual comparison of analysis results are very comprehensive and easy-to-use.
Because current molecular haplotyping methods are expensive and not amenable to automation, many researchers rely on statistical methods to infer haplotype pairs from multilocus genotypes, and subsequently treat these inferred haplotype pairs as observations. These procedures are prone to haplotype misclassification. We examine the effect of these misclassification errors on the false-positive rate and power for two association tests. These tests include the standard likelihood ratio test (LRTstd) and a likelihood ratio test that employs a double-sampling approach to allow for the misclassification inherent in the haplotype inference procedure (LRTae). We aim to determine the cost–benefit relationship of increasing the proportion of individuals with molecular haplotype measurements in addition to genotypes to raise the power gain of the LRTae over the LRTstd. This analysis should provide a guideline for determining the minimum number of molecular haplotypes required for desired power. Our simulations under the null hypothesis of equal haplotype frequencies in cases and controls indicate that (1) for each statistic, permutation methods maintain the correct type I error; (2) specific multilocus genotypes that are misclassified as the incorrect haplotype pair are consistently misclassified throughout each entire dataset; and (3) our simulations under the alternative hypothesis showed a significant power gain for the LRTae over the LRTstd for a subset of the parameter settings. Permutation methods should be used exclusively to determine significance for each statistic. For fixed cost, the power gain of the LRTae over the LRTstd varied depending on the relative costs of genotyping, molecular haplotyping, and phenotyping. The LRTae showed the greatest benefit over the LRTstd when the cost of phenotyping was very high relative to the cost of genotyping. This situation is likely to occur in a replication study as opposed to a whole-genome association study.
Localizing genes for complex genetic diseases presents a major challenge. Recent technological advances such as genotyping arrays containing hundreds of thousands of genomic “landmarks,” and databases cataloging these “landmarks” and the levels of correlation between them, have aided in these endeavors. To utilize these resources most effectively, many researchers employ a gene-mapping technique called haplotype-based association in order to examine the variation present at multiple genomic sites jointly for a role in and/or an association with the disease state. Although methods that determine haplotype pairs directly by biological assays are currently available, they rarely are used due to their expense and incongruity to automation. Statistical methods provide an inexpensive, relatively accurate means to determine haplotype pairs. However, these statistical methods can provide erroneous results. In this article, the authors compare a standard statistical method for performing a haplotype-based association test with a method that accounts for the misclassification of haplotype pairs as part of the test. Under a number of feasible scenarios, the performance of the new test exceeded that of the standard test.
With the widespread availability of SNP genotype data, there is great interest in analyzing pedigree haplotype data. Intermarker linkage disequilibrium for micro-satellite markers is usually low due to their physical distance; however, for dense maps of SNP markers, there can be strong linkage disequilibrium between marker loci. Linkage analysis (parametric and nonparametric) and family-based association studies are currently being carried out using dense maps of SNP marker loci. Monte Carlo methods are often used for both linkage and association studies; however, to date there are no programs available which can generate haplotype and/or genotype data consisting of a large number of loci for pedigree structures. SimPed is a program that quickly generates haplotype and/or genotype data for pedigrees of virtually any size and complexity. Marker data either in linkage disequilibrium or equilibrium can be generated for greater than 20,000 diallelic or multiallelic marker loci. Haplotypes and/or genotypes are generated for pedigree structures using specified genetic map distances and haplotype and/or allele frequencies. The simulated data generated by SimPed is useful for a variety of purposes, including evaluating methods that estimate haplotype frequencies for pedigree data, evaluating type I error due to intermarker linkage disequilibrium and estimating empirical p values for linkage and family-based association studies.
Simulation; Pedigree structure; Type I error; Empirical p values
Association studies have been widely used to identify genetic liability variants for complex diseases. While scanning the chromosomal region 1 single nucleotide polymorphism (SNP) at a time may not fully explore linkage disequilibrium, haplotype analyses tend to require a fairly large number of parameters, thus potentially losing power. Clustering algorithms, such as the cladistic approach, have been proposed to reduce the dimensionality, yet they have important limitations. We propose a SNP-Haplotype Adaptive REgression (SHARE) algorithm that seeks the most informative set of SNPs for genetic association in a targeted candidate region by growing and shrinking haplotypes with 1 more or less SNP in a stepwise fashion, and comparing prediction errors of different models via cross-validation. Depending on the evolutionary history of the disease mutations and the markers, this set may contain a single SNP or several SNPs that lay a foundation for haplotype analyses. Haplotype phase ambiguity is effectively accounted for by treating haplotype reconstruction as a part of the learning procedure. Simulations and a data application show that our method has improved power over existing methodologies and that the results are informative in the search for disease-causal loci.
Adaptive regression; Haplotype; Multilocus analysis; SNP
With current technology, vast amounts of data can be cheaply and efficiently produced in association studies, and to prevent data analysis to become the bottleneck of studies, fast and efficient analysis methods that scale to such data set sizes must be developed.
We present a fast method for accurate localisation of disease causing variants in high density case-control association mapping experiments with large numbers of cases and controls. The method searches for significant clustering of case chromosomes in the "perfect" phylogenetic tree defined by the largest region around each marker that is compatible with a single phylogenetic tree. This perfect phylogenetic tree is treated as a decision tree for determining disease status, and scored by its accuracy as a decision tree. The rationale for this is that the perfect phylogeny near a disease affecting mutation should provide more information about the affected/unaffected classification than random trees. If regions of compatibility contain few markers, due to e.g. large marker spacing, the algorithm can allow the inclusion of incompatibility markers in order to enlarge the regions prior to estimating their phylogeny. Haplotype data and phased genotype data can be analysed. The power and efficiency of the method is investigated on 1) simulated genotype data under different models of disease determination 2) artificial data sets created from the HapMap ressource, and 3) data sets used for testing of other methods in order to compare with these. Our method has the same accuracy as single marker association (SMA) in the simplest case of a single disease causing mutation and a constant recombination rate. However, when it comes to more complex scenarios of mutation heterogeneity and more complex haplotype structure such as found in the HapMap data our method outperforms SMA as well as other fast, data mining approaches such as HapMiner and Haplotype Pattern Mining (HPM) despite being significantly faster. For unphased genotype data, an initial step of estimating the phase only slightly decreases the power of the method. The method was also found to accurately localise the known susceptibility variants in an empirical data set – the ΔF508 mutation for cystic fibrosis – where the susceptibility variant is already known – and to find significant signals for association between the CYP2D6 gene and poor drug metabolism, although for this dataset the highest association score is about 60 kb from the CYP2D6 gene.
Our method has been implemented in the Blossoc (BLOck aSSOCiation) software. Using Blossoc, genome wide chip-based surveys of 3 million SNPs in 1000 cases and 1000 controls can be analysed in less than two CPU hours.
Maximum parsimony phylogenetic tree reconstruction from genetic variation data is a fundamental problem in computational genetics with many practical applications in population genetics, whole genome analysis, and the search for genetic predictors of disease. Efficient methods are available for reconstruction of maximum parsimony trees from haplotype data, but such data are difficult to determine directly for autosomal DNA. Data more commonly is available in the form of genotypes, which consist of conflated combinations of pairs of haplotypes from homologous chromosomes. Currently, there are no general algorithms for the direct reconstruction of maximum parsimony phylogenies from genotype data. Hence phylogenetic applications for autosomal data must therefore rely on other methods for first computationally inferring haplotypes from genotypes.
In this work, we develop the first practical method for computing maximum parsimony phylogenies directly from genotype data. We show that the standard practice of first inferring haplotypes from genotypes and then reconstructing a phylogeny on the haplotypes often substantially overestimates phylogeny size. As an immediate application, our method can be used to determine the minimum number of mutations required to explain a given set of observed genotypes.
Phylogeny reconstruction directly from unphased data is computationally feasible for moderate-sized problem instances and can lead to substantially more accurate tree size inferences than the standard practice of treating phasing and phylogeny construction as two separate analysis stages. The difference between the approaches is particularly important for downstream applications that require a lower-bound on the number of mutations that the genetic region has undergone.
HLA haplotype analysis has been used in population genetics and in the investigation of disease-susceptibility locus, due to its high polymorphism. Several methods for inferring haplotype genotypic data have been proposed, but it is unclear how accurate each of the methods is or which method is superior. The accuracy of two of the leading methods of computational haplotype inference – Expectation-Maximization algorithm based (implemented in Arlequin V3.0) and Bayesian algorithm based (implemented in PHASE V2.1.1) – was compared using a set of 122 HLA haplotypes (A-B-Cw-DQB1-DRB1) determined through direct counting. The accuracy was measured with the Mean Squared Error (MSE), Similarity Index (IF) and Haplotype Identification Index (IH).
None of the methods inferred all of the known haplotypes and some differences were observed in the accuracy of the two methods in terms of both haplotype determination and haplotype frequencies estimation. Working with haplotypes composed by low polymorphic sites, present in more than one individual, increased the confidence in the assignment of haplotypes and in the estimation of the haplotype frequencies generated by both programs.
The PHASE v2.1.1 implemented method had the best overall performance both in haplotype construction and frequency calculation, although the differences between the two methods were insubstantial. To our knowledge this was the first work aiming to test statistical methods using real haplotypic data from the HLA region.
Numerous studies have attempted to relate genetic polymorphisms within the N-acetyltransferase 2 gene (NAT2) to interindividual differences in response to drugs or in disease susceptibility. However, genotyping of individuals single-nucleotide polymorphisms (SNPs) alone may not always provide enough information to reach these goals. It is important to link SNPs in terms of haplotypes which carry more information about the genotype-phenotype relationship. Special analytical techniques have been designed to unequivocally determine the allocation of mutations to either DNA strand. However, molecular haplotyping methods are labour-intensive and expensive and do not appear to be good candidates for routine clinical applications. A cheap and relatively straightforward alternative is the use of computational algorithms. The objective of this study was to assess the performance of the computational approach in NAT2 haplotype reconstruction from phase-unknown genotype data, for population samples of various ethnic origin.
We empirically evaluated the effectiveness of four haplotyping algorithms in predicting haplotype phases at NAT2, by comparing the results with those directly obtained through molecular haplotyping. All computational methods provided remarkably accurate and reliable estimates for NAT2 haplotype frequencies and individual haplotype phases. The Bayesian algorithm implemented in the PHASE program performed the best.
This investigation provides a solid basis for the confident and rational use of computational methods which appear to be a good alternative to infer haplotype phases in the particular case of the NAT2 gene, where there is near complete linkage disequilibrium between polymorphic markers.