Very rapid and inexpensive methods exist for determining the genotype of diploid organisms at single nucleotide polymorphisms (SNPs). Unfortunately, these high-throughput methods do not provide direct information on which SNP alleles at multiple sites coexist on the same chromosome. Instead, computational methods must be employed to infer haplotypes, the sets of SNP alleles that cosegregate on a single chromosome. However, the inference of haplotypes from phase-unknown data is computationally difficult, partly because the number of possible haplotypes roughly doubles with each additional SNP.
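The growth in the number of candidate haplotypes can be made concrete with a minimal sketch. The 0/1/2 genotype encoding (copies of the alternate allele at each SNP) and the function name `compatible_pairs` are assumptions introduced here for illustration only; they are not taken from any of the programs discussed in this paper.

```python
from itertools import product


def compatible_pairs(genotype):
    """Return all unordered haplotype pairs consistent with an unphased genotype.

    genotype: sequence of 0, 1, or 2, the number of copies of the alternate
    allele at each SNP (a hypothetical encoding chosen for this sketch).
    """
    het_sites = [i for i, g in enumerate(genotype) if g == 1]
    pairs = set()
    for phase in product((0, 1), repeat=len(het_sites)):
        # Homozygous sites are fixed: genotype 0 -> allele 0, genotype 2 -> allele 1.
        hap_a = [g // 2 for g in genotype]
        hap_b = list(hap_a)
        # Heterozygous sites carry opposite alleles on the two chromosomes.
        for site, allele in zip(het_sites, phase):
            hap_a[site] = allele
            hap_b[site] = 1 - allele
        # Sorting the pair makes (a, b) and (b, a) count as one resolution.
        pairs.add(tuple(sorted((tuple(hap_a), tuple(hap_b)))))
    return sorted(pairs)


if __name__ == "__main__":
    for genotype in ([0, 2, 0], [1, 1], [1, 1, 1], [1, 1, 1, 1]):
        print(genotype, len(compatible_pairs(genotype)))
```

Under this encoding, an individual heterozygous at k sites (k ≥ 1) has 2^(k-1) possible haplotype pairs, and n biallelic SNPs allow up to 2^n distinct haplotypes, so 15 SNPs allow up to 2^15 = 32,768.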
Interest in the accurate inference of haplotype structure from unphased genotypic data has increased tremendously in recent years for several reasons. Relative to analysis of single polymorphisms, haplotypes can greatly improve one's ability to infer the evolutionary history of a DNA region [1]. Additionally, haplotypes can provide significant increases in statistical power to detect associations between a phenotype and genetic variation [3]. Indeed, several disease associations with haplotypes have been detected that were not apparent from single-site analyses [6].
There are three principal computational approaches to inferring haplotypes from unphased SNP data. The most commonly used approach is implementation of the expectation-maximization (EM) algorithm [10]. This method is computationally intensive and is usually combined with various strategies to simplify the task (e.g., by considering only subsets of the sites at a time) or to minimize the number of potential haplotypes that must be considered [11]. A more recent alternative is application of Bayesian methods that incorporate prior expectations based upon population genetic principles [13]. A third method based on parsimony (the "subtraction method" [16]) has the limitation that haplotypes are assigned only in unambiguous cases [17], and the level of ambiguity generally increases with the number of sites considered or the number of sites at which an individual is heterozygous. This limitation is expected to be significant in large-scale analyses of SNP variation, and for this reason the subtraction method is not considered here. Unfortunately, it is unclear how accurate the EM and Bayesian approaches are, or which of the two is superior in inferring haplotypes, particularly when applied to empirical data. Data simulation [18] can explore the effect of a wide range of parameters and population dynamics (e.g., linkage disequilibrium, selection, population substructuring) but is unlikely to capture fully the complex combinations of these effects inherent in empirical data. On the other hand, comparisons using empirical data have been based on as few as six SNPs [17] or have employed data sets in which the number of SNPs or known haplotypes equals or greatly exceeds the number of individuals sampled [13]. Neither of these situations is likely to reflect accurately the sample sizes or numbers of SNPs that will be assayed with the high-throughput methods available today. To understand the relative performance of the various methods of haplotype inference, there is a need for comparisons that include both larger numbers of polymorphic sites and biologically more complex correlations among the sites. In this study the performance of several leading methods of haplotype inference is compared for a large data set (154 individuals, 15 SNPs) from a locus subject to a combination of mutation, recombination, and gene conversion.
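To illustrate the general idea behind EM-based haplotype inference, the following is a minimal sketch that estimates haplotype frequencies from unphased genotypes. It assumes Hardy-Weinberg proportions, a 0/1/2 genotype encoding, a uniform starting point, and a fixed number of iterations; the function names are hypothetical. It does not reproduce any of the specific implementations compared in this study, and it enumerates every phasing exhaustively, which is exactly the cost that practical programs try to avoid.

```python
from collections import defaultdict
from itertools import product


def phasings(genotype):
    """Unordered haplotype pairs consistent with a genotype coded 0/1/2 per SNP."""
    het = [i for i, g in enumerate(genotype) if g == 1]
    out = set()
    for phase in product((0, 1), repeat=len(het)):
        a = [g // 2 for g in genotype]  # homozygous sites are fixed
        b = list(a)
        for i, allele in zip(het, phase):
            a[i], b[i] = allele, 1 - allele
        out.add(tuple(sorted((tuple(a), tuple(b)))))
    return sorted(out)


def em_haplotype_frequencies(genotypes, n_iter=50):
    """Estimate haplotype frequencies by EM, assuming Hardy-Weinberg proportions."""
    options = [phasings(g) for g in genotypes]
    haps = sorted({h for opts in options for pair in opts for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}  # uniform starting point (assumption)
    for _ in range(n_iter):
        counts = defaultdict(float)
        for opts in options:
            # E-step: weight each phasing by its probability under current frequencies.
            weights = [freq[h1] * freq[h2] * (1 if h1 == h2 else 2) for h1, h2 in opts]
            total = sum(weights)
            for (h1, h2), w in zip(opts, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: expected haplotype counts, divided by 2N chromosomes.
        freq = {h: counts[h] / (2 * len(genotypes)) for h in haps}
    return freq


if __name__ == "__main__":
    # Four homozygotes and one double heterozygote (toy data, not from the study).
    sample = [[0, 0], [0, 0], [2, 2], [2, 2], [1, 1]]
    for hap, f in em_haplotype_frequencies(sample).items():
        print(hap, round(f, 4))
```

In this toy sample the double heterozygote is resolved almost entirely as the pair of haplotypes already seen in the homozygotes, which shows how the frequency estimates carry information about phase.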
The accuracy of computational haplotype inference improves as the magnitude of linkage disequilibrium (LD) among sites increases [17]. Gene conversion, operating in conjunction with normal recombination, can complicate the normal decay of LD with distance in a genomic region and can be expected to complicate the computational inference of haplotype structure. This issue has particular relevance to the human growth hormone locus. The five genes of the human growth hormone locus reside within about 45 kb on chromosome 17 [20]. Pituitary growth hormone (GH1) is by far the most thoroughly studied of the genes and lies at the 5' end of the cluster. The remaining four genes, placental growth hormone (GH2) and three chorionic somatomammotropins (CS1, CS2, and pseudogene CS5 or CSHP1), are expressed only in the placenta. The promoter region of GH1 is unusually polymorphic, with 16 SNPs having been identified in a span of 535 bp [21]. Most of these SNPs occur at the comparatively small number of sites that exhibit sequence differences among the five genes of the GH locus, and this has been interpreted as evidence of gene conversion [21]. A survey of 25 SNPs in the entire promoter and coding region of GH1 (Adkins et al., in review) indicates that this bias towards polymorphism at sites of intergenic divergence is quite extreme and supports the hypothesis that gene conversion plays a role in the pattern of variation in the GH1 gene in addition to mutation and recombination. In 154 recruits to the British army, Horan et al. [23] used cloning and sequencing to empirically determine 36 haplotypes based on 15 of the promoter SNPs previously identified (one site identified by [21] was invariant). This study takes advantage of the exhaustive work of Horan et al. [23] to compare the relative accuracy of some of the major implementations of the EM and Bayesian approaches to haplotype inference.
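For readers less familiar with LD, the sketch below computes two standard pairwise measures from phased haplotypes: the coefficient D = p_AB - p_A p_B and the squared correlation r² = D² / (p_A (1 - p_A) p_B (1 - p_B)). The 0/1 allele coding and the function name `pairwise_ld` are assumptions made for illustration; the text above does not specify which LD statistic is used in this study.

```python
def pairwise_ld(haplotypes, i, j):
    """Return (D, r^2) between SNPs i and j from phased haplotypes coded 0/1.

    haplotypes: sequence of equal-length tuples, one per chromosome
    (a hypothetical input format chosen for this sketch).
    """
    n = len(haplotypes)
    p_i = sum(h[i] for h in haplotypes) / n           # frequency of allele 1 at SNP i
    p_j = sum(h[j] for h in haplotypes) / n           # frequency of allele 1 at SNP j
    p_ij = sum(h[i] * h[j] for h in haplotypes) / n   # frequency of the 1-1 haplotype
    d = p_ij - p_i * p_j
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    # r^2 is undefined when either site is monomorphic in the sample.
    r2 = d * d / denom if denom > 0 else float("nan")
    return d, r2


if __name__ == "__main__":
    complete_ld = [(0, 0), (0, 0), (1, 1), (1, 1)]
    no_ld = [(0, 0), (0, 1), (1, 0), (1, 1)]
    print(pairwise_ld(complete_ld, 0, 1))
    print(pairwise_ld(no_ld, 0, 1))
```

When only two of the four possible two-site haplotypes are present, r² equals 1 and phase is fully predictable from genotype; when all four occur at equal frequency, r² equals 0 and a double heterozygote cannot be resolved from frequency information alone.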