Statistical inference of haplotypes from multilocus genotypes is the most common method to retrieve haplotype information lost during multisite genotyping. By testing the ability of inference methods to reconstruct true haplotypes from shuffled phase-known data, we find some sobering messages for those seeking to use inferred haplotypes for subsequent analysis. For example, in the KLK region (48 kb long and 400 SS), inferred haplotypes contain an average of 25-30% misassigned ambiguous SS. Errors are typically clustered, and performance can be improved by removing rare SS or considering shorter regions, but even then accuracy remains low.
Very recently, Marchini et al. assessed the accuracy of statistical haplotype inference in unrelated individuals using haplotypes inferred from HapMap family data. That study concludes that phasing accuracy is high even for unrelated individuals. Several facets of that dataset differ from ours, including the origin of phase information (since family-reconstructed haplotypes contain some sites of unresolved phase), ascertainment of polymorphisms (HapMap SNPs are ascertained at ~1 SNP every 5 kb), sample size and origin (Marchini et al. considered a single, larger, European sample), and number of regions (that study considered 100 regions). Despite those differences, the two studies reveal similar results. gSSE (Incorrect genotype percentage in Marchini et al.) is very similar between the two studies, and SwE in this study matches that of Marchini et al. for the common SS datasets and is higher only by a factor of 2 when considering all SS. Nevertheless, conclusions differ, mainly because the two studies use different error measures. We use SSE as a measure of single-site error, which represents the percentage of ambiguous sites incorrectly assigned to a haplotype. In contrast, Marchini et al. chose gSSE, which represents the percentage of incorrectly assigned sites among all sites, ambiguous or not. Given the large number of homozygous sites in any given individual, gSSE is necessarily lower than SSE (see ). In our opinion, SSE is a better measure of accuracy (since it is independent of site frequency), and it better reflects the uncertainty that haplotype reconstruction introduces into subsequent analysis (since the heterozygous sites of an individual are the ones that discriminate between its haplotypes).
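The difference between the two error measures can be illustrated with a minimal sketch. It assumes that an "ambiguous" site is any site at which the individual is heterozygous, that inferred haplotypes are consistent with the genotype, and that errors are counted under the more favorable of the two haplotype labelings; these are simplifying assumptions for illustration, not necessarily the exact definitions used in either study. The function names and the toy haplotypes are hypothetical.

```python
def site_errors(true_pair, inferred_pair):
    """Return (n_wrong, n_het, n_sites) for one individual.

    Each pair holds two equal-length haplotype strings over '0'/'1'.
    Assumption: ambiguous sites are the heterozygous sites.
    """
    t1, t2 = true_pair
    i1, _ = inferred_pair
    n_sites = len(t1)
    het = [k for k in range(n_sites) if t1[k] != t2[k]]
    # Haplotype labels are arbitrary, so compare the first inferred
    # haplotype against both true haplotypes and keep the smaller count.
    wrong_a = sum(i1[k] != t1[k] for k in het)
    wrong_b = sum(i1[k] != t2[k] for k in het)
    return min(wrong_a, wrong_b), len(het), n_sites


def sse_and_gsse(individuals):
    """Return (SSE, gSSE) in percent over (true_pair, inferred_pair) items."""
    wrong = het = sites = 0
    for true_pair, inferred_pair in individuals:
        w, h, s = site_errors(true_pair, inferred_pair)
        wrong += w
        het += h
        sites += s
    sse = 100.0 * wrong / het if het else 0.0       # errors per ambiguous site
    gsse = 100.0 * wrong / sites if sites else 0.0  # errors per site, any kind
    return sse, gsse


# Toy individual: 10 sites, 4 heterozygous, 1 of them misassigned.
true_pair = ("0101000000", "1010000000")
inferred_pair = ("0100000000", "1011000000")
sse, gsse = sse_and_gsse([(true_pair, inferred_pair)])
print(sse, gsse)
```

In this toy case the same single error gives an SSE of 25% (1 of 4 heterozygous sites) but a gSSE of 10% (1 of 10 sites), because the homozygous sites enlarge the denominator without ever contributing errors.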
Besides the origin of phase information and ascertainment of SNPs, our data differ from recent assessments [Stephens and Scheet, 2005; Marchini et al., 2006; Scheet and Stephens, 2006] in that our sample contains both AA and EA individuals, exactly as one finds in association tests in the US. According to our results, this mixed ancestry does not negatively affect the phasing process. In fact, our results suggest that in such a mixed-ancestry (AA and EA) sample, the best phasing results are probably obtained by reconstructing haplotypes on the combined sample, at least for small samples. Such pooling increases sample size, which is especially beneficial when samples are small. In addition, the presence of EA chromosomes in the sample may help in the reconstruction of AA haplotypes, as variability outside of Africa is mostly a subset of African variation [Tishkoff et al., 1996; Reich et al., 2001; Gabriel et al., 2002; Kidd, 2004]. Moreover, pooling of populations may result in deviations from Hardy-Weinberg equilibrium toward an excess of homozygosity. Even if deviations from equilibrium violate the assumptions of the methods, the algorithms seem robust to such deviations [Stephens et al., 2001] and, in cases of increased homozygosity, such deviations may improve haplotype inference by reducing the percentage of heterozygous sites and increasing LD.
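The excess of homozygosity produced by pooling can be sketched numerically. The example below assumes a single biallelic site, Hardy-Weinberg proportions within each subpopulation, and purely hypothetical allele frequencies and mixing weights; it is an illustration of the standard population-genetic argument, not an analysis of our data.

```python
def pooled_heterozygosity(freqs, weights):
    """Compare heterozygosity in a pooled sample with its HWE expectation.

    freqs   : allele frequency of one allele in each subpopulation
    weights : fraction of the pooled sample from each subpopulation
              (assumed to sum to 1)
    Returns (actual, expected): heterozygote fraction actually present in
    the pool, and the fraction expected if the pool were itself in HWE.
    """
    # Each subpopulation is assumed to be in HWE on its own.
    actual = sum(w * 2 * p * (1 - p) for p, w in zip(freqs, weights))
    p_bar = sum(w * p for p, w in zip(freqs, weights))
    expected = 2 * p_bar * (1 - p_bar)
    return actual, expected


# Hypothetical frequencies 0.2 and 0.6, pooled in equal proportions.
actual, expected = pooled_heterozygosity([0.2, 0.6], [0.5, 0.5])
deficit = expected - actual
print(actual, expected, deficit)
```

With these illustrative values the pooled sample contains about 40% heterozygotes where about 48% would be expected under equilibrium. The deficit equals twice the variance in allele frequency between the subpopulations, so it vanishes when the frequencies are equal.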
Demographic history influences accuracy of haplotype inference not only by stratifying populations, but also because recent expansion has resulted in a high proportion of rare SNPs and haplotypes, which hamper the reconstruction. This issue can be overcome by simply not considering rare SS, but such data truncation is not desirable in many re-sequencing efforts where discovery of all variation may be a key assumption of statistical models to be applied to the data. Past action of natural selection can also negatively affect haplotype reconstruction, either by increasing the coalescence time of chromosomes (by balancing selection) or by creating local genealogies similar to those of population expansions (by positive selection).
It is important to note that our study is centered on a single genomic region and two specific human populations, and the generality of these results is unclear. Nevertheless, LD and haplotype structure do not seem unusual for this region [Shimmin et al., in preparation]. Moreover, accuracy in the KLK region is similar to the average of the 100 regions from Marchini et al. and the 134 regions from Kukita et al., suggesting that this region is probably representative of the genome. Note that both of these studies considered ascertained SNPs, at intermediate frequencies and low densities. The concordance of that accuracy with our data, which are very dense in SS, suggests that, contrary to previous expectations, increasing the density of SNPs will not drastically improve haplotype reconstruction.
Unfortunately, simulation results show that the most important factors affecting haplotype reconstruction are, besides length of the region, those over which the investigator has little control (including the number of sites, demographic history of the population, or recombination rate), while elements that researchers can easily modulate (sample size or stratification of that sample) are somewhat less influential. Moreover, the reconstruction is considerably less accurate in real data than in simulations, revealing that additional factors not accounted for in our simulations may be hampering the inference. These include heterogeneity in mutation and recombination rates, gene conversion, and recurrent mutation, or a more complex demographic history than that considered here. The observation that haplotype inference could be even more difficult for studies based on tagSNPs is especially troublesome, given the extensive anticipated use of tagSNPs in association studies.
In principle, reconstruction of haplotypes can be considerably improved by the addition of extrinsic evidence that facilitates the statistical inference. Experimental determination of ambiguous phase by allele-specific amplification can dramatically improve phasing, even when limited to a small number of SNP pairs and individuals [Clark et al., 1998]. Unfortunately, this method is not suitable for large-scale studies because experiments are individually designed along the phasing process, the technique is methodologically complex, and careful interpretation of agarose gels post-PCR is required. An alternative is to obtain haplotype information from pedigree data [Schaid, 2002; Schouten et al., 2005]. For example, by genotyping mother-father-child trios, the HapMap project considerably improved many of its haplotype inferences [The International HapMap consortium, 2005; Marchini et al., 2006]. The disadvantage of this strategy is that the use of trios triples the study sample size (increasing costs) and requires access to family members, who may be unavailable. An additional possibility would be to use a set of “known” haplotypes as “predefined haplotypes” to help phase population genotype data. These could be obtained from HapMap data (for CEPH and Yoruban), from sequencing or genotyping of monosomic cell lines for candidate loci, or from the application of novel techniques to obtain phase information of long genomic regions [Kukita et al., 2005; Raymond et al., 2005].
Regardless of the method employed for phasing, uncertainty of the reconstruction should ideally be incorporated in subsequent analysis, especially in association testing. This is not straightforward, even if analyses were to integrate the uncertainty information provided by phasing software, given the complex relation between the accuracy of phase inference and the confidence reported by the algorithm. This relationship may be improved by new methods like fastPHASE, but at the price of lower accuracy [Scheet and Stephens, 2006 and this study].
In association studies, probably the simplest solution would be to treat the phase information as implicit in unphased genotypes, avoiding explicit haplotype phase inferences. The use of unphased genotype data has been proposed for LD analyses [Weir and Cockerham, 1989; Schaid, 2004] and for disease association mapping [Clayton et al., 2004; Morris et al., 2004], and these studies demonstrate that using unphased genotype data may have similar power and fewer error-associated problems than haplotype-based methods. In other cases, as in population-genetic studies where the structure of haplotypes is the object of interest, showing that results do not depend on the phasing method could support their robustness. In all cases, if haplotypes must be reconstructed, it would be wise to focus exclusively on regions of high LD (where phasing is more accurate) and to avoid reconstructing haplotypes across very long regions unless only extremely close SS are to be considered (e.g. in sliding window approaches).
Haplotype reconstruction is a valuable statistical tool that plays an essential role in a wide variety of genetic studies. It is important to recognize the extraordinary improvement of the methods over the last 15 years, to the point where highly complex inferences provide useful results. It is equally important, though, to face their limitations. Haplotype reconstruction based solely on genotype data remains a challenge, and in many cases the underlying biology is just too complex to be completely predicted by statistical algorithms. In order to reduce the effect of haplotype inaccuracies in subsequent analysis, some possible strategies include the introduction of external haplotype information, the restriction of inferences to specific regions of high LD, or the explicit accommodation of a distribution of admissible haplotypes to test robustness of subsequent inferences that use haplotype information.