As described earlier, haplotyping methods that assume linkage equilibrium among markers can produce inaccurate results, particularly for tightly linked (SNP) markers in LD [7
], and use of this haplotype information in linkage or association studies can adversely affect mapping accuracy [7
To account for LD among tightly linked markers, population-based haplotyping methods using unrelated individuals were developed [10
]. Because family data provide substantially more information for inferring haplotypes than samples of unrelated individuals [1
], several of these population-based methods were extended to use nuclear families, father-mother-child trios, or sibships [1
The extended methods using nuclear families or trios [e.g., 1
] assume that all parents in the nuclear families or trios are sampled independently from a population in HWE, and they often assume that no recombination occurs in the transmission of haplotypes from the parents to children. These methods infer haplotypes for the independent parents by using the population-based approaches and by excluding the parental haplotype pairs that are not consistent with the children's genotype data. This idea is similar to that of the rule-based method of O'Connell [37
]. The extended methods account for LD among markers and can jointly use population data and nuclear family or trio data.
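The exclusion step described above can be illustrated with a minimal sketch. It assumes biallelic SNPs coded as counts of one allele (0, 1, or 2), no missing data, and no recombination; the function names `compatible_pairs` and `trio_consistent` are hypothetical and do not come from any of the cited programs.

```python
from itertools import product

def compatible_pairs(genotype):
    """All unordered haplotype pairs whose allele sums equal the genotype.

    genotype: tuple of allele counts (0, 1, or 2) per SNP (assumed coding).
    """
    het = [i for i, g in enumerate(genotype) if g == 1]
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        # Homozygous sites are fixed: 0 -> allele 0 on both, 2 -> allele 1 on both.
        h1 = [g // 2 for g in genotype]
        h2 = [g // 2 for g in genotype]
        # Heterozygous sites carry opposite alleles on the two haplotypes.
        for i, b in zip(het, bits):
            h1[i], h2[i] = b, 1 - b
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)

def trio_consistent(father, mother, child):
    """Keep parental phasings under which the child can receive one intact
    (non-recombined) haplotype from each parent."""
    keep = []
    for f in compatible_pairs(father):
        for m in compatible_pairs(mother):
            if any(tuple(a + b for a, b in zip(hf, hm)) == tuple(child)
                   for hf in f for hm in m):
                keep.append((f, m))
    return keep

# A doubly heterozygous father has two possible phasings; the child's
# genotype rules out the one that pairs (0, 1) with (1, 0).
print(trio_consistent((1, 1), (0, 0), (1, 1)))
```

In this toy trio the child must have received haplotype (1, 1) from the father, so only the father's phasing (0, 0)/(1, 1) survives. The population-based model is then needed only for parental phasings that the children leave unresolved.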
We note that the purpose of these extended methods is to infer haplotypes and estimate haplotype frequencies for parents (founders), rather than to infer haplotype configurations for entire families (or pedigrees) as done by the methods described in previous sections. The extended methods using trios can be applied to inferring haplotypes for genome-wide SNP markers (say 1 million SNPs [1
]), whereas the rule-based methods that assume no recombination and use founder population haplotype frequencies, such as ZAPLO [37
] and HAPLORE [38
], are more appropriate for tightly linked markers in short chromosomal regions (e.g., candidate genes). This is because these rule-based methods are designed for larger pedigrees, in which haplotyping becomes computationally intensive. Below we review the extended methods.
Maximum likelihood methods implemented via an EM algorithm have been widely used to estimate haplotype frequencies in population data under the assumptions of HWE and random mating [e.g., 75
]. Rohde and Fuerst [23
] applied a maximum likelihood EM algorithm to haplotyping parents in nuclear family data with dense markers and showed that this method had higher haplotyping accuracy than the software GENEHUNTER [15
], which assumes linkage equilibrium among markers. However, this method can only be applied to 30 or fewer biallelic loci [23
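For readers unfamiliar with the EM approach, the following is a minimal sketch of haplotype frequency estimation for unrelated individuals under HWE. It is not the nuclear-family algorithm of Rohde and Fuerst; the genotype coding (allele counts 0, 1, 2), the uniform starting values, the fixed iteration count, and the names `phasings` and `em_haplotype_freqs` are all assumptions made for illustration.

```python
from itertools import product
from collections import defaultdict

def phasings(genotype):
    """Ordered haplotype pairs (h1, h2) compatible with an unphased genotype."""
    choices = [((0, 0),) if g == 0 else ((1, 1),) if g == 2 else ((0, 1), (1, 0))
               for g in genotype]
    # Each combo holds one (allele on h1, allele on h2) pair per locus.
    return [tuple(zip(*combo)) for combo in product(*choices)]

def em_haplotype_freqs(genotypes, n_iter=50):
    """EM estimate of haplotype frequencies assuming HWE and random mating."""
    n_loci = len(genotypes[0])
    haps = list(product((0, 1), repeat=n_loci))
    freq = {h: 1.0 / len(haps) for h in haps}      # uniform start (assumption)
    for _ in range(n_iter):
        counts = defaultdict(float)
        for g in genotypes:
            pairs = phasings(g)
            # E-step: posterior weight of each phasing is proportional to
            # the product of its two haplotype frequencies (HWE).
            weights = [freq[h1] * freq[h2] for h1, h2 in pairs]
            total = sum(weights)
            for (h1, h2), w in zip(pairs, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: expected haplotype counts divided by the number of chromosomes.
        freq = {h: counts[h] / (2 * len(genotypes)) for h in haps}
    return freq

# Toy data: two loci in strong LD, with two ambiguous double heterozygotes.
data = [(0, 0)] * 4 + [(2, 2)] * 4 + [(1, 1)] * 2
print(em_haplotype_freqs(data))
```

The number of candidate haplotypes grows as 2 to the power of the number of loci, which gives some intuition for why plain EM implementations of this kind are restricted to a modest number of biallelic loci.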
To accommodate a large number of (SNP) markers in nuclear families, Lin et al. [5
] proposed a haplotyping method based on a Bayesian MCMC approach incorporating a variant of the partition ligation method of Niu et al. [77
]. The method groups (SNP) markers into high LD blocks, reconstructs haplotypes for subgroups of markers within each block, and then reconstructs haplotypes for blocks. It can analyze thousands of dense SNPs and more than 1,000 chromosomes.
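The divide-and-combine idea behind partition ligation can be sketched as follows. This is only an outline of the candidate-generation step under simplifying assumptions (fixed-size marker units rather than LD-defined blocks, and partial haplotype frequencies supplied by some per-unit phasing routine); the helper names `partition`, `top_haplotypes`, and `ligate` are hypothetical and are not taken from the cited software.

```python
from itertools import product

def partition(n_markers, unit_size):
    """Split marker indices 0..n_markers-1 into consecutive small units."""
    return [range(s, min(s + unit_size, n_markers))
            for s in range(0, n_markers, unit_size)]

def top_haplotypes(freqs, keep):
    """Retain the `keep` most frequent partial haplotypes of one unit."""
    return sorted(freqs, key=freqs.get, reverse=True)[:keep]

def ligate(left, right):
    """Candidate haplotypes for two adjacent units: every concatenation of a
    retained left partial haplotype with a retained right one."""
    return [l + r for l, r in product(left, right)]

# Toy example: per-unit frequencies are assumed to come from a separate
# phasing step (e.g., EM or MCMC run within each unit).
unit_a = {(0, 0): 0.6, (1, 1): 0.3, (0, 1): 0.1}
unit_b = {(0,): 0.7, (1,): 0.3}
candidates = ligate(top_haplotypes(unit_a, 2), top_haplotypes(unit_b, 2))
print(candidates)
```

Because only a bounded number of partial haplotypes is carried forward from each unit, the candidate set at each ligation stays small, and frequencies are then re-estimated on that restricted set rather than on all possible haplotypes.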
The process of using genotype information on children in nuclear families to help reconstruct parental haplotypes in the method of Lin et al. [5
] essentially assumes that no recombination occurs in the transmission of haplotypes from the parents to children, but recombination can be accommodated to some extent as follows. If a child has genotype G1G2G3G4 at four loci with a recombination between the 2nd and 3rd markers in one of the child's haplotypes, then the genotype is split into two genotypes, G1G2 00 and 00 G3G4, where 0 denotes a missing genotype. This process cannot handle the rare families with multiple children in which every child inherits a recombined haplotype or in which a child inherits two recombined haplotypes.
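The genotype-splitting step can be written down in a few lines. This is a minimal sketch, assuming the recombination breakpoint is already known and that a missing genotype is coded as 0, as in the text; the function name `split_at_recombination` is hypothetical.

```python
def split_at_recombination(genotype, breakpoint, missing=0):
    """Split one multilocus genotype into two partially missing genotypes.

    breakpoint: number of markers to the left of the recombination event.
    """
    n = len(genotype)
    left = list(genotype[:breakpoint]) + [missing] * (n - breakpoint)
    right = [missing] * breakpoint + list(genotype[breakpoint:])
    return left, right

# Four loci with a recombination between the 2nd and 3rd markers.
left, right = split_at_recombination(["G1", "G2", "G3", "G4"], 2)
print(left)   # ['G1', 'G2', 0, 0]
print(right)  # [0, 0, 'G3', 'G4']
```

Each of the two resulting genotypes is consistent with a non-recombined transmission over the markers it retains, so it can be handled by the no-recombination machinery.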
Marchini et al. [1
] described the extension of five of the leading population-based haplotyping methods to use father-mother-child trios. These five extended methods (including the method of Lin et al. [5
]) incorporate the partition ligation of Niu et al. [77
] and are able to process thousands of high-density markers. The extended methods include Bayesian approaches using coalescent-based models (e.g., the PHASE (v2.1) algorithm) [78
], a perfect phylogeny approach using constrained maximum likelihood [80
], and a maximum likelihood EM method called tripleM [73
]. The coalescent-based models attempt to capture the fact that, over short genomic regions, sampled chromosomes tend to cluster together into groups of similar haplotypes. The perfect phylogeny approach also accounts for this 'clustering property', whereas tripleM does not. PHASE (v2.1) can also internally re-estimate a variable population-scaled recombination rate across the region being considered [1
Marchini et al. [1
] comprehensively compared the five extended methods when applied to both trios and unrelated individuals by using data simulated based on the coalescent model as well as data from the HapMap project (http://www.hapmap.org/
). All methods provided highly accurate estimates of haplotypes when applied to trio data sets. Overall the PHASE (v2.1) algorithm had the highest accuracy for all data sets considered. Although it is one of the slowest methods, PHASE (v2.1) was used to infer haplotypes for the 1 million-SNP HapMap data set [1
All methods extended from the population-based approaches described above essentially assume that no recombination occurs in the transmission of haplotypes from parents to children in the trio or nuclear family data. This assumption is reasonable for high-density biallelic (SNP) markers in a short chromosomal region (at most several megabases), but it may not be appropriate for a long chromosomal region (tens of centimorgans) in families with multiple children or in pedigrees. Multiple children in a family may provide more information for inferring haplotypes than a single child. In addition, the extended methods cannot deal with (multi-generational) pedigrees. Inferring haplotype configurations in pedigrees by modeling LD and recombination among markers is useful for fine mapping in linkage analysis and association studies. As stated earlier, the method of Abecasis and Wigginton [22
] developed for pedigrees with clustered marker data can account for marker-marker LD within each cluster and recombination between clusters. A possible problem is that ignoring LD among markers from different clusters may generate inaccurate results when analyzing a large number of high-density markers.