With current chip technology, genome-wide scans of hundreds of thousands of single nucleotide polymorphisms (SNPs) in thousands of individuals are affordable for large-scale association studies (Barrett and Cardon, 2006
). Several genome-wide studies have already been published (Amundadottir et al.; Arking et al.; Smyth et al.; The Wellcome Trust Case Control Consortium, 2007
) and, as genotyping prices continue to drop, we expect to see many more in the near future. With datasets of this size, the need for efficient, accurate association mapping methods is evident. Many studies resort to a marker-by-marker approach, e.g. a simple Fisher's exact test or χ² test, but unless the trait-influencing variants are themselves typed, the power of such tests is limited because they detect association only indirectly through linkage disequilibrium (LD); multi-marker approaches are therefore generally preferred (Pe'er et al.
). A trade-off must be made, however, between sophistication and computational tractability.
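To make the marker-by-marker approach concrete, the following minimal Python sketch applies a Pearson χ² test to a 2×2 table of allele counts at a single SNP. The function name and the (minor, major) layout of the counts are illustrative assumptions and are not taken from any published tool.

```python
import math


def chi2_allelic_test(case_counts, control_counts):
    """Pearson chi-square test on a 2x2 table of allele counts.

    case_counts and control_counts are (minor, major) allele counts
    (an assumed layout, chosen only for this illustration).
    Returns (statistic, p_value) for 1 degree of freedom.
    """
    a, b = case_counts
    c, d = control_counts
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        # Monomorphic marker or empty group: no evidence of association.
        return 0.0, 1.0
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Chi-square survival function with 1 d.f.: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Example: 30 of 100 case alleles vs 10 of 100 control alleles are minor.
stat, p = chi2_allelic_test((30, 70), (10, 90))
print(f"chi2 = {stat:.2f}, p = {p:.2e}")
```

In a genome scan, a test of this kind is applied independently at each typed marker, which is why a causal variant that is not itself typed can be detected only indirectly through LD.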
Recently, one of us developed a new multi-SNP method called Blossoc (Mailund et al.) that, while similar in accuracy, is orders of magnitude faster than other multi-SNP methods and can analyze whole-genome data in a few CPU hours. Blossoc resembles other recent methods, e.g. those of Zöllner and Pritchard (2005), Minichiello and Durbin (2006) and Clark et al. It constructs local tree-like genealogies along the genome and scores those genealogies according to how the cases and controls are clustered, the motivation being that near a disease-predisposing SNP, the cases will tend to cluster together in the underlying genealogy. Compared with the other methods, Blossoc achieves its much faster running time by taking a simpler approach to constructing local trees. Instead of sampling local trees from the coalescent with recombination and averaging scores over the sampled trees, it relies on a deterministic, efficient algorithm to build a single tree for each locus, assuming the infinite-sites model of mutation. Sevon et al. have recently proposed a method that also constructs a single tree per locus. Their approach differs from Blossoc in how local trees are constructed and scored. Whereas Sevon et al. use a time-consuming permutation test to score trees, Blossoc treats each tree as a decision tree and scores it with standard methods from the data mining literature (see Mailund et al. for details). Tachmazidou et al. construct local trees using the same approach as Blossoc, but score them using a sophisticated MCMC algorithm that is relatively time consuming.
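As an illustration of what the infinite-sites assumption implies for tree building, the sketch below implements the standard four-gamete test: a set of biallelic sites can be explained by a single tree without recurrent mutation or recombination exactly when no pair of sites shows all four allele combinations. This is a generic textbook check, not the Blossoc implementation; the function names and the encoding of haplotypes as strings of '0' and '1' are assumptions made for the example.

```python
from itertools import combinations


def compatible(haplotypes, i, j):
    """Four-gamete test for sites i and j.

    The two sites fit on one tree under the infinite-sites model unless
    all four allele combinations (00, 01, 10, 11) are observed.
    """
    gametes = {(h[i], h[j]) for h in haplotypes}
    return len(gametes) < 4


def admits_perfect_phylogeny(haplotypes, sites):
    """True if every pair of the given sites passes the four-gamete test."""
    return all(compatible(haplotypes, i, j) for i, j in combinations(sites, 2))


# Hypothetical haplotypes over four biallelic sites.
haps = ["0000", "1000", "1100", "1110"]
print(admits_perfect_phylogeny(haps, range(4)))             # compatible set
print(admits_perfect_phylogeny(haps + ["0100"], range(4)))  # sites 0 and 1 conflict
```

A deterministic tree builder can use a check of this kind to decide how far a region around a locus may be extended before the markers stop fitting a single tree.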
As a consequence of its simple approach to tree construction and scoring, Blossoc is computationally very efficient. Further, the computer experiments reported in Mailund et al. indicate that this efficiency is achieved with little, if any, loss in accuracy compared with more sophisticated methods. However, a major limitation of the original version of Blossoc, shared by the other methods, is its reliance on phased haplotype data. Even with fastPHASE (Scheet and Stephens, 2006), phasing a whole-genome dataset requires tens of days of CPU time, making this step the major bottleneck when using computationally efficient methods such as Blossoc.
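The difficulty of phasing can be seen from a simple count: a genotype with k heterozygous sites is consistent with 2^(k-1) distinct haplotype pairs when k ≥ 1. The sketch below enumerates these pairs for a single individual. The 0/1/2 genotype coding (number of copies of allele 1 at each site) and the function name are assumptions made for illustration; the code is not part of any phasing program.

```python
from itertools import product


def consistent_haplotype_pairs(genotype):
    """Enumerate unordered haplotype pairs consistent with one genotype.

    genotype: sequence of 0, 1 or 2, the number of copies of allele 1
    at each site (assumed coding). Heterozygous sites are those equal to 1.
    """
    het = [k for k, g in enumerate(genotype) if g == 1]
    base = [g // 2 for g in genotype]  # homozygous sites: 0 -> 0, 2 -> 1
    if not het:
        return [(tuple(base), tuple(base))]
    first, rest = het[0], het[1:]
    pairs = []
    for choice in product((0, 1), repeat=len(rest)):
        h1, h2 = list(base), list(base)
        # Fix the first heterozygous site so swapped pairs are not counted twice.
        h1[first], h2[first] = 0, 1
        for site, allele in zip(rest, choice):
            h1[site], h2[site] = allele, 1 - allele
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


# Three heterozygous sites give 2 ** (3 - 1) = 4 candidate phasings.
for h1, h2 in consistent_haplotype_pairs([1, 1, 1, 0, 2]):
    print(h1, h2)
```

Because the number of candidate phasings doubles with each additional heterozygous site, exhaustive enumeration is infeasible at genome scale, which is why phasing is normally done with statistical methods.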
In this article, we devise a method that eliminates the need for preanalysis phasing of genotypes into haplotypes. Our approach combines the ideas in Blossoc with a recently found linear-time algorithm (Ding et al.) for phasing genotypes on trees. In this way, our new method builds local phylogenies directly from unphased data. Inferring trees from unphased genotype data is slightly slower than inferring trees from phased haplotype data, but our method is still capable of scanning the entire human genome in a few days. We also develop a new Bayesian score for the association between a local tree and the disease phenotype. Using simulated datasets, we compare the mapping results for unphased data with those for the true haplotype data and show that there is little loss in accuracy or ranking quality. As a proof of principle, we apply our method to a genome-wide dataset for Parkinson disease (Fung et al.). We remark that our tree construction algorithm is not restricted to Blossoc; other association mapping methods based on scoring local genealogies, such as those of Sevon et al. and Tachmazidou et al., may also be generalized in the same way, enabling them to analyze unphased data.