Owing to the availability of high-throughput genotyping technologies and the comprehensive coverage of common genetic variants by the HapMap project (The International HapMap Consortium, 2005), (The International HapMap Consortium, 2007), association studies are widely used to dissect the genetic basis of complex diseases, at scales ranging from a handful of candidate genes to the whole genome. A typical association study begins by prioritizing single nucleotide polymorphisms (SNPs), either by genotyping a small subsample or by selecting tagSNPs from an existing database such as the HapMap project. These tagSNPs are subsequently genotyped in a sample of cases and controls (Smith and others, 2007). Although the causal variants may not be interrogated directly, it is hoped that linkage disequilibrium (LD) mapping can narrow the search down to a small neighborhood around the causal variants. However, despite the explosion of available genetic information, challenges remain for statistical analyses due to the diversity of LD patterns in the human genome (The International HapMap Consortium, 2005), the sheer number of SNPs being genotyped, and the complex nature of common disorders. Currently, the single-SNP scan and the multiple-SNP haplotype analysis are 2 commonly used approaches. Power comparisons between these 2 approaches are inconclusive, as the outcome depends on the underlying disease model and the local LD pattern (Morris and Kaplan, 2002), (Roeder and others, 2005). It has been suggested that a single-SNP scan is an effective method to detect common disease alleles, while haplotype-based methods are useful to map more recent, relatively rare mutations (Lin and others, 2004), (Schaid, 2004), though strategies to construct informative haplotypes (clusters) are far from mature. This paper pertains to adaptive SNP/haplotype analysis exploiting LD among SNPs in a candidate chromosomal region.
When many SNPs in a targeted chromosomal region are under investigation, a naive haplotype analysis using all SNPs is often ineffective because the large number of haplotypes leads to too many degrees of freedom in an omnibus test. Instead, one may first divide the SNPs into haplotype blocks of high LD and then perform a haplotype analysis in each block (Barrett and others, 2005). However, the block definition itself is arbitrary, and typically there is substantial correlation between blocks that is not captured. An alternative strategy is to construct a genealogical tree of haplotypes, known as a cladogram, and study the correlation between the disease phenotype and the clusters (clades) of haplotypes, thereby reducing the dimensionality of haplotype analyses (Templeton and others, 1987), (Seltman and others, 2001), (Molitor and others, 2003), (Durrant and others, 2004), (Morris, 2006). The motivation is that the causal allele should be embedded within the cladogram that describes the evolution of the sampled chromosomes. However, an accurate construction of the underlying cladogram typically relies on the assumption that there is no recombination. This is hardly true for any given region because of background recombination in the human genome, particularly for regions near or within recombination hot spots. To address this, a sliding window approach was proposed in the hierarchical clustering algorithm called CLADHC (Durrant and others, 2004), yet no single window size can be optimal everywhere because local LD varies widely across the human genome. Even in an extreme scenario with complete LD, it was pointed out that cladistic approaches cannot be optimal in all disease models (Clayton and others, 2004), since the rule for clustering haplotypes is based solely on genotypic data.
Other strategies for multilocus analyses exist (e.g. Browning, 2006, Yu and Schaid, 2007, Li and others, 2007). These methods generally assume that local LD structures are somewhat contiguous, so the order of SNP locations is critical. However, SNPs that are far apart can still display strong LD, so a contiguous scan might miss signals. Similarly, multiple nonsynonymous mutations in a gene may jointly disrupt the function of its coded protein, possibly with interactions, regardless of their order on the chromosome. Furthermore, none of the aforementioned methods accounts for the extra variability incurred by phase ambiguity in the model searching process, except the computationally intensive MCMC approach (Morris, 2006).
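The point that LD need not follow SNP order can be illustrated with a minimal sketch that computes the standard pairwise LD measure r^2 from phased haplotypes. The function name `ld_r2` and the toy haplotypes are hypothetical and invented purely for illustration; they are not part of any method cited above and are not real data.

```python
from typing import Sequence


def ld_r2(haplotypes: Sequence[Sequence[int]], i: int, j: int) -> float:
    """Squared correlation r^2 between SNPs i and j.

    Assumes phased haplotypes coded 0/1 (illustrative assumption).
    """
    n = len(haplotypes)
    p_i = sum(h[i] for h in haplotypes) / n          # allele frequency at SNP i
    p_j = sum(h[j] for h in haplotypes) / n          # allele frequency at SNP j
    p_ij = sum(h[i] * h[j] for h in haplotypes) / n  # frequency of the 1-1 haplotype
    d = p_ij - p_i * p_j                             # LD coefficient D
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:
        raise ValueError("r^2 is undefined when a SNP is monomorphic")
    return d * d / denom


# Made-up toy haplotypes over 3 ordered SNPs (not real data).
toy = [
    (0, 0, 0),
    (0, 1, 0),
    (1, 0, 1),
    (1, 1, 1),
]

r2_adjacent = ld_r2(toy, 0, 1)  # neighboring SNPs
r2_distant = ld_r2(toy, 0, 2)   # SNPs separated by another SNP
```

In this toy example the first and third SNPs are in complete LD while the first and second are uncorrelated, which is the kind of pattern that a strictly contiguous window could fail to exploit.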
In this article, we propose SNP-Haplotype Adaptive REgression (SHARE), an adaptive algorithm that searches for a subset of SNPs that fully captures the genetic association in a candidate chromosomal region. The selected set of SNPs is the most informative in a heuristic sense: adding more SNPs introduces noise, and excluding any SNP in the set may lose information. In contrast to the cladistic approaches, where the clustering process depends solely on haplotypes, in our algorithm both the trait and the genotypes guide the model selection process, and the SNP selection does not depend on the order of the SNPs. Depending on the genealogy and the ancestral recombination among disease liability mutations and markers, the most informative set may contain a single SNP or several SNPs that lay a foundation for haplotype analyses, thereby integrating a single-locus scan and a haplotype analysis into 1 unified framework. Furthermore, our algorithm stands apart from existing methods in that it accommodates phase ambiguity seamlessly by treating the inference of haplotypes as part of the procedure. The method is tailored to genetic association studies with a fair number of tagSNPs genotyped in a candidate gene approach, but, as we discuss in Section 4, it can be extended to genome-wide association studies.