Single Nucleotide Polymorphisms (SNPs) are single base pair differences between individuals in a population. The recent completion of the Human Genome Project has helped facilitate the discovery of millions of SNPs and their use in genetic association studies for human disease [1]. Association studies work on the premise that SNP genotypes are correlated with a disease phenotype. Individual SNPs are genotyped, and allele frequencies are compared between groups of affected and unaffected individuals. A SNP tested for association must either be the causative allele or be in linkage disequilibrium (LD) with it. LD is the non-random association of alleles at adjacent loci [2]. SNPs in LD with the causative allele serve as proxies, so the association with the disease phenotype is maintained.
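Pairwise LD between two biallelic loci is commonly quantified by the coefficient D and its normalized form D'. A minimal sketch (the haplotype and allele frequencies below are invented for illustration):

```python
def d_prime(p_ab, p_a, p_b):
    """Normalized LD coefficient D' for alleles A and B, given the frequency
    of the AB haplotype (p_ab) and the allele frequencies p_a and p_b."""
    d = p_ab - p_a * p_b  # raw LD coefficient D: deviation from independence
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return d / d_max if d_max > 0 else 0.0

# Alleles that always co-occur are in complete LD (|D'| = 1); alleles
# distributed independently give D' = 0.
print(d_prime(p_ab=0.3, p_a=0.3, p_b=0.3))   # 1.0
print(d_prime(p_ab=0.25, p_a=0.5, p_b=0.5))  # 0.0
```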
Numerous studies have shown that the human genome contains regions of high LD with low haplotype diversity [3]. These regions are called haplotype blocks. The existence of haplotype blocks reduces the number of SNPs required in association studies: only the subset of tag SNPs that uniquely identifies the common haplotypes in a block needs to be typed. The frequencies of these haplotypes can then be compared between groups of affected and unaffected individuals [7].
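The tag SNP idea can be illustrated with a brute-force sketch: find the smallest subset of SNP positions whose alleles still distinguish every haplotype in a block (the haplotypes below are hypothetical):

```python
from itertools import combinations

def tag_snps(haplotypes):
    """Smallest set of SNP indices that discriminates all distinct haplotypes.
    Exhaustive search; fine for the handful of SNPs in a typical block."""
    n_snps = len(haplotypes[0])
    n_distinct = len(set(haplotypes))
    for k in range(1, n_snps + 1):
        for subset in combinations(range(n_snps), k):
            projected = {tuple(h[i] for i in subset) for h in haplotypes}
            if len(projected) == n_distinct:
                return list(subset)
    return list(range(n_snps))

# Three common haplotypes over four SNPs: no single SNP separates all three,
# but typing SNPs 0 and 1 together identifies each haplotype uniquely.
print(tag_snps(["0011", "0110", "1110"]))  # [0, 1]
```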
Haplotype blocks are defined computationally by various algorithms, which can be classified into three categories: diversity-based [3], LD-based [6], and information-theoretic [9]. Patil et al. [3] used a diversity-based greedy algorithm to partition chromosome 21 into haplotype blocks in a sample of 20 re-sequenced chromosomes. Their algorithm considers all blocks of one or more consecutive SNPs and defines a haplotype block wherever at least 80% of the observed haplotypes within the block are represented more than once in their sample of chromosomes. Overlapping block boundaries were eliminated by choosing the block with the maximum ratio of the number of SNPs in the block to the number of SNPs required to discriminate all haplotypes represented in the block. The process was repeated until the entire length of the chromosome was partitioned into haplotype blocks. Zhang et al. [8] subsequently provided a dynamic programming implementation of this approach in their software HapBlock [10].
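The coverage criterion described above can be sketched as follows, assuming "common" haplotypes are those observed more than once in the sample; the haplotypes and thresholds in the example are invented for illustration:

```python
from collections import Counter

def is_block(haplotypes, coverage=0.8, min_copies=2):
    """True if haplotypes seen at least min_copies times account for at least
    the given fraction of all chromosomes in the sample."""
    counts = Counter(haplotypes)
    common = sum(c for c in counts.values() if c >= min_copies)
    return common / len(haplotypes) >= coverage

# Four of five chromosomes carry a repeated haplotype: 80% coverage, a block.
print(is_block(["00", "00", "01", "01", "11"]))  # True
# Every haplotype is unique: no coverage by common haplotypes, not a block.
print(is_block(["00", "01", "10", "11"]))        # False
```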
Gabriel et al. [6] used an LD-based algorithm to define haplotype blocks in a worldwide sample of chromosomes from Africa, Asia, and Europe. The authors computed confidence bounds on the value of D', a standard measure of LD [11], and defined a pair of SNPs to be in strong LD (little evidence of recombination) if the one-sided 95% confidence bounds on D' exceed 0.7 for the lower bound and 0.98 for the upper bound. The authors defined a haplotype block as a region in which at least 95% of pairwise SNP comparisons show little evidence of recombination based upon their D' confidence bounds. The program Haploview [12] implements the method of Gabriel et al.
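Given precomputed D' confidence bounds for every SNP pair in a candidate region, the block test reduces to a simple count. The thresholds follow the description above; the bounds in the example are invented:

```python
def is_gabriel_block(pair_bounds, strong_fraction=0.95,
                     lower_cutoff=0.7, upper_cutoff=0.98):
    """pair_bounds: (lower, upper) 95% confidence bounds on D' per SNP pair.
    A pair is in strong LD if both bounds clear their cutoffs; the region is
    a block if at least strong_fraction of pairs are in strong LD."""
    strong = sum(1 for lo, hi in pair_bounds
                 if lo >= lower_cutoff and hi >= upper_cutoff)
    return strong / len(pair_bounds) >= strong_fraction

# 19 of 20 pairs in strong LD meets the 95% requirement exactly.
print(is_gabriel_block([(0.8, 0.99)] * 19 + [(0.1, 0.5)]))  # True
```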
Anderson and Novembre [9] use the Minimum Description Length (MDL) principle to define haplotype blocks, incorporating both LD decay between blocks and haplotype diversity within blocks. The MDL principle is an application of information theory to statistical modeling which searches for patterns in data [13]. The description length of a data set is the number of binary digits, or bits, needed to encode it [9]. Under Anderson and Novembre's method, the best set of block boundaries is the one that yields the shortest description length for a set of SNP genotypes spanning a genomic region. The authors use an iterative dynamic programming algorithm (IDP) and a faster, but approximate, variant (IADP) to find the minimum description length for a set of haplotypes. Their method is implemented in the program MDBlocks [9].
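The MDL intuition can be shown with a toy cost function: a partition pays a model cost for spelling out each block's distinct haplotypes and a data cost for encoding which haplotype each chromosome carries. This simplified cost is our own illustration, not Anderson and Novembre's actual coding scheme:

```python
from collections import Counter
from math import log2

def description_length(haplotypes, boundaries):
    """Toy description length of a block partition, in bits. `boundaries`
    lists the SNP indices where one block ends and the next begins."""
    total = 0.0
    edges = [0] + boundaries + [len(haplotypes[0])]
    for start, end in zip(edges, edges[1:]):
        block = [h[start:end] for h in haplotypes]
        counts = Counter(block)
        total += len(counts) * (end - start)  # model: list distinct haplotypes
        for c in counts.values():             # data: -log2(frequency) per copy
            total += -c * log2(c / len(block))
    return total

# Splitting at the point where LD decays (between SNPs 1 and 2) yields a
# shorter description than treating the region as one diverse block.
haps = ["0000", "0011", "1100", "1111"]
print(description_length(haps, [2]) < description_length(haps, []))  # True
```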
Previous studies of the empirical performance of block partitioning methods have used data sets with differing minor allele frequency cutoffs. The studies of Daly et al. [5], Patil et al. [3], and Gabriel et al. [6] used minor allele frequency cutoffs of 5%, 10%, and 20%, respectively. Schulze et al. [14] assessed the effects of varying the minor allele frequency cutoff on the number of blocks and tag SNPs inferred by the LD-based method of Gabriel et al. [6] and the diversity-based method of Zhang et al. [8]. As rarer SNPs were removed and the allele frequency cutoff raised, the number of blocks inferred decreased for both methods, showing that block structure is highly influenced by the allele frequencies of the SNPs used in the analysis.
Ke et al. [15] studied the impact of SNP density on block boundaries from three different partitioning algorithms: the previously discussed LD-based approach of Gabriel et al., the four-gamete test [16], and a D' threshold approach of Phillips et al. [17]. The authors genotyped over 5000 SNPs in a 10 Mb region of chromosome 20 in four different populations: CEPH families, U.K. Caucasians, African Americans, and East Asians. Block boundaries of the algorithms were assessed at marker densities ranging from one SNP per 2 kb to one SNP per 10 kb. Their results show that longer blocks inferred at sparser densities are broken into smaller blocks as more SNPs are added. Other studies describing the LD block structure of the human genome also used varying marker densities. The study by Phillips et al. [17] on chromosome 19 used an average marker density of one SNP per 17.65 kb with a median of 5.5 kb. Gabriel et al. [6] used an average density of one SNP every 2 kb. Daly et al. [5] used a density of approximately one marker every 5 kb. Patil et al. [3] used a higher density of one SNP every 1.3 kb, and theirs was the only study to completely re-sequence the entire chromosome in all 20 samples.
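Among the algorithms compared by Ke et al., the four-gamete test is the simplest to state: two SNPs can share a block only if at most three of the four possible two-locus haplotypes are observed, since seeing all four implies a historical recombination (or recurrent mutation) between them. A minimal sketch with invented haplotypes:

```python
def four_gamete_compatible(haplotypes, i, j):
    """True if SNPs i and j pass the four-gamete test, i.e. fewer than four
    of the possible two-locus gametes (00, 01, 10, 11) are observed."""
    gametes = {(h[i], h[j]) for h in haplotypes}
    return len(gametes) < 4

print(four_gamete_compatible(["00", "01", "11"], 0, 1))        # True
print(four_gamete_compatible(["00", "01", "10", "11"], 0, 1))  # False
```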
The ideal data set for fully assessing the performance of block partitioning algorithms would be a comprehensively re-sequenced large genomic region in a large number of independent chromosomes. Unfortunately, such data are not available at this time; only a limited number of samples have been re-sequenced extensively. In addition to the study by Patil et al., as of June 2005 the SeattleSNPs project [18] has re-sequenced 234 human genes in 24 African-American and 23 European CEPH samples, spanning a total of 4868 kb of sequence. The ENCODE project [19] intends to re-sequence five 500 kb genomic regions in the 48 individuals of the HapMap Consortium data set [20].
Therefore, to fully assess the performance of block partitioning algorithms, we generated three populations of 1000 haplotypes each using the coalescent, a stochastic technique that simulates the genetic history of a sample of chromosomes [11]. Haplotypes representing a 200 kb chromosomal region for three world populations (European, African American, and East Asian) were simulated using an implementation of the coalescent that incorporates a population-specific demographic history. The population-specific profiles we used were previously published in Marth et al. [21], where the authors derive a closed mathematical formula for the allele frequency spectrum under a specified demographic profile. The demographic profile for each population was derived by computing the allele frequency spectra predicted by Marth's equation for numerous demographic scenarios and testing the fit between each predicted spectrum and the observed spectrum from the SNP Consortium data set [1] for that population.
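As a rough illustration of the simulation technique (not the implementation used in this study, which requires population-specific demographic histories), a constant-size neutral coalescent with infinite-sites mutation can be sketched as:

```python
import math
import random

def simulate_haplotypes(n, theta, seed=0):
    """Simulate n haplotypes under a constant-size neutral coalescent with
    scaled mutation rate theta, using the infinite-sites model."""
    rng = random.Random(seed)

    def poisson(lam):
        # Knuth's method; adequate for the small per-branch means here.
        limit, k, p = math.exp(-lam), 0, 1.0
        while True:
            p *= rng.random()
            if p <= limit:
                return k
            k += 1

    lineages = [frozenset([i]) for i in range(n)]  # descendant tips per lineage
    haplotypes = [[] for _ in range(n)]
    while len(lineages) > 1:
        k = len(lineages)
        t = rng.expovariate(k * (k - 1) / 2)       # time to next coalescence
        for lineage in lineages:                   # mutations fall on each branch
            for _ in range(poisson(theta / 2 * t)):
                for tip in range(n):               # new segregating site
                    haplotypes[tip].append(1 if tip in lineage else 0)
        a, b = rng.sample(range(k), 2)             # merge two random lineages
        merged = lineages[a] | lineages[b]
        lineages = [l for i, l in enumerate(lineages) if i not in (a, b)]
        lineages.append(merged)
    return ["".join(map(str, h)) for h in haplotypes]

haps = simulate_haplotypes(n=10, theta=5.0)
print(len(haps), len(haps[0]))  # 10 haplotypes over a random number of sites
```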
In the study presented here, we partitioned our coalescent-derived haplotypes into blocks using the three algorithms described above (diversity-based, LD-based, and information-theoretic). We assessed differences among the algorithms in the number, size, and coverage of blocks under varying marker density, allele frequency cutoff, and sample size. Our results show a great divergence in the haplotype blocks predicted by each method and support the notion that it may be advisable to use multiple algorithms in parallel to comprehensively account for all haplotype blocks in the human genome.