Single nucleotide polymorphisms (SNPs) are the most abundant type of polymorphism in the human genome. With the parallel developments of dense SNP marker maps and technologies for high-throughput SNP genotyping, SNPs have become the markers of choice for genetic association studies. The use of dense but incomplete maps of SNP markers for genetic association is based upon the premise that low-penetrance but fairly common disease variants can be detected by virtue of indirect association between SNP markers and disease status. As a general rule, the denser the map of markers used, the greater the probability that at least one marker will be in strong linkage disequilibrium (LD) with a disease susceptibility allele, and therefore indirect association between marker and disease will be detected [1].
With the development of genotyping platforms that permit analysis of several hundred thousand markers, it is now possible to apply this principle of indirect association to the whole genome rather than just candidate genes or candidate linkage regions. For example, Affymetrix (Santa Clara, California) recently released microarrays that can interrogate ~500,000 SNPs, and in January 2006 Illumina (San Diego, California) released the Sentrix® HumanHap300 Genotyping BeadChip, which can genotype 317,504 high-value SNP loci derived principally from tag SNPs. Theoretical predictions [2] as well as empirical data concerning the structure and distribution of LD in the human genome [3] suggest that analyses on this scale will probably be adequate for whole genome association studies targeted at common disease variants.
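As an illustration of how the LD between a marker and a susceptibility allele can be quantified, the following minimal sketch computes the standard pairwise measure r² from haplotype and allele frequencies. The function name and the example frequencies are hypothetical and chosen for illustration only; they are not data from the studies cited here.

```python
def ld_r_squared(p_ab: float, p_a: float, p_b: float) -> float:
    """Return r^2 between two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A (marker) and allele B (disease locus)
    p_a:  frequency of allele A
    p_b:  frequency of allele B
    """
    # D is the deviation of the observed haplotype frequency from independence.
    d = p_ab - p_a * p_b
    denom = p_a * (1.0 - p_a) * p_b * (1.0 - p_b)
    if denom <= 0.0:
        raise ValueError("both loci must be polymorphic (0 < frequency < 1)")
    return d * d / denom


# Hypothetical frequencies (illustrative assumptions only).
perfect_ld = ld_r_squared(p_ab=0.30, p_a=0.30, p_b=0.30)   # marker is a perfect proxy
no_ld = ld_r_squared(p_ab=0.09, p_a=0.30, p_b=0.30)        # loci are independent
partial_ld = ld_r_squared(p_ab=0.20, p_a=0.20, p_b=0.40)   # intermediate case

print(f"perfect: {perfect_ld:.3f}, none: {no_ld:.3f}, partial: {partial_ld:.3f}")
```

A marker with r² close to 1 is a near-perfect proxy for the susceptibility allele, whereas a marker with r² close to 0 carries almost no information about it.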
The number of subjects required to detect the influence of a risk allele by indirect association depends upon the locus-specific genotype relative risks conferred by the susceptibility variant and the maximum LD between it and any assayed marker. For unknown loci, these parameters can only be guessed, but the expectation is that the relative risks will usually be small and therefore the required samples large. Substantial samples are also required to offset the enormous degree of multiple testing inherent in genome-wide studies. Thus an uncorrected threshold for statistical significance of α = 10⁻⁷ is required to achieve a genome-wide type I error rate of only 0.05 in the face of testing 500,000 independent SNPs (0.05/500,000 = 10⁻⁷). Although this is somewhat conservative since many markers are in LD (and therefore the tests are not independent), it serves as a rough approximation to the scale of the statistical burden. These dual considerations of small genetic effect sizes and adjustment for multiple testing have led many to assume that samples of at least 1000 cases and a similar number of controls will be required for most complex disorders [e.g. [4]]. Given these expected sample sizes, while genome-wide association studies are indeed technically feasible, they are also expensive.
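The arithmetic behind the per-test threshold can be sketched as follows. The Bonferroni calculation reproduces the figure quoted in the text; the Šidák variant is included only for comparison and is an addition, not something used in this paper. Function and variable names are hypothetical.

```python
import math


def bonferroni_threshold(family_alpha: float, n_tests: int) -> float:
    """Per-test threshold that bounds the family-wise type I error rate."""
    return family_alpha / n_tests


def sidak_threshold(family_alpha: float, n_tests: int) -> float:
    """Per-test threshold 1 - (1 - alpha)^(1/m), exact for independent tests.

    log1p/expm1 are used to avoid rounding loss when the result is tiny.
    """
    return -math.expm1(math.log1p(-family_alpha) / n_tests)


n_snps = 500_000        # number of SNPs tested, as in the text
family_alpha = 0.05     # desired genome-wide type I error rate

bonf = bonferroni_threshold(family_alpha, n_snps)
sidak = sidak_threshold(family_alpha, n_snps)

print(f"Bonferroni per-test threshold: {bonf:.3e}")
print(f"Sidak per-test threshold:      {sidak:.3e}")
```

Both corrections assume independent tests, so for markers in LD they overstate the true burden, which is the sense in which the threshold is conservative.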
One way to reduce the cost is to undertake quantitative analyses of allele frequencies in DNA pools, a process often referred to as 'DNA pooling' [7]. Here, equal amounts of DNA from patients and from controls are mixed to form two sets of pools, one of cases and one of controls. The pools are then genotyped and the frequency of each allele estimated. The power of such studies is approximately the same as for individual genotyping of cases and controls [4], but at a hugely reduced cost. DNA pooling has proved remarkably accurate when applied to simple tandem repeats [10] or to SNPs using a variety of different genotyping technologies [7]. Typically, when estimates of allele frequency differences between two pools are compared with those obtained by individual genotyping, the mean error of pooled analysis is in the region of 1–2%.
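The comparison described above can be sketched in a few lines. This is a simplified illustration, not the procedure used in any of the cited studies: it assumes that the allele frequency in a pool is estimated as the ratio of one allele-specific signal to the sum of both signals, with no correction for unequal allelic amplification, and all signal values and individual-genotyping differences are invented for the example.

```python
from statistics import mean


def pooled_allele_frequency(signal_a: float, signal_b: float) -> float:
    """Estimate the frequency of allele A in a pool from two allele-specific signals."""
    total = signal_a + signal_b
    if total <= 0:
        raise ValueError("allele signals must sum to a positive value")
    return signal_a / total


def frequency_difference(case_signals, control_signals) -> float:
    """Case-minus-control difference in the estimated frequency of allele A."""
    return pooled_allele_frequency(*case_signals) - pooled_allele_frequency(*control_signals)


def mean_absolute_error(pooled_diffs, individual_diffs) -> float:
    """Mean absolute discrepancy between pooled and individually genotyped differences."""
    if len(pooled_diffs) != len(individual_diffs):
        raise ValueError("both lists must cover the same SNPs")
    return mean(abs(p - i) for p, i in zip(pooled_diffs, individual_diffs))


# Hypothetical (signal_A, signal_B) pairs for the case pool and control pool of three SNPs.
pooled_signals = [
    ((620.0, 380.0), (550.0, 450.0)),
    ((300.0, 700.0), (320.0, 680.0)),
    ((480.0, 520.0), (500.0, 500.0)),
]
# Hypothetical case-control differences obtained by individual genotyping of the same SNPs.
individual_diffs = [0.06, -0.03, -0.01]

pooled_diffs = [frequency_difference(case, control) for case, control in pooled_signals]
error = mean_absolute_error(pooled_diffs, individual_diffs)

print(f"mean absolute error of pooled estimates: {error:.3f}")
```

In practice the raw signal ratio is usually adjusted before use, so the uncorrected ratio here should be read as the simplest possible estimator rather than a recommended one.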
Several groups have begun to apply pooled genotyping to the new ultra-high-throughput genotyping technologies. Butcher et al., 2004 [14] and Meaburn et al. [15] pioneered this method by assessing the performance of the Affymetrix 10 K Array Xba 131 for pooled genotyping. They validated the pooling data by individual genotyping for 10 SNPs in their first experiment [14] and 104 SNPs in the follow-up work [15]. They also compared the pooled data from the remaining markers on the chip with allele frequency data from a reference Caucasian population. The same group recently [16] reported an applied DNA pooling study based upon the 10 K Array with mild mental impairment as a phenotype. They followed up the pooling data for the 12 most significant markers by individual genotyping in a larger replication sample. Four of these SNPs remained significantly associated. Liu et al. [17] recently reported the results of a study in which pools of 20 individuals each were used to identify differences between substance abusers and controls (a total of 1253 individuals were genotyped). This strategy allowed them to identify 38 "nominally reproducibly positive" SNPs.
Although these studies give cause for optimism, the validity of pooled genotyping using array technology has not yet been demonstrated for a sufficiently large number of SNPs to allow researchers to apply the method with confidence. In this paper, we have undertaken a more comprehensive analysis of the accuracy of microarray-based pooling experiments. Rather than examine a small selection of SNPs, we examined 6843 fully informative SNPs out of a total of 10,204 SNPs represented on the Affymetrix 10 K Xba 142 2.0 array. Our results suggest that pooled genotyping using Affymetrix arrays is as accurate as pooled genotyping on lower-throughput platforms, and that it can be performed instead of individual genotyping with only a minimal loss of power.