Despite reasons to expect that SNP-MaP estimates for the 100K microarray set might not be as reliable and valid as those for the 10K microarray, the present results for the 100K microarray set are as promising as our previous results for the 10K microarray (7). For the 100K microarray set, allelotyping of SNPs for pooled DNA was as successful as individual genotyping: 95% of SNPs yielded interpretable results (DSsnp ≥ 0.04), which means that results for nearly 110 000 SNPs can be expected for the 100K microarray set. Concerning reliability, the average correlation among the five subpools in the present study using the 100K microarray set was 0.969; our previous work on the 10K microarray yielded an average correlation of 0.955. Concerning validity, the average correlation for the 100K microarray set was 0.939 between our SNP-MaP allele frequency estimates from pooled DNA and individual genotyping results from the NetAffx standardization sample; for the 10K microarray, the average correlation was 0.904.
As expected, k-correction made no difference for reliability, which involves relative comparisons of allele frequency estimates between different groups; it should be emphasized that relative comparisons between groups, such as case versus control groups, are the purpose of SNP-MaP. For validity comparisons between SNP-MaP allele frequency estimates from pooled DNA and estimates based on individual genotyping, k-correction improved the correlations.
Despite the high reliability of SNP-MaP estimates, the average difference in allele frequency estimates from pooled DNA is 0.036. In other words, SNP-MaP can only detect allele frequency differences >0.036 between two DNA pools. To increase the sensitivity to detect allele frequency differences between groups, multiple DNA pools of independent subsamples from each group are recommended. The use of multiple independent DNA pools also permits the use of parametric statistics because it assesses sampling variation. With five independent subpools as in the present experiment, the SD is 0.041, which implies that allele frequency differences of 0.075 between groups (e.g. allele frequencies of 0.500 for cases and 0.575 for controls) can be detected with 80% power (P = 0.05, two-tailed). Doubling the number of replicate DNA pools from 5 to 10 does not alter the SD but will reduce the SEM, which scales inversely with the square root of the number of replicates. That is, with five replicates we observed a mean SEM of 0.019, whilst 10 replicates should yield an SEM of ~0.013, which would yield 80% power to detect differences of 0.053 and 99% power to detect differences of 0.082. Although doubling the cost of an experiment seems a considerable price to pay for these small gains in power, we advocate the use of 10 replicates in order to maximize power to detect QTLs of small effect size.
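The relationship between the number of replicate pools, the SEM and the detectable difference can be sketched as follows. This is a rough illustration under stated assumptions, not the calculation used for the figures above: it assumes equal numbers of pools per group, a common SD of 0.041, and a normal approximation to a two-tailed, two-group comparison. The function names are hypothetical. Under these assumptions the results land close to, but not exactly on, the values quoted in the text.

```python
from math import sqrt
from statistics import NormalDist


def sem(sd, n_pools):
    """Standard error of the mean for n_pools independent replicate pools."""
    return sd / sqrt(n_pools)


def detectable_difference(sd, n_pools, power=0.80, alpha=0.05):
    """Smallest group difference detectable with the given power.

    Assumptions (not taken from the paper): normal approximation,
    two-tailed test, n_pools pools in each of two groups, equal SD.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value for the two-tailed test
    z_power = z.inv_cdf(power)           # quantile corresponding to the target power
    se_diff = sd * sqrt(2 / n_pools)     # SE of the difference between two group means
    return (z_alpha + z_power) * se_diff


SD = 0.041  # SD across subpools reported in the text

for n in (5, 10):
    print(
        f"{n} pools: SEM = {sem(SD, n):.4f}, "
        f"80% power difference = {detectable_difference(SD, n):.3f}"
    )
print(f"10 pools, 99% power: {detectable_difference(SD, 10, power=0.99):.3f}")
```

A t-based calculation, which accounts for the small number of pools, would give somewhat larger detectable differences than this normal approximation.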
We also recommend that individual genotyping be used to confirm SNP-MaP screening. Because SNP-MaP estimates of allele frequency involve errors of estimation due to pooling DNA, group differences in allele frequency estimates will be reduced when SNPs nominated by SNP-MaP are individually genotyped. For this reason, it is unlikely that allele frequency differences between groups as small as 0.05 can be detected reliably—a more reasonable target is SNP-MaP differences >0.10. Power to detect allele frequency differences at the confirmation stage of individual genotyping depends directly on sample size.
Although power is the crucial issue in detecting QTL associations of small effect size, the balance between false positive and false negative results becomes especially important when so many tests are conducted. For example, using a nominal P-value of 0.05 across 100 000 SNPs, 5000 statistically significant results are expected by chance alone; winnowing the true results from the false positives will be difficult to resolve statistically. Although the obvious statistical solution is to adopt a more stringent P-value threshold to protect against false positive results due to multiple testing (21), a multistage approach could provide a better balance between false positive and false negative findings (3). In the end, the solution to this conundrum will be empirical rather than statistical: independent replication.
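The scale of the multiple-testing problem can be made concrete with a short sketch. The Bonferroni correction shown here is one common way to tighten the threshold; it is used as an assumption for illustration and is not necessarily the correction discussed in reference 21. The function names are hypothetical.

```python
def expected_false_positives(n_tests, alpha):
    """Number of tests expected to reach significance by chance alone
    when every null hypothesis is true."""
    return n_tests * alpha


def bonferroni_threshold(n_tests, family_alpha=0.05):
    """Per-test P-value threshold that holds the family-wise error rate
    at family_alpha under the Bonferroni correction."""
    return family_alpha / n_tests


N_SNPS = 100_000  # approximate number of SNPs on the 100K microarray set

print(f"expected false positives at P = 0.05: {expected_false_positives(N_SNPS, 0.05):.0f}")
print(f"Bonferroni per-SNP threshold: {bonferroni_threshold(N_SNPS):.1e}")
```

A threshold this strict sharply reduces power to detect small effects, which is why a multistage design followed by independent replication may offer a better trade-off.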
It is generally agreed that >100 000 SNPs are needed for genomewide association scans. Because the SNP-MaP approach works equally well for the 100K microarray set as for the 10K microarray, we anticipate that the approach will also work for the 500K microarray set which is now available. It should be mentioned that the SNP-MaP approach is also likely to work for any other SNP microarrays such as gene-based microarrays, or microarrays with functional SNPs that would permit more powerful direct association analyses rather than indirect association analyses that rely on linkage disequilibrium between SNPs and QTLs.
Limitations of the SNP-MaP approach include the additional error that comes from estimating an average allele frequency from pooled DNA rather than from each individual. Accuracy would of course be better if each individual's DNA were genotyped on separate microarrays, but the expense would prohibit most researchers from studying the very large samples needed to detect QTLs of very small effect size. For example, assuming a cost of £500 per 100K microarray set, it would cost £500 000 to genotype a sample of 1000 individuals on separate microarrays. In comparison, a SNP-MaP case–control study using 10 independent case pools and 10 independent control pools with a replication design of an additional 10 case and 10 control pools would cost £30 000, including the cost of DNA pool construction. The cost of confirmation with individual genotyping will then depend largely upon how many statistically significant SNPs are selected. Assuming a cost of £0.05 per genotype, even if 4700 SNPs (far more SNPs than would reasonably be followed up) were individually genotyped, the total SNP-MaP study cost would be half that of using separate microarrays: £250 000.
Our results indicate that the SNP-MaP approach yields substantial reliability and validity to screen for the largest allele frequency differences between case and control groups. A greater limitation is that pooled DNA can only be used to estimate allele frequencies rather than genotypic frequencies, which means that haplotypes cannot be investigated at the SNP-MaP screening stage, although haplotypes could be incorporated into individual genotyping strategies at the confirmation stage. These costs are offset by the tremendous benefits of screening many thousands of SNPs using the very large samples needed to detect QTLs of small effect size.