To experimentally evaluate the efficacy of our multimarker method, we used the HapMap dataset to compare individual genotyping and pooling under an example GWA study. From the HapMap project, we are able to retrieve the genotypes for the CEU population. We randomly split CEU trios into two separate pools, consisting of 41 individuals in pool A and 47 individuals in pool B to create a model GWA study whereby the genotypes for each individual were certain. Due to sample quality, we excluded one individual from a given trio from each pool. Both pools were run on duplicates Illumina 550K v3 arrays, Illumina 450S Duo arrays and Affymetrix 5.0 arrays, respectively. For each microarray, we removed the lowest 1% of raw intensity values and normalized the microarray by dividing by the mean channel intensity.
We were able to probe 504 604 SNPs on the Illumina 550K v3 arrays, with 487 723 (~96.6%) of those SNPs having associated genotypes in the HapMap dataset. On the Illumina 450S Duo arrays, we were able to probe 510 506 SNPs, with 493 495 (96.7%) of those SNPs having associated genotypes in the HapMap dataset. Finally, we were able to probe 440 729 SNPs on the Affymetrix 5.0 arrays, with 427 254 (~96.9%) of those SNPs having associated genotype information in the HapMap dataset. There were two replicate for each pool and platform.
To evaluate our multimarker method, we used all SNPs on the arrays filtering out those that could not generate pooling test statistic due to errors or insufficient data. Nevertheless, it has been found that there is an enrichment of false-positives due to genotyping error among the most significant SNPs when individually genotyping (Hua
et al.,
2007). In order to accurately and fairly assess our approach, it is necessary to remove these false positives due to genotyping error. In other words, simply because a SNP is identified as the single most associated SNP by individual genotyping, this does not mean that this result is not due to a calling problem, copy number variant or an assay problem. While there is no perfect method to screen out SNPs that give rise to false positives, the most accepted approach is a series of filters. Thus, only SNPs passing the following filters were used in successive order for evaluation:
- All SNPs that had an individual genotyping minor allele frequency >0.05.
- All SNPs that had less than two no calls in both case and control pools, respectively, with HapMap genotypes.
- All SNPs that when tested for Hardy–Weinberg equilibrium with a χ2-test had a P−value ≥ 0.01 across cohorts.
- Only autosomal SNPs were used.
- All SNPs that had at least one other SNP in LD with value of R2≥0.8.
- All SNPs that had genotype data in the HapMap.
We define the true rank of a SNP to be the rank of the SNP according to the Fisher's exact
P-value from individual genotype data. Additionally, we define the top X truly associated SNPs as those SNPs are in the top X inclusive when ranked. We adopt these filters because the remaining SNPs allow us to better assess the performance of our method. A detailed explanation of these filters can be found in the
Supplementary Results. From these filters, we are left with 139 202 SNPs (~29.1%) for the Illumina 550K v3 arrays, 87 678 SNPs (~27.6%) for the Illumina 450S Duo arrays and 194 074 SNPs (~44.0%) for the Affymetrix 5.0 arrays, with the overwhelming majority filtered by the fifth criteria in all three cases.
For the analysis of Illumina 550K data alone, using previous individual genotype data, we were able to correct for preferential amplification during the PCR process for the Illumina arrays. This was done through a traditional
k-correction factor (Hoogendoorn
et al.,
2000; Le Hellard
et al.,
2002). This type of correction can significantly reduce biases in alleles between true and observed allelic frequency. Nevertheless, for both the Illumina 450S Duo analysis and the Affymetrix 5.0 analysis there was no previous individual genotype data associated with the version of arrays used. Additionally, when combining platforms,
k-correction was not used. It is also interesting to note that because the HapMap CEU individuals are composed of trios, the number of independent chromosomes per trio is four instead of six for unrelated individuals, which may cause our variance estimates to be less accurate than when using unrelated individuals.
3.1 Analysis improvement by a multimarker statistic
comparison between single marker (SA) and multimarker (MM) analysis of a pooled GWA study is shown in and
Supplementary Figures 1 and 2 under several scenarios.
Supplementary Figures 1 and 2 show the multimarker analysis for Affymetrix 5.0 arrays and Illumina arrays, respectively, and considers the trade-off between restricting our analysis to a single platform (alone) and combining platforms (combined). To analyze the data on Illumina platform, we combined the data from Illumina 550K v3 arrays and the Illumina 450S Duo arrays for a total of 1 015 110 SNPs before filtering, and 309 688 SNPs (30.5%) after filtering. shows a combined multimarker analysis when data is merged from the three different microarrays. When combining the Illumina 550K v3 arrays, Illumina 450S Duo arrays and Affymetrix 5.0 arrays, the total number of SNPs before filtering was 1 333 631 SNPs and 560 202 SNPs (~42.0%) after filtering. We completed the same analysis within each figure (A-E). There are various methods by which one could evaluate performance of a multimarker statistic, and this choice is largely arbitrary. We observe that our test statistic presented may not follow a chi-square distribution and therefore we choose a rank-based evaluation (Spearman rank correlation). We also wish to perform an evaluation on how a researcher would use the data and so we focus on the number of true associations identified from individual genotyping that would be carried forward in a two-stage design. We define the analysis of our initial pooling test statistic as single marker analysis (SA) and our multimarker test statistic as multimarker analysis (MM).
3.1.1 Evaluation metric 1—identification of the most associated SNPs within a two-stage design [A B] Within a GWA study, typically the first objective of the researcher is to identify those SNPs exhibiting the largest change in allelic association. Typically, and especially with two-stage GWA designs, a somewhat arbitrary number of SNPs are taken forward for individual genotyping in order to accurately calculate significance, reducing the dataset from 500K+to a few hundred or few thousand. Suppose we wish to carry forward as little as 100 and at most 5000 SNPs for validation. Therefore, it is important to consider what percentage of the true associated SNPs that are observed in the set of SNPs carried forward. For this analysis, we consider the top 100 truly associated SNPs. In A and
Supplementary Figures 1 and 2, we plot for a given observed rank threshold (the number of SNPs to be carried forward), the percentage of SNPs that were observed to be in the observed rank threshold (
x-axis) and were a top 100 truly associated SNP. We plot this percentage for both the Affymetrix and Illumina platforms, respectively, and also for both the single marker and multimarker analysis, respectively. In B, we simply look at the difference, or improvement, in the percentages from observing the single marker ranks to observing the multimarker ranks.
3.1.2 Evaluation metric 2—rank correlation [A and B] Another measure of significance is the correlation between the true ranks and our observed single marker or multimarker ranks. In C and
Supplementary Figures 1 and 2, we plot the Spearman rank correlation between the true and observed ranks considering the SNPs within the true rank threshold (
x-axis). We see in D the improvement in correlation between the single marker ranks and the multimarker ranks.
3.1.3 Evaluation metric 3—identification of top SNPs and directionality of change [E] In E, we plot the percentage of SNPs that fall within one of two criteria: either the SNP was both observed in the top 100 and within the true rank threshold (x-axis) or the SNP was moved in the correct direction by the multimarker analysis. A SNP moves in the correct direction if the multimarker rank is closer to the true rank than the single marker rank. Our main goal is to improve the correspondence between the true and observed ranks and thus an improvement in the observed ranks should be found.
We clearly see in and
Supplementary Figures 2 and 2 that the multimarker rank improves on the single marker under all scenarios. As an example (see
Supplementary Table 2), consider the single marker and multimarker ranks for the top 100 truly associated SNPs on the Illumina platform when considering just the 450S Duo and 550K platforms (alone) and when considering all three microarray types (combined). We notice that the improvement is greater on the Affymetrix platform, which is expected since there is greater noise and the number of probes per SNP is fewer than on the Illumina platform. The improvement is significant since using the multimarker method [B] we potentially increase the number of the top 100 truly associated SNPs carried forward by 5–35% depending on the number of observed SNPs to be carried forward for validation. Furthermore, if we combine the information from all platforms, we can include 100% of the top 100 truly associated SNPs by carrying forward the top 2500 observed SNPs. We can include 90% of the top 100 truly associated SNPs by carrying forward the top 1000 observed SNPs. We see the correlation between the true ranks and the observed ranks is higher in the multimarker analysis [C and D], and we verify this in E by seeing the directionality of that change. Additionally, in analyzing
Supplementary Figures 1 and 2, we see there is an improvement in combining both the Affymetrix and Illumina data versus considering them separately, clearly suggesting that our method improves further when data from multiple platforms are combined.
3.2 Simulation
We performed a simulation of a pooling study using pools composed by random sampling individuals from the 1958 Control Cohort of the Wellcome Trust dataset [The Wellcome Trust Case Control Consortium, 2007]. Ignoring duplicates, relatives and other data anomalies left a total of 1423 individuals. The genotype calls for these individuals were provided from the WTCCC and were previously genotyped on the Affymetrix 500K platform. Using this dataset, we simulated the Affymetrix 5.0 arrays by using four probes per SNP and by adding a mean zero error with variance 0.006 to the value of each probe. The probe variance of 0.006 matches our observed probe variance for Affymetrix 5.0 arrays. Our simulated study design consisted of pools of one hundred cases and one hundred controls with four replicate arrays for each cohort using a total number of 500 567 SNPs. Similar to our experimental analysis, we assumed a correlation structure to that of the HapMap and used the correlation structure (LD) found in the HapMap as input to our method. We evaluated the simulation results using the same metrics used to evaluate our experimental results. Unlike in the experimental results, the correlation structure used in the simulations was not directly trained from the WTCCC data but instead from the HapMap. Nevertheless, the results showed almost exactly the same improvements as the empirical results, including a noticeable increase in Spearman rank correlation for our multimarker method over the single marker method (figures omitted). In particular, we found an increase of 5–35% of true associated SNPs would be carried forward for validation if we were to go from 100 to 5000 SNPs for validation, which is precisely the result observed experimentally.
3.3 Imputation
To test the efficacy of our imputation method, we performed the same experimental analysis described above, including a list of SNPs to be imputed. We imputed only SNPs from the HapMap that had an LD value of at least 0.8 with an observed SNP on one of the two Illumina platforms (450S Duo and 550K v3). Additionally, we used the same filters previously stated and only used observed SNPs to impute if the observed SNPs had a true rank from genotypes as described above. In
Supplementary Figure 3, we see Evaluation Metrics 1 and 2 in A and C. We compare four methods, a baseline where we perform no imputation as in the previous analyses and three methods where the minimum number of proxies required for imputation is one, two and three, respectively. For at least one, two and three, minimum required proxies we imputed were 748 348, 544 041 and 424 314 SNPs, respectively. As the minimum numbers of proxies required are increased, our imputation method performs slightly better and approaches the results from if we directly measure the given SNP [see
Supplementary Fig. 3A and C]. This is expected since the SNPs imputed are in strong LD with our unobserved SNP and the number of indirect observations increases as the minimum number of proxies required increases. For some imputed SNPs in stretches of high LD, we see over a hundred proxies, opening the possibility to a great deal of information to be recovered as well as a problem of overfitting. Nevertheless, the imputation method achieves a high rank correlation between the true ranks and imputed ranks, and maintains a large number of truly associated SNPs within the rank threshold.