For the first 8 methods on the 11 datasets, we record the run time, the sensitivity at FPR = 0.5%, the ROC curves, and the AUC (the AUC is not strictly necessary but provides additional detail). Because of the high computational cost of PLR (it took 2 days to select the top 30 SNPs in the 100-SNP data), we record only its sensitivity at FPR = 0.5% on the 100-SNP data.
The parameters are set as close as possible to the default settings in [4] or in the related software whenever possible; we modified only a few parameters in order to keep the computational cost and performance within a reasonable range. When testing MDR on the 1000-SNP data, we used heuristic search (100,000 evaluations) instead of exhaustive search because of limited memory and high computational complexity. When testing BEAM, we enlarged the number of Markov chains from 1 to 10 and increased the number of initial tries from 20 to 40 in order to obtain more complete results.
For the statistical filtering methods (chi-2 test, FET and LR), we detect only the marginal effects of each SNP. For FIM, IG and PLR, we detect up to 2-way SNP interactions with exhaustive search, and for MDR and SH, we detect up to 3-way SNP interactions. The run time (including the time for loading data) used by the 9 methods to output the full ranking of SNPs is recorded in the table below. All experiments were run on a desktop with a 3 GHz CPU and 2 GB RAM, running Windows XP.
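To make the marginal-effect tests above concrete, the sketch below applies the chi-2 test and FET to a single SNP's case/control genotype counts. The counts are hypothetical, and the dominant-model collapse is one of the two model assumptions FET requires here, since scipy's `fisher_exact` handles only 2x2 tables:

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

# Hypothetical case/control genotype counts for one SNP.
# Rows: cases, controls; columns: genotypes AA, Aa, aa.
table = np.array([[120, 60, 20],
                  [100, 80, 20]])

# chi-2 test on the full 2x3 table (dof = (2-1) * (3-1) = 2).
chi2, p_chi2, dof, _ = chi2_contingency(table)

# fisher_exact handles only 2x2 tables, so collapse the genotypes
# under a dominant model (AA vs. Aa + aa) before testing.
dominant = np.array([[table[0, 0], table[0, 1] + table[0, 2]],
                     [table[1, 0], table[1, 1] + table[1, 2]]])
odds, p_fet = fisher_exact(dominant)
```

LR would fit a logistic model of disease status on the genotype in the same single-SNP setting; all three tests rank SNPs by their resulting p-values.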
Run time (sec) of 9 methods on 100-SNP data and 1000-SNP data. The time for MDR on 1000-SNP data is estimated for exhaustive search.
The ROC curves for the 100-SNP and 1000-SNP data of the 1st dataset are shown in the figures below. The ROC curves and their AUCs are distinguished by color, as indicated in the lower right corner of each figure.
ROCs of 8 methods on 100-SNP data of set 1
ROCs of 8 methods on 1000-SNP data of set 1
For the 100-SNP and 1000-SNP data in all 11 datasets, the AUC and the sensitivity at FPR = 0.5% are shown in the tables below. The last row of each table records the mean value of the performance measure across all datasets.
Sensitivity when FPR=0.5% of 9 methods on 100-SNP data
AUC of 8 methods on 1000-SNP data
From these figures and tables, we can see that:
Although the global performance (AUC) of these 9 methods is fairly good, the sensitivities at low FPR are not satisfactory: at FPR = 0.5%, only a small fraction of the ground-truth SNPs is selected by the 9 methods in most datasets. As stated in Section 2.3, sensitivity at low FPR is more critical than AUC, so the results are actually unsatisfactory. This may be caused by the aforementioned challenges in detecting SNP epistasis effects, or by limitations of the methods themselves. For example, statistical filtering methods cannot detect SNP interactions well; the chi-2 test and LR are based on asymptotic distributions and therefore require large sample sizes; FET is an exact test but assumes either a dominant or a recessive genetic model; FIM has high degrees of freedom, which also requires a large sample size and degrades the accuracy of its statistical significance; and the cross-validation prediction error of the multifactor model (MDR) is not sensitive enough to differentiate ground-truth SNPs from other SNPs.
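The sensitivity-at-low-FPR criterion can be made concrete with a small sketch (the function name and thresholding convention are our own; we assume each method ranks SNPs by a score and that ground-truth SNPs are labeled 1):

```python
import numpy as np

def sensitivity_at_fpr(scores, labels, fpr_target=0.005):
    """Fraction of ground-truth SNPs (labels == 1) recovered before the
    false positive rate along the ranked SNP list exceeds fpr_target."""
    order = np.argsort(-np.asarray(scores, dtype=float))  # best score first
    labels = np.asarray(labels, dtype=int)[order]
    fpr = np.cumsum(labels == 0) / max((labels == 0).sum(), 1)
    tpr = np.cumsum(labels == 1) / max((labels == 1).sum(), 1)
    keep = fpr <= fpr_target
    return float(tpr[keep].max()) if keep.any() else 0.0
```

At FPR = 0.5% with 1000 irrelevant SNPs, only the SNPs ranked above roughly the 5th false positive count toward sensitivity, which is why a method can have a high AUC and yet a near-zero sensitivity under this criterion.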
Comparing the relative performance of the 9 methods, we can see that although the statistical filtering methods (FET, chi-2, LR) can detect only the marginal effects of SNPs, their performance is relatively good. The methods that consider SNP interactions (FIM, IG, MDR, SH, BEAM, PLR) do not achieve significantly better sensitivity or ROC performance than the statistical filtering methods. From the tables, we can see that IG, FIM and BEAM have similar but slightly lower mean sensitivity and mean AUC than FET, chi-2 and LR, while MDR and SH perform noticeably worse. Only PLR has slightly (but not consistently) better sensitivity on the 100-SNP data, but its high computational cost prevents us from evaluating it further on the 1000-SNP data.
By looking into the SNP interactions selected by FIM, IG, MDR, SH, BEAM and PLR, we found that they do detect some ground-truth SNP interactions that cannot be detected by the statistical filtering methods. These methods have their merits: they consider interaction effects and take account of the nonlinearity and complexity of SNP interactions. Moreover, some of them, such as SH, have explicitly devised principles to detect SNP interactions with insignificant marginal effects. An interesting observation is that although some ground-truth SNP interactions are detected, many false positives are also highly ranked by these methods, and the false positive SNPs are typically selected, mistakenly, as interacting with ground-truth SNPs that have strong marginal effects. The performance of these methods is therefore heavily weakened by false positives: their criterion functions are not sensitive enough to differentiate ground-truth SNPs from false positives, for various reasons, e.g. multiple testing, high degrees of freedom, and improper underlying models.
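The kind of pairwise criterion at work here can be sketched as the interaction gain I(SNP1, SNP2; Y) − I(SNP1; Y) − I(SNP2; Y). The code below is an illustrative reconstruction under our own naming and encoding, not the exact implementation of IG or any other tested method; it shows how a pure XOR-style interaction carries zero marginal information but positive gain:

```python
import numpy as np

def mutual_info(x, y):
    """I(X; Y) in bits between two discrete arrays."""
    _, xi = np.unique(x, return_inverse=True)
    _, yi = np.unique(y, return_inverse=True)
    joint = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(joint, (xi, yi), 1)          # joint count table
    p = joint / joint.sum()
    px, py = p.sum(1, keepdims=True), p.sum(0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log2(p[nz] / (px @ py)[nz])).sum())

def interaction_gain(snp1, snp2, pheno):
    """Information the genotype *pair* carries about the phenotype
    beyond the two marginal effects."""
    pair = np.asarray(snp1) * 3 + np.asarray(snp2)  # joint genotype code
    return (mutual_info(pair, pheno)
            - mutual_info(snp1, pheno) - mutual_info(snp2, pheno))
```

Such a criterion finds interactions with no marginal signal, but it also explains the false positive pattern above: pairing any SNP with a strong-marginal ground-truth SNP already yields a nonzero pairwise score, so noise SNPs ride along in the top ranks.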
As to computational complexity, the run time of the statistical filtering methods is quite modest on the 100-SNP data and grows linearly on the 1000-SNP data. For the methods considering interactions, the run time grows quickly. Among them, SH is the fastest, but its efficiency comes at the price of SNP discovery accuracy. For IG, FIM and BEAM, the computation increases at least quadratically with the number of SNPs. This calls their applicability to real genome-wide scenarios into question: with about 500k SNPs, IG alone is estimated to take at least 300 days. For MDR and PLR, the computational burden is even heavier, which makes them unrealistic on large datasets. Certainly, the algorithms considering interactions are still useful and applicable to candidate-gene-based association studies; alternatively, they can serve as post-processing methods after the whole set of SNPs has been pre-screened to a manageable size by other methods, such as the statistical filtering methods.