We compared the performance of the FastEpistasis and PLINK epistasis tests on several sets of SNP pairs, using a single core to enable a fair comparison. FastEpistasis ran almost 15 times faster than PLINK, completing 81 376 epistasis tests per second compared with 5696 tests per second for PLINK (see Supplementary Table 1). When only SNP pair results below a *P*-value threshold are needed, so that post-computation time is negligible, FastEpistasis computes about 120 000 epistasis tests per second, ~20 times faster than PLINK (see also below for the effect of output size in multiple-phenotype analysis). However, the gain in performance depends on the number of individuals in the population, as shown in Table 1 and Supplementary Figure 1. With the exception of *Not A Number* PLINK output, all FastEpistasis results agree perfectly with PLINK.

| **Table 1.** Epistasis tests per second completed by FastEpistasis core computation phase for several population sizes, using eight cores |

The speed of FastEpistasis scales linearly with the number of processors at 93% asymptotic efficiency, on either SMP or MPI architectures (see Supplementary Fig. 2). At this rate, the computational time required to test all pairs of 500 000 SNPs, totaling 125 billion tests, in a population of 5000 individuals is about 29, 4 or 0.5 days using 8, 64 or 512 MPI-bound processors, respectively.
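
The figures quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of FastEpistasis: the function names are illustrative, and the scaling helper assumes perfectly linear scaling between the quoted processor counts, which is a simplification of the 93% efficiency reported.

```python
SECONDS_PER_DAY = 86_400


def n_pairs(n_snps: int) -> int:
    """Number of unordered SNP pairs, n(n-1)/2."""
    return n_snps * (n_snps - 1) // 2


def implied_rate(n_tests: int, days: float) -> float:
    """Aggregate tests per second implied by finishing n_tests in `days`."""
    return n_tests / (days * SECONDS_PER_DAY)


def days_if_linear(base_days: float, base_procs: int, procs: int) -> float:
    """Run time on `procs` processors, assuming ideal linear scaling (an assumption)."""
    return base_days * base_procs / procs


pairs = n_pairs(500_000)          # 124 999 750 000, i.e. about 125 billion
rate_8 = implied_rate(pairs, 29)  # aggregate rate implied for 8 processors
days_64 = days_if_linear(29, 8, 64)
days_512 = days_if_linear(29, 8, 512)
```

Under these assumptions the 29-day figure implies an aggregate rate of roughly 50 000 tests per second on 8 processors, and ideal scaling gives about 3.6 and 0.45 days on 64 and 512 processors, consistent with the rounded values of 4 and 0.5 days.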

FastEpistasis can analyze several different phenotypes simultaneously, using the same genotypes. By performing the QR decomposition of the covariate matrix once and applying the result to several phenotypes, the total number of computations is reduced compared with carrying out the computations separately for each phenotype. Although we observe a significant speed-up with multiple phenotypes, performance peaks and then collapses, eventually becoming a penalty as the number of phenotypes grows (Supplementary Fig. 3). The problem occurs during the core-computation phase and is due to the size of the results: the processors compute the test statistics faster than the results can be buffered and transferred to the hard drive. Suppressing the output entirely removes the performance collapse. The reduction in computational time when analyzing several phenotypes simultaneously depends on several factors, including the speed of the epistasis tests (which in turn depends on the number of individuals) and the number of results to be output. For example, using 8 processors, a population of 171 MKK individuals and 10 phenotypes, the computations are 1.06 times faster than analyzing each phenotype separately when all epistasis results are output, whereas they are 4.77 times faster when only results with *P* < 0.01 (~1% of tests) are output. Therefore, restricting the output to *P*-values below a relatively small threshold, or increasing storage throughput (for example, with a striped disk RAID array), can decrease computational demands when analyzing multiple phenotypes.
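
The saving from factoring the covariate matrix once can be illustrated with a small least-squares sketch. This is not the FastEpistasis implementation: the function names are hypothetical, the QR factorization uses a plain modified Gram-Schmidt scheme for self-containment, and the toy design matrix (an intercept plus one covariate) stands in for the actual regression model.

```python
import math


def thin_qr(X):
    """Thin QR of X (list of rows, full column rank) by modified Gram-Schmidt.

    Returns Q as a list of orthonormal columns and R as an upper-triangular
    list of rows.
    """
    n, p = len(X), len(X[0])
    cols = [[X[i][j] for i in range(n)] for j in range(p)]
    Q, R = [], [[0.0] * p for _ in range(p)]
    for j in range(p):
        v = cols[j][:]
        for k in range(j):
            # Remove the component of v along the k-th orthonormal column.
            R[k][j] = sum(q * x for q, x in zip(Q[k], v))
            v = [x - R[k][j] * q for x, q in zip(v, Q[k])]
        R[j][j] = math.sqrt(sum(x * x for x in v))
        Q.append([x / R[j][j] for x in v])
    return Q, R


def back_substitute(R, c):
    """Solve the upper-triangular system R beta = c."""
    p = len(c)
    beta = [0.0] * p
    for i in reversed(range(p)):
        s = sum(R[i][k] * beta[k] for k in range(i + 1, p))
        beta[i] = (c[i] - s) / R[i][i]
    return beta


def fit_all(X, phenotypes):
    """Least-squares coefficients for every phenotype, factoring X only once."""
    Q, R = thin_qr(X)  # the expensive step, shared by all phenotypes
    fits = []
    for y in phenotypes:
        c = [sum(q * yi for q, yi in zip(col, y)) for col in Q]  # Q^T y
        fits.append(back_substitute(R, c))
    return fits


# Toy data: intercept plus one covariate, and two phenotypes.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
phenotypes = [[1.0, 3.0, 5.0, 7.0], [4.0, 3.0, 2.0, 1.0]]
betas = fit_all(X, phenotypes)
```

Each additional phenotype costs only a product with the orthonormal columns and a triangular solve, which is why the per-phenotype cost falls when the factorization is reused. As the paragraph above notes, this gain can be offset in practice by the time needed to write the results.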