Although the simulation results for GM appear very promising, we focus attention for the remainder of this report on TPM and RTP, in order to address some fundamental issues of interpretation. In the extended results described previously, RTP exhibits higher power than TPM. However, there are important differences in the way the null hypotheses of RTP and TPM are formulated, and in the claims one can make upon rejection of the null. The claim of the TPM is that, upon rejection of the null, there are one or more TAs among the effects represented by *p*-values at least as small as the truncation threshold, *τ*. We are careful not to assign a degree of certainty or probability to this event, and we still give the corresponding *p*-value a standard frequentist interpretation. On the other hand, one cannot make a similar claim with the RTP, i.e. that there is at least one TA among the effects represented by the first *K* *p*-values.
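To fix notation, the two test statistics can be written down in a few lines (a minimal illustration, assuming the *p*-values are held in a NumPy array; the function names are ours, not from the literature):

```python
import numpy as np

def tpm_statistic(pvals, tau=0.05):
    """Truncated product method (TPM) statistic: product of all
    p-values <= tau. Returns 1.0 (the empty product) when no
    p-value passes the truncation threshold."""
    below = pvals[pvals <= tau]
    return float(np.prod(below)) if below.size else 1.0

def rtp_statistic(pvals, k):
    """Rank truncated product (RTP) statistic: product of the
    K smallest p-values."""
    return float(np.prod(np.sort(pvals)[:k]))
```

Smaller values of either statistic are more significant; each must still be referred to its own null distribution to obtain a combined *p*-value.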

An intuitive explanation for this phenomenon is as follows. The requirement that the TA *p*-values rank above *K* is satisfied more often when the FA *p*-values happen to be unusually small. Thus, if we evaluate the proportion of rejections among experiments where all TAs rank above the first *K* positions, we would expect it to be above the nominal level. However, there is no similar dependency between the TAs and the FAs for the TPM with any truncation threshold *τ*: the event that all TA *p*-values are larger than *τ* simply has the effect of increasing *L* without affecting the value of the *p*-value product.

To verify these assertions, we conducted the following additional simulations. To evaluate RTP under the above scenario, we generated 5 true effects for each simulation experiment, with *L*=25 and individual power of 70% for a TA, under the condition that the 5 smallest *p*-values all represent FAs. Simulated samples where this was not the case were discarded. Next, we took the product of the first *K*=5 *p*-values and computed the corresponding combined *p*-value, *p*^{c}. In these simulations, RTP still had 30% power to reject the hypothesis that there are no true effects among the total of *L* hypotheses. Taking a smaller value, *K*=3, still gave 21% power, and the power increased to 47% for *K*=10. An analogous simulation for the TPM involved *L*=100 and 10 TAs with 99% power each. In these simulations, samples in which any TA *p*-value was smaller than 0.05 were discarded. As expected, the proportion of rejections for the TPM with *τ*=0.05 was conservative: 2.9% at the level *α*=5% and 6.5% at *α*=10%. However, TPM gained power when the truncation threshold was increased: the power was 29% and 37% for *τ*=0.1 and *τ*=0.2, respectively (at *α*=5%).
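The TPM half of this simulation is easy to reproduce, because the conditioning event factorizes over the TA *p*-values: instead of discarding samples, the TA *p*-values can be drawn directly from the conditional distribution. Below is a sketch with the same settings as above (*L*=100, 10 TAs with 99% individual power), assuming the individual tests are one-sided z-tests; the truncated-normal sampling is our device for imposing the condition without rejection sampling, and all Monte Carlo sizes are our choices:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
L, n_ta, alpha = 100, 10, 0.05
# Mean shift of a one-sided z-test with 99% power at level 0.05
mu = norm.ppf(0.95) + norm.ppf(0.99)

def tpm_stat(p, tau):
    """TPM statistic: product of all p-values <= tau (1.0 if none qualify)."""
    return np.prod(np.where(p <= tau, p, 1.0), axis=-1)

# Monte Carlo null distributions of the TPM statistic for L uniform p-values
null_p = rng.uniform(size=(20000, L))
null05 = np.sort(tpm_stat(null_p, 0.05))
null10 = np.sort(tpm_stat(null_p, 0.10))

# TA p-values conditional on p > 0.05: draw Z ~ N(mu, 1) truncated to
# Z < z_0.95 by inverse-CDF sampling, then transform to p = 1 - Phi(Z).
reps = 2000
cut = norm.cdf(norm.ppf(0.95) - mu)            # Pr(p_TA > 0.05) = 0.01
z_ta = mu + norm.ppf(cut * rng.uniform(size=(reps, n_ta)))
p_ta = 1.0 - norm.cdf(z_ta)                    # all exceed 0.05 by construction
p = np.concatenate([p_ta, rng.uniform(size=(reps, L - n_ta))], axis=1)

def reject_rate(null_sorted, stats):
    """Share of experiments whose empirical TPM p-value is <= alpha."""
    pc = np.searchsorted(null_sorted, stats, side='right') / len(null_sorted)
    return float(np.mean(pc <= alpha))

rej05 = reject_rate(null05, tpm_stat(p, 0.05))
rej10 = reject_rate(null10, tpm_stat(p, 0.10))
```

In our runs the rejection rate at *τ*=0.05 stays below the nominal 5%, while *τ*=0.1 recovers substantial power, consistent with the figures quoted above: the conditioned TA *p*-values concentrate just above 0.05 and so contribute small factors to the product once the threshold clears them.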

It is possible to modify TPM for interpretation on a narrower rejection set, and Appendix 1 describes the details of one possible approach. This is reminiscent of an issue with Benjamini and Hochberg’s [16] false discovery rate (FDR), where FDR = *E*[*F*/(*T*+*F*) | *T*+*F* > 0] Pr(*T*+*F* > 0), denoting true and false discoveries by *T* and *F*. The difference with TPM is that we do not necessarily know what Pr(*T*+*F* > 0) is, and in fact this probability can decrease to *α* as *L* increases. This would happen in settings such as genome-wide association scans, where most added hypotheses are FAs, so that the prior proportion of TAs approaches zero. In this case, *E*[*F*/(*T*+*F*) | *T*+*F* > 0] approaches 1, and Pr(*T*+*F* > 0) approaches *α*. Then, across the sets of *L* experiments, FDR rejects a proportion *α* of the time, as it should, because it becomes the Simes test, which is known to maintain “weak control” of the FWER. The value of Pr(*T*+*F* > 0) depends on power and also on the prior Pr(H_{0}) (in other words, on the population proportion of FAs). Thus, we cannot easily obtain the FDR conditional on having made rejections, *E*[*F*/(*T*+*F*) | *T*+*F* > 0], which would be interpreted as the “proportion of false rejections among rejections”. Nevertheless, in certain settings such as microarray expression experiments, where the proportion of TAs is large and does not decrease with *L*, Pr(*T*+*F* > 0) may approach 1 and the conditional interpretation of FDR is warranted.
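The weak-control limit is easy to check numerically: for independent uniform *p*-values under the global null, the Benjamini and Hochberg step-up procedure makes at least one rejection exactly when the Simes condition holds, and that event has probability *α*. A minimal sketch (all constants are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
L, alpha, reps = 100, 0.05, 20000

# Global null: all L hypotheses are FAs, p-values i.i.d. Uniform(0,1)
p = np.sort(rng.uniform(size=(reps, L)), axis=1)

# Benjamini-Hochberg step-up makes at least one rejection iff some
# ordered p_(i) falls at or below i * alpha / L (the Simes condition)
thresholds = alpha * np.arange(1, L + 1) / L
rate = float(np.mean(np.any(p <= thresholds, axis=1)))
```

The estimated rate of making any rejection comes out close to *α*, i.e. FDR control degenerates to weak FWER control when no true effects are present.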

Care must also be taken in interpreting the value of *K* used with RTP. Closely related to RTP is the set association method of Hoh et al. [17], proposed for large-scale genome association experiments. The set association approach is to take the sum of the *K* largest association test statistics, *S*_{K} = *X*_{1} + *X*_{2} + … + *X*_{K}, to form an overall test that includes the *K* most significant genetic markers. Hoh et al. proposed finding the value of *K* that maximizes the significance of *S*_{K}, obtained from its Monte Carlo permutation distribution. While the method results in a powerful test under models with multiple contributing factors, there is a temptation to interpret the value of *K* that corresponds to the minimum permutational *p*-value for the combined test statistic (*S*_{K}) as an estimate of the number of true associations. For example, in studying the association of glucocorticoid-related genes with Alzheimer’s disease, de Quervain et al. [18] interpret the optimal value of *K* as an estimate of the number of TAs and write that the additional markers “indicate SNPs not contributing to the disease risk significantly, i.e. *p*-value is increasing due to introduction of statistical noise”. However, such an estimate is biased, because true associations tend to be spread over a number of individual test results that are ordered by significance. When *m* is much smaller than *L* and the power corresponding to TAs is low, TAs tend to be interspersed with the first order statistics of FAs. Up to a certain value of *i*, these first order statistics (*X*_{i}) will have *p*-value distributions that are more skewed toward zero than those of the TAs, given that *L* is sufficiently large. Therefore the set association method is likely to reach its highest power for values of *K* > *m*. If the individual test statistics are independent and follow a chi-square distribution, finding the *p*-value associated with *S*_{K} is equivalent to computing Pr(*W*_{K} ≤ *w*), where *W*_{K} = *P*_{1}*P*_{2} ⋯ *P*_{K}, *w* is the observed value of the product, and the *P*_{i} are the ordered random *p*-values corresponding to the individual association tests. Thus, the method is equivalent to the RTP method discussed above.
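When no closed form is at hand, Pr(*W*_{K} ≤ *w*) can be estimated by straightforward Monte Carlo under the global null (a sketch assuming independent tests, so that the null *p*-values are i.i.d. Uniform(0,1); the function name is ours):

```python
import numpy as np

def rtp_pvalue(pvals, k, reps=100_000, seed=1):
    """Monte Carlo estimate of Pr(W_K <= w) under the global null,
    where W_K is the product of the k smallest of L independent
    Uniform(0,1) p-values and w is the observed product."""
    pvals = np.asarray(pvals, dtype=float)
    w = np.prod(np.sort(pvals)[:k])            # observed product
    rng = np.random.default_rng(seed)
    null = np.sort(rng.uniform(size=(reps, pvals.size)), axis=1)
    null_w = np.prod(null[:, :k], axis=1)      # null products
    return float(np.mean(null_w <= w))
```

For instance, ten *p*-values all equal to 0.5 yield a combined *p*-value near 1, while a few very small *p*-values drive it toward 0.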

Appendix 2 provides an analytical derivation of the product distribution and further theoretical justification for the preceding bias claims.