The use of FDR-controlling methods for multitest adjustment has brought a clear improvement by increasing the statistical power in families of comparisons [26]. However, this improvement is not useful for experimentalists under all circumstances, in particular when a relatively small sample size and a high number of comparisons are involved. In such conditions, classical multitest adjustments are known to have low statistical power [16]. In fact, a great number of FDR-controlling methods have been proposed to further improve their applicability, although with moderate results [11, 14-17]. Here we suggest a completely different approach, using a sequential goodness-of-fit test on the set of comparisons, which may help in some of the circumstances where the FDR-based approaches fail to find true discoveries. Similar to Bonferroni techniques, the SGoF metatest controls for the FWER (familywise error rate). Given a number
S of tests, the Bonferroni technique fixes the per-comparison error rate at α/S. This value therefore diminishes as the number S of tests grows. The problem is that the power to detect true discoveries also depends on this error rate: with a very stringent significance level, power is very low. Importantly, in the case of SGoF, the per-test error rate is proportional to α/S by a factor that increases with the number of tests, thereby resolving the trade-off between type I error and statistical power. Therefore, the power increases with the number of tests, while the familywise error rate is still controlled to avoid a high false discovery rate. As far as we know, no other multitest adjustment method has this desirable property of increasing power with the number of tests. However, as can be expected from equation (2) and seen in Figure, this increase is not linear. Increasing the number of tests up to 1000 or 10,000 should considerably increase the statistical power of the SGoF adjustment, but above 10,000 the increase becomes slighter (Figure). This suggests that using more than 10,000 comparisons may not offer a clear advantage.
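The Bonferroni side of this argument can be illustrated numerically. The following is a minimal sketch, not part of the simulations reported here: it assumes a one-sided z-test and a hypothetical standardized effect (`effect_z`), and the function name `bonferroni_power` is ours. It shows only how the per-comparison level α/S, and with it the power of each individual test, shrinks as S grows.

```python
from statistics import NormalDist


def bonferroni_power(alpha, n_tests, effect_z):
    """Power of a single one-sided z-test run at the Bonferroni level alpha / n_tests.

    effect_z is an assumed (hypothetical) shift of the test statistic
    under the alternative, in standard-deviation units.
    """
    std_normal = NormalDist()
    per_test_alpha = alpha / n_tests
    # Critical value that the z statistic must exceed to be declared significant.
    critical_z = std_normal.inv_cdf(1.0 - per_test_alpha)
    # Probability that a statistic distributed N(effect_z, 1) exceeds the critical value.
    return 1.0 - std_normal.cdf(critical_z - effect_z)


if __name__ == "__main__":
    for s in (10, 100, 1000, 10000):
        print(s, 0.05 / s, bonferroni_power(0.05, s, effect_z=2.0))
```

With the effect held fixed, the printed power falls monotonically as S increases, which is the loss of power under a fixed α/S rule that the paragraph above describes.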
Another issue concerning the statistical power of SGoF is the percentage of tests in which the alternative is true (% effect), which has a clear impact on the discovery rates. For SB and BH this impact is more difficult to follow from the tables because it is offset by the increasing number of tests (which reduces the power). With SGoF, however, the effect is very clear because, as expected from equation (2), both the percentage of effects (1-λ) and the number S of tests increase the power.
Obviously, because SGoF does not exert such stringent control on either the per-test error rate or the FDR, the FDR is allowed to be higher than with the SB and BH methods. However, given a number K of observed significant tests, as power increases the FDR is expected to diminish, because in such a case the proportion of true discoveries approaches 1. Therefore, SGoF will attain an indirect control of the FDR when large numbers of tests and/or effects are involved. That is, SGoF will behave especially well compared with the classical methods when the alternative is weak and both the number of effects across the family of tests and the number of tests involved are high. In this case, SGoF can be up to two orders of magnitude more powerful than the other methods, while maintaining acceptable FDR values.
We have also observed that if the p-values are not correctly calculated, the FDR will be uncontrolled, as occurs with any other multitest adjustment method. This is noteworthy because empirical studies do not usually involve large sample sizes within each test. Known multitest adjustment methods can have very good asymptotic statistical properties. In fact, both SB and BH have very good power with the kinds of tests assayed when the sample size is as large as 500 (not shown). The problem is that empirical science does not operate in the asymptotic regime but with finite sample sizes. As we have seen, the assumption of controlled FDR fails when the sample size is small, at least under one-sample t-tests and homogeneity G-tests. Additionally, the classical adjustment methods (B, SB and BH) have low power when the number of tests is high and/or the effects are weak. Under these conditions, SGoF should be considered an interesting method for detecting that some kind of true effect exists, even though we cannot be confident that all detected positives are true discoveries. In addition, some uncertainty exists when several significant p-values have exactly the same value. For example, if 9 out of 10 comparisons have a p-value below 0.05, say 0.049, SGoF will show that 8 cannot be explained by chance, but the researcher has no way to choose among the 9. On the other hand, alternative multitest adjustment methods (BH or others) cannot find any significant case. Nevertheless, from an experimentalist's point of view, it is more useful to know that at least 8 hypotheses deserve more detailed study than to simply ignore all of them. In cases like this, under the SGoF method, the 8 significant tests are chosen randomly from the 9 available.
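The tied example above can be checked with a short calculation. The sketch below is illustrative only and does not implement the SGoF procedure itself. It assumes a hypothetical tenth p-value of 0.5, applies a textbook BH step-up rule at α = 0.05, and computes the binomial probability of observing 9 or more p-values below 0.05 out of 10 if all null hypotheses were true. The function names are ours.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def benjamini_hochberg_rejections(p_values, alpha):
    """Number of hypotheses rejected by the BH step-up rule at level alpha."""
    ordered = sorted(p_values)
    n = len(ordered)
    rejected = 0
    for rank, p in enumerate(ordered, start=1):
        # Step-up rule: reject all hypotheses up to the largest rank that passes.
        if p <= rank * alpha / n:
            rejected = rank
    return rejected


# Nine tied p-values at 0.049; the tenth value (0.5) is an assumption for illustration.
p_values = [0.049] * 9 + [0.5]

bh_count = benjamini_hochberg_rejections(p_values, alpha=0.05)
excess_probability = binomial_upper_tail(9, 10, 0.05)

if __name__ == "__main__":
    print("BH rejections:", bh_count)
    print("P(at least 9 of 10 below 0.05 | all nulls true):", excess_probability)
```

BH rejects nothing here because the largest tied value, 0.049, exceeds its rank threshold of 9 × 0.05/10 = 0.045, whereas the binomial tail probability is extremely small. That is the contrast the paragraph draws: the family as a whole carries a clear signal even though no individual p-value survives the rank-based thresholds.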
Concerning statistical properties such as conservativeness, sensitivity and specificity, we have computed the degree of conservativeness [27] and performed ROC analysis [28] for the same cases as in Figure [see Additional file 2]. The results confirm the good properties of SGoF, as already expected from the higher per-comparison error rates (see property 2) and the numbers of true and false discoveries (see Figure).
Another important topic concerning multiple hypothesis testing applied to high-throughput experiments is the intrinsic inter-dependency of gene effects. Correlation can have an important effect on FDR-based adjustment methods [29]. However, the kind of dependence usually considered for gene effects is so-called weak dependence, which corresponds to local effects among a small number of genes [30]. It has been shown that under the assumption of weak dependence, the FDR-based methods remain useful provided that the number of tests is large enough [20, 30]. SGoF does not consider the p-values individually but rather the proportion of significant ones, and this should make it more robust to dependence issues. Therefore, we expect SGoF to perform at least as well as FDR-based methods when gene dependencies are considered. Our preliminary results (not shown) indicate that dependence has no effect on SGoF power provided that the blocks of correlated genes are small. Indeed, even with blocks as large as 100 genes and correlation as high as 0.9, the loss in power is small. Furthermore, short blocks of correlated genes are what is expected in genome- and proteome-wide studies [20, 29]. Additionally, we have observed that if the blocks are short, the magnitude of the correlation has a minor effect. Nevertheless, we think that this topic deserves further study.
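One way to explore block dependence is to simulate the quantity SGoF works with, namely the proportion of p-values at or below a threshold γ. The sketch below is not the simulation behind the preliminary results mentioned above. It assumes a simple one-factor model in which the z-scores within a block share pairwise correlation `rho`, all null hypotheses are true, and the tests are one-sided. The function name, block sizes, number of replicates and seed are hypothetical choices.

```python
import random
from statistics import NormalDist, mean, pvariance


def significant_proportions(n_tests, block_size, rho, gamma, n_reps, seed):
    """Proportion of one-sided p-values <= gamma per replicate, with all nulls true.

    Within each block of block_size tests the z-scores have pairwise
    correlation rho (assumed one-factor model); blocks are independent.
    """
    rng = random.Random(seed)
    critical_z = NormalDist().inv_cdf(1.0 - gamma)
    shared_weight = rho ** 0.5
    own_weight = (1.0 - rho) ** 0.5
    proportions = []
    for _rep in range(n_reps):
        hits = 0
        for start in range(0, n_tests, block_size):
            shared = rng.gauss(0.0, 1.0)  # component common to the whole block
            for _gene in range(min(block_size, n_tests - start)):
                z = shared_weight * shared + own_weight * rng.gauss(0.0, 1.0)
                # p <= gamma is equivalent to z exceeding the critical value.
                hits += z > critical_z
        proportions.append(hits / n_tests)
    return proportions


independent = significant_proportions(2000, 1, 0.0, 0.05, 50, seed=1)
blocked = significant_proportions(2000, 100, 0.9, 0.05, 50, seed=1)

if __name__ == "__main__":
    print("independent:", mean(independent), pvariance(independent))
    print("blocks of 100, rho = 0.9:", mean(blocked), pvariance(blocked))
```

Under this model the expected proportion of significant p-values stays at γ regardless of the correlation, while its variability across replicates grows with block size and correlation. Varying `block_size` and `rho` gives a rough feel for when the blocks are short enough for the dependence to matter little.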
Finally, we note that we have obtained p-values via simulation from two kinds of tests: the one-sample t-test, which is widely used, and homogeneity tests, which are also frequently involved in multiple comparisons [31]. SGoF should also be of general utility for other families of multiple comparisons, although this deserves further investigation. The failure of classical multitest adjustments to deal with a huge number of tests (>1000) has been considered a key problem in many omic technologies [16], and so SGoF addresses a well-known need.