To further evaluate the performance of the proposed testing procedures, we compared by simulation studies the average power of six multiple testing procedures: the Šidák procedure, the Bonferroni procedure, the weighted Šidák (Bonferroni) procedure using the optimal Šidák (Bonferroni) weights, and the GS Šidák (Bonferroni) procedure using the optimal Šidák (Bonferroni) weights.
3.1. Assuming true means
μ = (
μ1,
μ2,
, μm known
When we assume that the means of statistics are known, for each true null hypothesis Hj : μj = 0, the weight is assigned to zero in each procedure using weights. Thus, all true null hypotheses will not be rejected in these procedures. In other words, the FWER is equal to zero in the simulation studies with known means of statistics. Thus, we only compare the average power among these six procedures.
We simulated datasets in a similar way to
Rubin et al. (2006). Each simulated dataset
X = (
Xi,j)
n×m consisted of
n =100 i.i.d observations, where each observation corresponded to a subject and was a vector of measurements of
m = 1,000 independent covariates, and
Xi,j was a measurement of covariate
j at the
i-th observation. For each covariate
j, we assumed that
Xi,j ~
N(
γj, 1),
i = 1,

,
n, and we implemented a test with statistic

to test the null hypothesis
γj = 0 against the alternative
γj > 0. Using the central limit theorem,
Zj follows an asymptotic distribution of
N(
uj, 1), where

(
Rubin et al., 2006). Thus, testing hypothesis
γj = 0 is equivalent to testing hypothesis
Hj:
uj = 0.
When generating each dataset that was associated with 1,000 covariates, we randomly chose 50 covariates and set the means
γj > 0 for these 50 covariates (i.e.,
μj > 0 for the corresponding test statistic
Zj), and set
γj = 0 for the other 950 covariates. In our simulation studies, we considered two scenarios for the 50 nonzero
γj. In Scenario 1, we set the 50 non-zero
γj equal to a common value
γ that varies as a simulation parameter. We considered
γ between 0.1 and 0.5, in increments of 0.1 (correspondingly,

are between 1 and 5, in increments of 1). We simulated 1,000 datasets corresponding to each of these
γ values. In Scenario 2, for the 50 non-zero
γj, we set the first ten
γj = 0.1 (
μj = 1), the second ten
γj = 0.2 (
μj = 2),

, and the fifth ten
γj = 0.5 (
μj = 5). We simulated 1,000 datasets for Scenario 2.
shows the results of the estimated average power of the six multiple testing procedures in Scenarios 1 and 2. From , we can see that the GS Šidák, weighted Šidák, and Šidák procedures have slightly higher estimated average power than the corresponding GS Bonferroni, weighted Bonferroni, and Bonferroni procedures, and that the GS Šidák procedure is most powerful among the six procedures. The GS Šidák procedure and the GS Bonferroni procedure can have much higher power than both the weighted Šidák procedure and the weighted Bonferroni procedure. For example, in Scenario 1, when μ = 3, the estimated average power of the GS Šidák procedure, GS Bonferroni procedure, the weighted Šidák procedure, and the weighted Bonferroni procedure is 0.5820, 0.5792, 0.4670 and 0.4639, respectively (see ).
| Table 1.The estimated average power of the six multiple testing procedures over 1,000 replicated data sets when given the means of the test statistics. (α =0.05, m2 = 50, m =1000 and n =100). |
3.2. True means
μ = (
μ1,
μ2,
, μm) unknown
In previous sections, all weights are calculated under the assumption that the means
μ = (
μ1,
μ2,

,
μm) of statistics are known. However, in real data analysis, the means
μ are usually unknown. The means
μ can be estimated by using certain prior information. How to effectively estimate
μ is still a topic to be further investigated.
Rubin et al. (2006) described a data-splitting method, which splits the full data
X into two parts,
X1 and
X2, with proportion π and 1- π of the original data
X, respectively. Data
X1 is used to estimate the means of the standardized test statistics and data
X2 is used to test the hypotheses. They showed that if some prior information, such as the order of means (
μ1,
μ2,

,
μm), is available, by using the data-splitting method the weighted Bonferroni procedure has higher power than the Bonferroni procedure. In genetic association studies,
Roeder et al. (2007) described a two-step approach to estimate the means (
μ1,
μ2,

,
μm), by using prior information such as reported linkage peaks, results of previously genome wide association studies, or results of gene expression studies.
It is beyond of the scope of this study to determine how to effectively estimate the means (
μ1,
μ2,

,
μm) by using prior information. To show the performance of our proposed procedures when estimated means are used, as an example, we implemented our proposed procedures by incorporating the data-splitting method and applied these methods to the simulated Scenarios 1–2 data sets described in the previous section. The only exception is that we assume here that the first 950 covariates have means equal to zero and the last 50 covariates have the common mean value
γ >0. This fellows the assumption of
Rubin et al. (2006) that the order of means (
μ1,
μ2,

,
μm) is known.
For each simulated dataset, the Bonferroni procedure and the Šidák procedure were implemented on the entire dataset, while the other four procedures used the data-splitting method (under the assumption that the order of means (
μ1,
μ2,

,
μm) is known). shows the estimated average power and family-wise error rates of the six procedures for Scenarios 1 and 2. We only show the results with the proportion π of the first part
X1 equal to 0.1.
| Table 2.The estimated FWERs and average power of the six multiple procedures over 1,000 replicated data sets when the means of the test statistics are unknown (α = 0.05, m = 1000, m2 = 50, n=100, and π = 0.1 for data-splitting). |
From , we can find that the weighted Šidák and the GS Šidák procedures have slightly higher estimated average power than the weighted Bonferroni and the GS Bonferroni procedures, respectively, and that the GS Šidák procedure has the highest estimated average power among these six procedures. For example, when μ is equal to 4, the estimated average power of the GS Šidák procedure is 0.8719. It is nearly 13% more than that of the weighted Bonferroni procedure (0.7427). In addition, it is interesting that the estimated average power of the six procedures is smaller than their estimated FWERs when μ is equal to 1. This occurs because the average power is the average (not cumulative value) of per-hypothesis powers for the 50 false null hypotheses, and the FWER is a cumulative value (not average) of type I error rates for 950 tests.
From , we can also find that the six procedures can control FWERs quite well. Interestingly, the estimated FWERs are much lower in the four procedures using weights (i.e. the weighted Bonferroni, weighted Šidák, GS Bonferroni and GS Šidák) than in the two procedures without using weights (Bonferroni and Šidák). The reason is that the four weighted procedures used the prior information of the order of means of the test statistics.