4.1 Design of simulation

4.1.1 Parameter configuration Data for our simulation study were generated by the LMM (1). We assume that the intercept terms α^{(k)} = 0 ∀*k*, an assumption which does not affect the general results presented here. We considered three specifications of the treatment effect **β** = (β^{(1)}, …, β^{(K)})′ in the LMM (1). The first specification places an effect β^{(1)} > 0 on the first outcome, while the other effects are zero, β^{(k)} = 0 for *k* ≠ 1; for example, in the simulation presented here, *K* = 5 and **β** = (0.6, 0, 0, 0, 0)′. The second specification places a uniform treatment effect (also known as a “common effect”) on all the outcomes, β^{(j)} = β^{(k)} > 0 ∀*j* ≠ *k*; for example, β^{(k)} = 0.3 ∀*k*. The third specification places a non-constant, positive effect on all but one of the outcomes; for example, **β** = (0.6, 0.45, 0.3, 0.15, 0)′.
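For concreteness, the three specifications of the treatment effect for *K* = 5 can be written out directly; a minimal sketch (variable names are our own, for illustration only):

```python
import numpy as np

K = 5

# Specification 1: effect on the first outcome only
beta_single = np.array([0.6, 0.0, 0.0, 0.0, 0.0])

# Specification 2: uniform ("common") effect on all outcomes
beta_uniform = np.full(K, 0.3)

# Specification 3: non-constant, positive effect on all but one outcome
beta_graded = np.array([0.6, 0.45, 0.3, 0.15, 0.0])
```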

The variance-covariance structure for the multiple outcomes is determined by the random effect *b*_{i} and the independent error terms *e*_{i}^{(k)}, for which *b*_{i} ~ *N*(0, ρ) and *e*_{i}^{(k)} ~ *N*(0, 1 − ρ). To facilitate discussion here, we consider the setting in which γ^{(k)} = 1 ∀*k*, in order to induce a constant pairwise correlation for all pairs of outcomes (compound symmetry); more complex correlation structures can be defined by changing the values of the scale parameters γ^{(k)} for *k* = 1, …, *K* − 1, with γ^{(K)} = 1. With the γ^{(k)}'s set to unity, *Var*(*Y*^{(k)}|*X*, β^{(k)}) = 1 ∀*k*, so that the outcomes are all measured on the same scale, and the outcomes all have a constant pairwise correlation, ρ. The simulation parameter of most interest is ρ, which affects the power of the univariate and joint tests in different ways. Although the assumption of compound symmetry for the correlation structure might not be tenable in real examples, we make it here in order to illuminate the meaningful influence of correlation on the characteristics of the univariate and joint tests, particularly under the different specifications of the treatment effect and under different patterns of missing data.
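Under this parameterization, the data-generating step can be sketched as follows; a minimal illustration assuming the decomposition *b*_{i} ~ *N*(0, ρ) and *e*_{i}^{(k)} ~ *N*(0, 1 − ρ) with γ^{(k)} = 1, which reproduces unit variances and a constant pairwise correlation ρ (variable names are our own):

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, rho = 100_000, 5, 0.5
beta = np.array([0.6, 0.0, 0.0, 0.0, 0.0])  # first specification of the treatment effect

X = rng.integers(0, 2, size=n)                      # treatment assignment (0/1)
b = rng.normal(0.0, np.sqrt(rho), size=(n, 1))      # shared random effect b_i ~ N(0, rho)
e = rng.normal(0.0, np.sqrt(1 - rho), size=(n, K))  # independent errors e_i^(k) ~ N(0, 1 - rho)
Y = X[:, None] * beta + b + e                       # LMM (1) with alpha^(k) = 0, gamma^(k) = 1

resid = Y - X[:, None] * beta
print(np.var(resid, axis=0))       # each ~ 1: outcomes measured on a common scale
print(np.corrcoef(resid.T)[0, 1])  # ~ rho: constant pairwise correlation
```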

With the foregoing specification, 40 treated (*X* = 1) and 40 comparison (*X* = 0) subjects with *K* = 5 outcomes each were generated by the LMM (1). Other settings of sample size and number of outcomes were considered, but are not shown in the results in §4.2; the relationships between correlation and testing characteristics were similar across these settings. For example, an increase in sample size yielded greater power for the univariate and joint tests, all other things held equal, but the general trends and relationships did not change.

4.1.2 Missing completely at random When data are missing completely at random (MCAR), the probability that observation *Y*_{i}^{(k)} is missing is a constant π, for all subjects *i* and outcomes *k*, and the test procedures will generally suffer a loss of power, although to differing degrees. For example, when data are MCAR with probability π = 0.2 in a sample of 100 subjects, the Bonferroni procedure estimates the treatment effect on each of the *K* outcomes with an expected (1 − π)100 = 80 observations. On the other hand, for *K* = 5 outcomes, say, Hotelling's *T*^{2} statistic uses only an expected (1 − π)^{5}100 = 33 complete observations, that is, 33 individuals whose 5 outcomes are all observed. Joint tests based on a joint model, such as the LMM (1), use all the available information, like the Bonferroni procedure, but also capitalize on the correlation of the outcomes, so that they recover the lost power when the outcomes are highly correlated.
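The expected sample sizes above can be checked directly; a short sketch with π = 0.2, *n* = 100, and *K* = 5:

```python
n, K, pi = 100, 5, 0.2

# Bonferroni: each univariate test uses every subject observed on that outcome
per_outcome = (1 - pi) * n           # expected 80 observations per outcome

# Hotelling's T^2: only subjects with all K outcomes observed contribute
complete_cases = (1 - pi) ** K * n   # expected ~33 complete observations

print(per_outcome, complete_cases)
```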

4.1.3 Missing at random When data are missing at random (MAR), the probability that an observation is missing can depend on observed outcomes and treatment assignment. This dependency can induce bias in test procedures when it is ignored: patterns of MAR missingness will lead to biased estimates of the treatment effect, thereby inflating the type I error rate, if the missingness is not properly accounted for in the estimation.

Our MAR mechanism for the simulation study is specified by the following model. Let **Y**_{i} = (*Y*_{i}^{(1)}, …, *Y*_{i}^{(K)})′ be the vector of *K* outcomes for subject *i* and *X*_{i} be the treatment assignment. Assume that the first outcome *Y*_{i}^{(1)} is always observed, and that the pattern of missingness on the other outcomes, *Y*_{i}^{(k)} for *k* = 2, …, *K*, depends on *X*_{i} and *Y*_{i}^{(1)}. Let π_{i} be the probability of missingness for each of these *K* − 1 outcomes for subject *i*, where

logit(π_{i}) = δ_{0} + δ_{1}*X*_{i} + δ_{2}*Y*_{i}^{(1)} + δ_{12}*X*_{i}*Y*_{i}^{(1)}. (2)

By this missingness model, the baseline probability of missingness is governed by δ_{0}; for ease of discussion, we assume that δ_{1} = δ_{2} = 0, so that missingness depends on the interaction between treatment and outcome *Y*^{(1)} through δ_{12} > 0. By this construction, the outcomes in the comparison group (*X* = 0) are missing completely at random, while the outcomes in the treated group (*X* = 1) are missing at random with probabilities that depend on the values of *Y*^{(1)}. (When δ_{12} = 0, this setting reduces to the MCAR mechanism for both comparison and treated observations.) Without a correctly specified model for the multiple outcomes, such as the LMM (1), it is expected that tests based on data with this pattern of missingness will be biased, namely the Bonferroni and Hotelling's *T*^{2} test procedures. In the missingness model as defined, treated subjects with larger values of *Y*^{(1)} will have more outcomes missing when δ_{12} > 0. When the treatment has an effect only on *Y*^{(1)}, this might pose little problem for univariate tests; however, when the correlation among outcomes is large, the missingness will contribute to biased tests.
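This missingness mechanism can be sketched as follows; the particular values δ_{0} = logit(0.2) and δ_{12} = log 2 here are our own illustrative choices, not values from the study:

```python
import numpy as np

def expit(t):
    # inverse of the logit link in the missingness model (2)
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(1)
n, K = 100_000, 5
delta0 = np.log(0.2 / 0.8)   # baseline missingness of 0.2
delta12 = np.log(2.0)        # illustrative odds ratio of 2 per unit increase in Y^(1)

X = rng.integers(0, 2, size=n)   # treatment assignment
Y1 = rng.normal(size=n)          # first outcome, always observed

# probability that each of Y^(2), ..., Y^(K) is missing (delta1 = delta2 = 0)
pi_i = expit(delta0 + delta12 * X * Y1)

# draw the K - 1 missingness indicators (True = missing)
R = rng.random((n, K - 1)) < pi_i[:, None]

print(pi_i[X == 0].mean())   # comparison group: MCAR at expit(delta0) = 0.2
print(R[X == 1].mean())      # treated group: rate shifts with Y^(1)
```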

When the outcomes are jointly modelled, for example, by the LMM (1), then maximum likelihood estimates of the treatment effect can be obtained. In particular, when the model is correctly specified, then resulting estimates of the treatment effects β will be unbiased, and test procedures based on these estimates will maintain the nominal type I error rate.

4.2 Results

The power of the joint and univariate tests, in the presence of complete and MCAR data, is shown in for *K* = 5 outcomes. Simulation results for larger numbers of outcomes, for example, *K* = 10, are not presented here; such a presentation would have involved results that were not directly comparable across varying values of *K*, because the specifications of the treatment effect do not have an obvious interpretation across values of *K*. However, the results and conclusions from those simulations are similar to those presented here; namely, the power characteristics of the test procedures relative to each other were the same for *K* = 5 versus *K* = 10 outcomes.

For two of the three configurations of the treatment effect, the 5-*df* joint test based on the LMM (1) outperforms the other testing procedures, especially in the presence of highly correlated outcomes; in fact, the power of the 5-*df* test approaches unity when the correlation is large, even when data are MCAR. When the data are completely observed, Hotelling's *T*^{2} closely tracks the power of the 5-*df* test; the small discrepancy in power between the two in the case of complete data is attributable to the correctly specified correlation model (compound symmetry) for the latter. In contrast, the 1-*df* joint test has deflated power for the two settings in which the 5-*df* test performs best; however, as expected, when the treatment effect is uniform on all the outcomes, the 1-*df* test produces the greatest power among the testing procedures considered here.

When the treatment has an effect on one outcome only (for example, β^{(1)} = 0.6 in the first row of ), the power of the Bonferroni procedure does not change with the strength of the correlation of the outcomes; that is, there appears to be no loss of power for the Bonferroni procedure as correlation increases. A basic explanation of this result is that the Bonferroni procedure depends chiefly on the test for β^{(1)} and not on the tests for the other outcomes, which exhibit no treatment effect under the alternative. However, when the treatment has an effect on more than one outcome, the Bonferroni procedure loses power as the correlation among the outcomes increases. This reinforces the earlier point from §3.1 that the power characteristics of the Bonferroni procedure depend on the specification of the alternative hypothesis.

In the presence of data that are MCAR, similar patterns emerge for the relative power of the joint and univariate test procedures, with the exception of Hotelling's *T*^{2}. A subject who is missing one or more outcomes is excluded from the calculation of the Hotelling's *T*^{2} statistic; thus Hotelling's *T*^{2}, based on a much smaller effective sample size, suffers a considerable loss of power when outcomes are missing. On the other hand, the univariate Bonferroni procedure uses all the observed information in constructing the separate *t*-tests; the joint tests based on maximum likelihood estimates from the LMM also maintain good power when data are MCAR (in fact, the power of the 5-*df* test approaches unity with increasing correlation). These joint tests capitalize on the correct specification of the model that generates the multivariate outcomes, and thus should yield the most powerful, and also unbiased, tests.

Further analysis (not shown here) of the relative power between situations in which data are missing and completely observed shows that the relative power increases with correlation ρ; that is, if *Power*_{m} is the power of a given test when data are missing and *Power*_{c} is the power of the same test when data are completely observed, then the plot of *Power*_{m}/*Power*_{c} versus ρ has a positive slope. In other words, when data are missing, the test procedures suffer the greatest loss of power when the outcomes are weakly correlated, but lose little power under high correlation. For example, although the 5-*df* test loses considerable power when data are missing under low correlation, it loses less power under high correlation; in fact, under MCAR data the power of the 5-*df* test approaches unity with increasing ρ in two of the scenarios in .

We provide a basic explanation for this in the setting of two outcomes. Let *Y*_{i}^{(k)} be the observed *k*th outcome, for *k* = 1, 2, for the *i*th subject, and let the indicator *R*_{i}^{(k)} = 1 if *Y*_{i}^{(k)} is observed and *R*_{i}^{(k)} = 0 if missing, with Pr(*R*_{i}^{(k)} = 0) = π. Let *Ȳ*_{c}^{(k)} be the sample mean of the complete observations for the *k*th outcome and *Ȳ*_{m}^{(k)} that for the incomplete observations. The covariance between sample means *Cov*(*Ȳ*_{c}^{(1)}, *Ȳ*_{c}^{(2)}) is a function of the covariances *Cov*(*Y*_{i}^{(1)}, *Y*_{i}^{(2)}); likewise, the covariance between sample means from incomplete data *Cov*(*Ȳ*_{m}^{(1)}, *Ȳ*_{m}^{(2)}) is a function of the covariance given by

*Cov*(*R*_{i}^{(1)}*Y*_{i}^{(1)}, *R*_{i}^{(2)}*Y*_{i}^{(2)}) = (1 − π)^{2} *Cov*(*Y*_{i}^{(1)}, *Y*_{i}^{(2)}).

The covariance of *R*_{i}^{(1)}*Y*_{i}^{(1)} and *R*_{i}^{(2)}*Y*_{i}^{(2)} is therefore proportional to the covariance of *Y*_{i}^{(1)} and *Y*_{i}^{(2)} by the factor (1 − π)^{2}; in the side-by-side plots of , this means that the plots in the right column (missing data) can be viewed as versions of the plots in the left column (complete data), rescaled along the horizontal axis (ρ). Under no missing data and non-uniform treatment effects, correlation increases the power of some tests; under data that are missing completely at random, an increase in correlation compensates for the missing information, a benefit that is particularly noticeable for the joint tests. However, the Bonferroni and Hotelling's procedures do not generally capitalize on the correlation when data are missing.
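The (1 − π)^{2} factor is easy to verify numerically; a sketch for two outcomes under MCAR with independent missingness indicators (all names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, rho, pi = 1_000_000, 0.5, 0.2

# two outcomes with Var = 1 and Cov(Y1, Y2) = rho (compound symmetry)
b = rng.normal(0.0, np.sqrt(rho), size=n)
Y1 = b + rng.normal(0.0, np.sqrt(1 - rho), size=n)
Y2 = b + rng.normal(0.0, np.sqrt(1 - rho), size=n)

# independent MCAR indicators: True = observed, with Pr(missing) = pi
R1 = rng.random(n) > pi
R2 = rng.random(n) > pi

cov_complete = np.cov(Y1, Y2)[0, 1]
cov_masked = np.cov(R1 * Y1, R2 * Y2)[0, 1]

print(cov_masked / cov_complete)   # ~ (1 - pi)**2 = 0.64
```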

To summarize the results, the average power of the test procedures over the three configurations of the treatment effect is presented in the fourth row of ; in a sense, this plot of the average power is itself a scenario in which the treatment has a non-uniform effect on all the outcomes. Arguably this is the most likely situation in a real study, as opposed to a study in which the treatment is believed to have an effect on just one outcome or a uniform effect on all outcomes. The plots of the average power show that under mild correlation, the Bonferroni procedure maintains adequate power (as claimed in Sankoh et al. [10]), compared to the joint tests that were expected to capitalize on the correctly specified joint distribution of the multiple outcomes. It is rather clear, however, that when correlation is moderately large (ρ ≥ 0.4), one should consider the use of joint tests in order to maximize power. Notably, when data are missing, Hotelling's *T*^{2} test should be avoided because of its drastic loss of power.

When data are MAR according to the model (2), our attention turns to the type I error characteristics of the test procedures. The plot of the type I error rate versus correlation among the outcomes is shown in ; in these plots, the parameters of the MAR model (2) are set such that in the comparison group (*X* = 0) the data are MCAR with π ≈ 0.2, while in the treated group (*X* = 1) the odds ratio of missingness for outcomes *Y*^{(k)}, *k* ≠ 1, under a unit increase in *Y*^{(1)} is exp(δ_{12}). In general the joint tests based on the LMM (1) are unbiased for all levels of correlation, while Hotelling's *T*^{2} is the most sensitive to MAR conditions, particularly when the odds ratio of missingness exp(δ_{12}) is large. The Bonferroni procedure is relatively unbiased for small exp(δ_{12}); however, when exp(δ_{12}) is large, the type I error of the Bonferroni procedure becomes inflated with increasing correlation. Both the Hotelling's *T*^{2} and Bonferroni procedures tend to exclude observations with large values of *Y*^{(1)}, such that the resulting estimates of the treatment effect are biased. While bias is of general concern for Hotelling's *T*^{2} procedure, the Bonferroni procedure is relatively unbiased under weakly correlated outcomes (ρ ≤ 0.2); however, high correlation induces greater bias for the Bonferroni procedure when the degree of missingness due to *Y*^{(1)} is large, that is, for large exp(δ_{12}).