Statistical tests have become increasingly important in medical research [1-3], but many publications have been reported to contain serious statistical errors [4-10]. In this regard, violation of distributional assumptions has been identified as one of the most common problems: according to Olsen [9], a frequent error is to use statistical tests that assume a normal distribution on data that are actually skewed. With small samples, Neville et al. [10] considered the use of parametric tests erroneous unless a test for normality had been conducted beforehand. Similarly, Strasak et al. [7] criticized that contributors to medical journals often failed to examine and report whether assumptions had been met when conducting Student’s t test.
Probably one of the most popular research questions is whether two independent samples differ from each other. Altman, for example, stated that “most clinical trials yield data of this type, as do observational studies comparing different groups of subjects” ([11], p. 191). In Student’s t test, the expectations of two populations are compared. The test assumes independent sampling from normal distributions with equal variance. If these assumptions are met and the null hypothesis of equal population means holds true, the test statistic T follows a t distribution with $n_X + n_Y - 2$ degrees of freedom:

$$T = \frac{m_X - m_Y}{s\,\sqrt{\dfrac{1}{n_X} + \dfrac{1}{n_Y}}},$$

where $m_X$ and $m_Y$ are the observed sample means, $n_X$ and $n_Y$ are the sample sizes of the two groups, and $s$ is an estimate of the common standard deviation. If the assumptions are violated, T is compared with the wrong reference distribution, which may result in a deviation of the actual Type I error from the nominal significance level [12, 13], in a loss of power relative to other tests developed for similar problems [14], or both. In medical research, normally distributed data are the exception rather than the rule [15, 16]. In such situations, the use of parametric methods is discouraged, and nonparametric tests (which are also referred to as distribution-free tests) such as the two-sample Mann–Whitney U test are recommended instead [11, 17].
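The statistic T described above can be computed by hand in a few lines. The following is a minimal sketch using only the Python standard library; it assumes that the estimate of the common standard deviation is the usual pooled estimator, and the function name `pooled_t_statistic` and the example data are illustrative, not taken from the study.

```python
import math
import statistics


def pooled_t_statistic(x, y):
    """Return (T, degrees of freedom) for Student's two-sample t test."""
    n_x, n_y = len(x), len(y)
    m_x, m_y = statistics.fmean(x), statistics.fmean(y)
    df = n_x + n_y - 2
    # Assumption: s is the pooled estimate, i.e. the square root of the
    # df-weighted average of the two sample variances.
    pooled_var = (
        (n_x - 1) * statistics.variance(x) + (n_y - 1) * statistics.variance(y)
    ) / df
    s = math.sqrt(pooled_var)
    t = (m_x - m_y) / (s * math.sqrt(1 / n_x + 1 / n_y))
    return t, df


# Illustrative data only (not from the study).
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
t_value, df = pooled_t_statistic(x, y)
print(f"T = {t_value:.3f}, df = {df}")  # T = -1.897, df = 8
```

The resulting T would then be compared with a t distribution with the returned degrees of freedom, which is only the correct reference distribution if the assumptions listed above hold.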
Guidelines for contributions to medical journals emphasize the importance of distributional assumptions [18, 19]. Sometimes, special recommendations are provided. When addressing the question of how to compare changes from baseline in randomized clinical trials if data do not follow a normal distribution, Vickers, for example, concluded that such data are best analyzed with analysis of covariance [20]. In clinical trials, a detailed description of the statistical analysis is mandatory [21]. This description requires good knowledge about the clinical endpoints, which is often limited. Researchers, therefore, tend to specify alternative statistical procedures in case the underlying assumptions are not satisfied (e.g., [22]). For the t test, Livingston [23] presented a list of conditions that must be considered (e.g., normal distribution and equal variances). Consequently, some researchers routinely check whether their data fulfill the assumptions and change the analysis method if they do not (for a review, see [24]).
In a preliminary test, a specific assumption is checked; the outcome of the pretest then determines which method should be used for assessing the main hypothesis [25-28]. For the paired t test, Freidlin et al. ([29], p. 887) described it as “a natural adaptive procedure (…) to first apply the Shapiro-Wilk test to the differences: if normality is accepted, the t test is used; otherwise the Wilcoxon signed ranked test is used.” Similar two-stage procedures including a preliminary test for normality are common for two-sample t tests [30, 31]. Therefore, conventional statistical practice for comparing continuous outcomes from two independent samples is to use a pretest for normality ($H_0$: “The true distribution is normal” against $H_1$: “The true distribution is non-normal”) at significance level $\alpha_{\text{pre}}$ before testing the main hypothesis. If the pretest is not significant, the statistic T is used to test the main hypothesis of equal population means at significance level $\alpha$. If the pretest is significant, Mann-Whitney’s U test may be applied to compare the two groups. Such a two-stage procedure (Additional file 1) appears logical, and goodness-of-fit tests for normality are frequently reported in articles [32-35].
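The two-stage procedure described above can be sketched in code as follows. This is an illustration under stated assumptions, not the procedure of Additional file 1: it uses SciPy's `shapiro`, `ttest_ind`, and `mannwhitneyu`, it assumes that the pretest is applied to each sample separately and that both samples must pass for the t test to be selected, and the function name `two_stage_test` and the example data are hypothetical.

```python
import math
from statistics import NormalDist

from scipy import stats


def two_stage_test(x, y, alpha_pre=0.05):
    """Pretest both samples for normality, then run the selected main test.

    Returns (name of the selected test, p value of the main test).
    """
    # Stage 1: Shapiro-Wilk pretest on each sample. Assumed decision rule:
    # the t test is used only if neither pretest is significant.
    p_pre_x = stats.shapiro(x).pvalue
    p_pre_y = stats.shapiro(y).pvalue
    if min(p_pre_x, p_pre_y) > alpha_pre:
        # Stage 2a: pretest not significant, so use Student's t test
        # with pooled (equal) variances.
        return "t", stats.ttest_ind(x, y, equal_var=True).pvalue
    # Stage 2b: pretest significant, so use the Mann-Whitney U test.
    return "U", stats.mannwhitneyu(x, y, alternative="two-sided").pvalue


# Deterministic illustration data (hypothetical, not from the study):
# evenly spaced normal quantiles look normal, exponentiated values are
# strongly right-skewed.
normal_like = [NormalDist().inv_cdf((i + 0.5) / 30) for i in range(30)]
shifted = [value + 0.5 for value in normal_like]
skewed = [math.exp(i / 6) for i in range(30)]

print(two_stage_test(normal_like, shifted))
print(two_stage_test(normal_like, skewed))
```

Note that the main test is run on the same observations that were used for the pretest, which is exactly the feature of the procedure whose consequences are at issue here.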
Some authors have recently warned against preliminary testing [24, 36-45]. First of all, theoretical drawbacks exist with regard to the preliminary testing of assumptions. The basic difficulty of a typical pretest is that the desired result is often the acceptance of the null hypothesis. In practice, the conclusion about the validity of, for example, the normality assumption is then implicit rather than explicit: because insufficient evidence exists to reject normality, normality is considered true. In this context, Schucany and Ng [41] speak of a “logical problem”. Further critiques of preliminary testing have focused on the fact that assumptions refer to characteristics of populations and not to characteristics of samples. In particular, with small to moderate sample sizes, the sample distribution is not guaranteed to match the population distribution. For example, Altman ([11], Figure 4.7, p. 60) showed that even samples of size 50 taken from a normal distribution may look non-normal. Second, some preliminary tests are accompanied by their own underlying assumptions, raising the question of whether these assumptions also need to be examined. In addition, even if the preliminary test indicates that the tested assumption does not hold, the actual test of interest may still be robust to violations of this assumption. Finally, preliminary tests are usually applied to the same data as the subsequent test, which may result in uncontrolled error rates. For the one-sample t test, Schucany and Ng [41] conducted a simulation study of the consequences of the two-stage selection procedure including a preliminary test for normality. Data were sampled from normal, uniform, exponential, and Cauchy populations. The authors estimated the Type I error rate of the one-sample t test, given that the sample had passed the Shapiro-Wilk test for normality with a p value greater than $\alpha_{\text{pre}}$. For exponentially distributed data, the conditional Type I error rate of the main test turned out to be strikingly above the nominal significance level and even increased with sample size. For two-sample tests, Zimmerman [42-45] addressed the question of how the Type I error and power are modified if a researcher’s choice of test (i.e., t test for equal versus unequal variances) is based on sample statistics of variance homogeneity. Zimmerman concluded that choosing the pooled or separate variance version of the t test solely on inspection of the sample data neither maintains the significance level nor protects the power of the procedure. Rasch et al. [39] assessed the statistical properties of a three-stage procedure including testing for normality and for homogeneity of the variances. The authors concluded that assumptions underlying the two-sample t test should not be pre-tested because “pre-testing leads to unknown final Type I and Type II risks if the respective statistical tests are performed using the same set of observations”. Interestingly, none of the studies cited above explicitly addressed the unconditional error rates of the two-stage procedure as a whole. The studies rather focused on the conditional error rates, that is, the Type I and Type II error of single arms of the two-stage procedure.
In the present study, we investigated the statistical properties of Student’s t test and Mann-Whitney’s U test for comparing two independent groups with different selection procedures. Similar to Schucany and Ng [41], the tests to be applied were chosen depending on the results of the preliminary Shapiro-Wilk tests for normality of the two samples involved. We thereby obtained an estimate of the conditional Type I error rates for samples that were classified as normal although the underlying populations were in fact non-normal, and vice versa. These rates reflect the error rate researchers may face with respect to the main hypothesis if they mistakenly believe the normality assumption to be satisfied or violated. If, in addition, the power of the preliminary Shapiro-Wilk test is taken into account, the potential impact of the entire two-stage procedure on the overall Type I error rate and power can be directly estimated.
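One way to make this last step concrete is the law of total probability. As a sketch in notation that is not part of the original text, let $\pi_{\text{pre}}$ denote the probability that the pretest rejects normality (its power if the populations are non-normal), and let $\alpha_{t \mid \text{pass}}$ and $\alpha_{U \mid \text{fail}}$ denote the conditional Type I error rates of the t test and the U test in their respective arms of the procedure. Under a true null hypothesis for the main test, the overall Type I error rate of the two-stage procedure would then be

$$\alpha_{\text{overall}} = (1 - \pi_{\text{pre}})\,\alpha_{t \mid \text{pass}} + \pi_{\text{pre}}\,\alpha_{U \mid \text{fail}}.$$

With purely hypothetical values chosen for arithmetic illustration only (not results of this or any cited study), $\pi_{\text{pre}} = 0.8$, $\alpha_{t \mid \text{pass}} = 0.10$, and $\alpha_{U \mid \text{fail}} = 0.05$ would give $\alpha_{\text{overall}} = 0.2 \times 0.10 + 0.8 \times 0.05 = 0.06$. A strongly inflated conditional error rate in one arm can thus have a limited effect on the overall rate if that arm is rarely entered. The same decomposition applies to power when the conditional rejection rates are evaluated under the alternative hypothesis.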