Have extensive simulations shown that INTs have certain desirable properties in the context of genetic studies?

We have been able to identify only a few simulation studies in the genetics context that address the utility of rank-based INTs. Specifically:

Wang and Huang (2002) developed a likelihood-based methodology in which a rank-based INT increased power. Their score-statistic approach to QTL mapping is asymptotically equivalent to the corresponding likelihood-ratio test and assumes normally distributed errors. In a series of simulations, the authors showed that with skewed (χ₁²) error distributions, the rank-based INT approach leads to noticeably greater power; however, Type 1 error control was inconsistent, especially with non-additive genetic models.

Peng et al. (2007) conducted a series of simulations of right-skewed (i.e., non-normal) phenotypes within pedigrees, performed Amos' (1994) maximum likelihood (ML) variance components (VC) analysis on both untransformed and INT scores, specifically van der Waerden normal scores (Eq. 1 with *c* = 0), and compared them to a semiparametric QTL (SQTL) method (Diao and Lin 2005). VC analyses applied to INT scores maintained valid Type 1 error rates, while VC analyses of the untransformed data yielded inflated Type 1 error rates. The SQTL method, on the other hand, appeared overly conservative at α = 0.05 and less conservative at α = 0.001 when compared with the INT-transformed results. In preliminary simulations, we found similar results in the context of testing VC in a twin study with an ACE model (i.e., a model with additive genetic, common environmental, and unique environmental effects).
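The rank-based INT itself is simple to compute. As a minimal sketch (assuming the common offset form of Eq. 1; the function name and library choice are ours), the van der Waerden (*c* = 0) and Blom (*c* = 3/8) scores can be obtained as:

```python
import numpy as np
from scipy.stats import rankdata, norm

def inverse_normal_transform(y, c=3/8):
    """Rank-based INT (offset form of Eq. 1).

    c = 3/8 gives Blom scores; c = 0 gives van der Waerden scores.
    Ties receive mid-ranks, so tied observations get identical scores.
    """
    y = np.asarray(y, dtype=float)
    r = rankdata(y)                        # ranks 1..n (mid-ranks for ties)
    n = len(y)
    return norm.ppf((r - c) / (n - 2.0 * c + 1.0))
```

Note that the transform is a monotone function of the ranks alone, which is why, for continuous data, any statistic computed on INT scores depends on the sample only through its rank ordering.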

Kraja et al. (2007) also found that the INT performed well in a large-sample, ML-based SNP association testing context. Therefore, the INT appears to perform nearly as well as if the true transformation to normality were known and applied, but it remains arguable whether the INT method applied to ML estimation of VC is uniformly preferable to the SQTL method.

Our review of the general literature on non-parametric statistics shows that the use of INT scores in parametric tests sometimes yields superior performance compared to parametric tests on untransformed data or other non-parametric alternatives when the normality assumption is not met, but in other circumstances INTs can perform poorly (e.g., Pratt 1964; Keselman et al. 1977), especially when compared to other non-parametric alternatives (e.g., Knoke 1991). Because relative performance varies with the conditions studied, it is important to study INTs under conditions similar to those of modern complex trait genetic studies. Unfortunately, the specific circumstances under which INTs are currently being used often differ substantially from the conditions under which INTs and other rank-based approaches were studied in the past (in terms of both the tests studied and the α levels used). This makes the empirical work (e.g., simulations) of bygone days of lesser utility and suggests the need to continue empirical evaluations in our present context.

Does use of INTs guarantee that tests assuming normality will have correct Type 1 error rates?

Because the marginal distribution of the sample phenotype is almost certain to be nearly normally distributed after an INT, it is tempting to conclude that this will guarantee a correct Type 1 error rate when the normality-of-residuals assumption is violated. Indeed, in a recent *PLoS Genetics* article, Scuteri et al. wrote “To *ensure* adequate control of Type 1 error rates, we applied an inverse normal transformation to each trait prior to analysis.” [Emphasis added.] Yet INTs do not necessarily fulfill the fundamental assumption of normally distributed residuals. On strictly theoretical grounds, then, the use of INTs cannot guarantee that tests assuming normality will have correct Type 1 error rates. Nevertheless, it might be conjectured that, in practice, INTs do offer approximately correct Type 1 error rates.

To explore this issue further, we conducted a number of simulations to illustrate key points. First, we investigated testing of the equality of two means for a continuous outcome (*Y*) based on two independent random samples of *n*_{1} = *n*_{2} = 5 subjects. These groups could be, for example, treated and untreated subjects in a microarray study or wild-type and knock-out mice in a gene manipulation study, where such small sample sizes are quite common. We began by simulating the situation in which all assumptions of the *t*-test were met and the null hypothesis was true (i.e., there was no mean group difference in the two populations sampled). As can be seen in the first two rows of Table 1, when the normality assumption is met and no transformation (*Y*) is used, the correct Type 1 error rates are obtained as expected. In contrast, performing the parametric *t*-test on the INT (Blom) scores produced significant inflation of the Type 1 error rate (i.e., too many false positives) at the α = 0.05 and 0.01 levels, whereas the Type 1 error rate at a nominal α = 0.001 level was zero. These discrepancies from the expected Type 1 error rates arise because the *t*-test performed on INT scores produces *P* values that are smaller than the *P* values yielded by the permutation distribution of ranks. For α = 0.05, the problem of granularity of the permutation distribution of ranks does not subside until the samples are greater than 10 cases per group (Bradley 1968).
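This small-sample behavior can be checked exactly rather than by simulation: for continuous data with *n*_{1} = *n*_{2} = 5, the Blom scores are a fixed set of ten numbers, and under the null hypothesis every assignment of five of them to group 1 is equally likely. A sketch of this enumeration (our own illustration, using Blom's *c* = 3/8):

```python
from itertools import combinations
from math import comb

import numpy as np
from scipy.stats import norm, ttest_ind

n = 10
# Blom scores for ranks 1..10: the INT of any continuous sample of size 10
scores = norm.ppf((np.arange(1, n + 1) - 3 / 8) / (n + 1 / 4))

# Exact Type 1 error of the t-test on INT scores: enumerate all 252
# equally likely assignments of five scores to group 1.
hits_05 = hits_001 = 0
for idx in combinations(range(n), 5):
    mask = np.zeros(n, dtype=bool)
    mask[list(idx)] = True
    p = ttest_ind(scores[mask], scores[~mask]).pvalue
    hits_05 += p < 0.05
    hits_001 += p < 0.001

rate_05 = hits_05 / comb(n, 5)     # exceeds the nominal 0.05
rate_001 = hits_001 / comb(n, 5)   # zero: no split reaches P < 0.001
```

Under this enumeration the α = 0.05 rate exceeds the nominal level, while no assignment reaches *P* < 0.001, matching the pattern described above.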

| **Table 1** Empirical Type 1 error rates for tests of equality of two means |

The previous simulations clearly showed that INTs do not ensure adequate control of the Type 1 error rate in all situations. In response, one might argue that one would not use an INT when the normality assumption is met, so the aforementioned findings are irrelevant. However, one never knows unequivocally whether the normality assumption is met, and assessing the fit of the sample data to the normality assumption is not very useful in small samples (Farrell and Rogers-Stewart 2006). Nevertheless, it could be conjectured that INTs will help control Type 1 error rates when the normality assumption is violated. Therefore, we followed the approach of Wang and Huang (2002) and simulated data from three distributions that were markedly non-normal. We simulated a one degree-of-freedom (*df*) chi-square distribution (χ₁²), which has a skewness of 2.83 and kurtosis of 12, and another highly skewed, heavy-tailed distribution, the Weibull (λ = 1; *k* = 0.5), which has a skewness of 6.39 and kurtosis of 76. We also simulated a heavy-tailed (kurtosis of 3), but symmetric, LaPlace distribution. Table 1 shows that the *t*-test does not make excess Type 1 errors with departures from the normality assumption (under the conditions we have simulated), but rather tends to be conservative. The INT approach, as well as permutation tests performed on *Y* (Good 1999) and on the ranks (Wilcoxon 1945; Mann and Whitney 1947), returns virtually the same Type 1 error rates as when the normality assumption was met. Again, the reason for these virtually identical results with the INT across distributions is the permutation distribution of ranks: if a permutation test were performed on the INT scores, it would have Type 1 error rates identical to the permutation test performed on ranks. Thus, even when the data are markedly non-normal, performing a parametric test on the INTs does not ensure correct Type 1 error rates, and thus provides no definitive advantage over the use of untransformed data or ranks in small samples.

Finally, one might argue that while the conclusions above may be relevant to small studies such as microarray or rodent knockout studies, in larger studies an INT will perform well. To address this question, we ran simulations with larger (though still modest) sample sizes of *N* = 50 and 100 (25 and 50 subjects in each of two groups). With larger sample sizes, however, permutation testing becomes more difficult, and one may assume that the *t*-test for the untransformed scores (*Y*) and approximate tests for ranks will yield valid Type 1 error rates. Table 1 shows that even with these larger sample sizes, the parametric *t*-test still has a suppressed Type 1 error rate for asymmetric error distributions. The Type 1 error inflation for the INT scores due to the granular nature of the permutation distribution is eliminated. Therefore, the *t*-test performed on INT scores has approximately the correct α level, but so too does the rank-based Kruskal and Wallis (1952) approximate test. Thus, while an INT may hold the Type 1 error rate under the nominal alpha asymptotically, so too will the use of untransformed and rank data, and there is no clear advantage to using the INT in terms of controlling the Type 1 error rate of OLS tests in one-way ANOVA-type designs.

To further examine this issue, we generalized our simulation to the one-way ANOVA with three groups. This analysis is extremely common in genetic association studies with a continuous phenotype (*Y*) as the dependent variable and genotype as the grouping variable. One notable issue with applying the one-way ANOVA to genetic association studies is that the sample sizes per genotype group are never expected to be equal in the population, so a random sample from the population will not have genotype groups of equal size. For example, with a di-allelic marker with a minor allele frequency of *P* = 0.5, the heterozygote genotype will have twice the sample size of the two homozygote genotypes. This imbalance in the sample sizes worsens as the allele frequencies deviate from 0.5. Unequal sample sizes in the one-way ANOVA are of particular importance because of the Behrens–Fisher problem. Specifically, the standard one-way ANOVA assumes that each group has the same population variance. The standard ANOVA *F*-test is robust to heterogeneous variances as long as the sample sizes are equal (e.g., Kohr and Games 1974; Zimmerman 2004), but it is well known that the Type 1 error rate of the ANOVA *F*-test is spuriously inflated or suppressed when unequal variances are combined with unequal sample sizes. The Welch (1947) separate-variances version of the *F*-test, which does not use a pooled variance estimate and has modified *df*s, usually eliminates these effects. In fact, Zimmerman (2004) concluded that optimum protection of the Type 1 error rate is assured by using the Welch test unconditionally whenever sample sizes are unequal. Although the exact distribution of the Welch statistic is known under normality (Ray and Pitman 1961), it remains an approximate solution to the Behrens–Fisher problem, and Monte Carlo studies have shown that this approximate solution is not robust to departures from normality (e.g., James 1959; Yuen 1974). Furthermore, solutions based on nonparametric or nonparametric-like procedures have been unsuccessful. For example, Pratt (1964) showed that the Mann–Whitney U and the expected normal scores test (Hájek and Sidák 1967) resulted in nonrobust Type 1 error rates. Feir-Walsh and Toothaker (1974) and Keselman et al. (1977) found the Kruskal–Wallis test (Kruskal and Wallis 1952) and expected normal scores test (McSweeney and Penfield 1969) to be substantially affected by heterogeneity of variance.
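The pooled-variance versus separate-variances distinction is easy to see numerically. A sketch (our own illustration with normal data; scipy's `ttest_ind` with `equal_var=False` computes the Welch version) of a negative pairing, where the larger group has the smaller variance:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
reps = 4000
rej_pooled = rej_welch = 0
for _ in range(reps):
    # Null is true: both means are 0. The larger group has the smaller
    # variance (a "negative pairing" of sample size and variance).
    big = rng.normal(0.0, 1.0, size=150)
    small = rng.normal(0.0, np.sqrt(2.0), size=50)
    rej_pooled += ttest_ind(big, small).pvalue < 0.05
    rej_welch += ttest_ind(big, small, equal_var=False).pvalue < 0.05

print(rej_pooled / reps, rej_welch / reps)  # pooled inflated, Welch near 0.05
```

With a negative pairing the pooled estimate understates the standard error of the mean difference, so the pooled test rejects too often, while the Welch version stays close to the nominal level.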

To demonstrate these issues in the context of genetic association studies, suppose a continuous phenotype (*Y*) that, under the null hypothesis of no mean differences among the three genotypes, is sampled either from a normal error distribution or from one of three markedly non-normal error distributions: LaPlace, χ₁², or Weibull. There is one marker of interest, G. For this simulation, allele frequencies for G were set at *P* = 0.5 and 0.25. Thus, for these two scenarios, the expected sample sizes for each genotype differ substantially. For *P* = 0.5, the homozygote (gg and GG) groups would each be expected to have 25% of the total sample size (*N*), while the heterozygote (Gg) group would be twice as large with 50% of *N*. For *P* = 0.25, the homozygote (gg) group would be expected to have 56.25% of *N*, the heterozygote (Gg) group 37.5% of *N*, and the homozygote (GG) group 6.25% of *N*. We investigated a fairly large total sample size of *N* = 400 and set the sample sizes for each genotype at the expected value (e.g., with *P* = 0.25, the sample size for the GG genotype was set at 25). In keeping with other statistical research on the Behrens–Fisher problem, we set two basic patterns of variance. One pattern is referred to as a positive pairing, where the larger groups have the larger variance. In the case of *P* = 0.5, the smaller homozygote groups have the smaller variance (σ² = 1) and the larger heterozygote group has the larger variance (σ² = 2). With *P* = 0.25, the largest homozygote (gg) group has the largest variance (σ² = 2), the heterozygote group an intermediate variance, and the smallest homozygote (GG) group the smallest variance (σ² = 1). The second pattern is referred to as a negative pairing, where the larger groups have the smaller variance. In the case of *P* = 0.5, the smaller homozygote groups have the larger variance (σ² = 2) and the larger heterozygote group the smaller variance (σ² = 1). With *P* = 0.25, the largest homozygote (gg) group has the smallest variance (σ² = 1), the heterozygote group an intermediate variance, and the smallest homozygote (GG) group the largest variance (σ² = 2).
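The expected genotype group sizes above follow from Hardy–Weinberg proportions; as a quick sketch (our own helper, with G denoting the minor allele):

```python
def expected_genotype_counts(p, n_total):
    """Hardy-Weinberg expected counts (GG, Gg, gg) for minor-allele
    frequency p of allele G in a sample of n_total subjects."""
    return (p ** 2 * n_total,
            2 * p * (1 - p) * n_total,
            (1 - p) ** 2 * n_total)

# N = 400, P = 0.25: GG = 25 (6.25%), Gg = 150 (37.5%), gg = 225 (56.25%)
print(expected_genotype_counts(0.25, 400))
# N = 400, P = 0.50: GG = 100, Gg = 200, gg = 100
print(expected_genotype_counts(0.50, 400))
```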

When there is a positive pairing of sample sizes and variances (lower halves of Tables 2 and 3), the Type 1 error rate of the ANOVA *F*-test is generally suppressed if the error distributions are normal or at least symmetric (LaPlace). Conversely, when there is a negative pairing of sample sizes and variances (upper halves of Tables 2 and 3) and the error distributions are symmetric (normal and LaPlace), the Type 1 error rate of the ANOVA *F*-test is generally inflated. In these situations, the Welch test applied to either the untransformed scores (*Y*) or the INT scores (Blom) helps correct the Type 1 error rate. Therefore, in these circumstances, the INT may lead to a valid Type 1 error rate, but it does not demonstrate a clear advantage over using the untransformed data. Consistent with other studies (e.g., James 1959; Yuen 1974), if the error distributions are skewed, the Welch test performed on the untransformed scores may not lead to a correct Type 1 error rate regardless of the sample size and variance pattern. What is most alarming, however, is that with skewed error distributions, the INT severely worsens the Type 1 error rate inflation of the ANOVA *F*-test and the Welch test. As can be seen in Tables 2 and 3, the Welch test applied to the INT (Blom) scores has rejection rates approaching 100% when the error distributions are skewed (χ₁²; Weibull), regardless of the sample size and variance pattern. This is to be expected to some extent because the INT is based on ranks, and rank-based tests are known to be sensitive to heterogeneity of variance (e.g., Keselman et al. 1977; Zimmerman 1996; Zumbo and Coulombe 1997). Compared to other simulation studies on the Behrens–Fisher problem, we used a very realistic disparity among the sample sizes based on the expected frequencies of genotypes, and we did not use extreme disparities in the variances. Other studies have used variance ratios up to tenfold; in our simulations, the smallest variance was 1.0 and the largest was 2.0. In this situation, applied researchers, even if they are familiar with the Behrens–Fisher problem, might not be alarmed if the within-genotype distributions are skewed and one genotype has a standard deviation of 1 and another a standard deviation of 1.4. However, we show that in this case the Welch test may not maintain a valid Type 1 error rate, and furthermore, applying the Welch test to INT scores severely worsens the problem.

| **Table 2** Empirical Type 1 error rates for tests of equality of three means with total sample size of *N* = 400, allele frequency of *P* = 0.50 (*n*_{GG} = 100, *n*_{Gg} = 200, *n*_{gg} = 100), negative (upper half) and positive (lower half) pairing of sample sizes and variances |

| **Table 3** Empirical Type 1 error rates for tests of equality of three means with total sample size of *N* = 400, allele frequency of *P* = 0.25 (*n*_{GG} = 25, *n*_{Gg} = 150, *n*_{gg} = 225), negative (upper half) and positive (lower half) pairing of sample sizes and variances |

Statistical power

Wilcox (1995) persuasively argued that in many cases the compelling rationale for the use of non-parametric tests over parametric tests is not necessarily preservation of the Type 1 error rate, but rather enhancement of power. To address this issue, we simulated data under an alternative hypothesis. Specifically, the variance of *Y* was 1.0 for each of the two groups, and a between-group mean difference of 0.5 or 0.125 was added. Table 4 shows that for smaller sample sizes the non-parametric approach of performing permutation tests on the untransformed scores (*Y*) and on the ranks does indeed yield more power than a parametric approach, at least for the two larger α levels. At α = 0.05, the permutation test performed on the untransformed scores has the most power. At α = 0.01, permutation tests performed on *Y* and the ranks have identical power, because the tails of these permutation distributions are identical. At α = 0.001, both permutation tests have zero power because they both have a minimum *P* value of 0.0079. With smaller sample sizes, the *t*-test performed on INT (Blom) scores failed to hold the Type 1 error rate (Table 1) and thus yields spuriously higher power; however, if a permutation test were performed on the INT scores, it would have power identical to the permutation test performed on ranks with *N* = 5 per group.
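The 0.0079 floor is simply the granularity of the exact permutation distribution with five cases per group; as a quick check (our own arithmetic):

```python
from math import comb

# With n1 = n2 = 5 there are C(10, 5) = 252 equally likely group
# assignments; the most extreme split and its mirror image give the
# smallest attainable two-sided p-value.
min_two_sided_p = 2 / comb(10, 5)
print(round(min_two_sided_p, 4))  # 0.0079
```

Because no attainable *P* value falls below 0.0079, any test built on this permutation distribution has zero power at α = 0.001.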

| **Table 4** Empirical power for tests of equality of two means |

With larger sample sizes, the Type 1 error rates for *t*-tests performed on *Y* and on the Blom scores and for the Kruskal–Wallis approximate rank test were not inflated, and thus it is valid to compare their power rates, even though the parametric test with a suppressed false positive rate is at a disadvantage (Bradley 1978). With a larger sample size of 25 per group, the *t*-test performed on INT scores and the rank-based Kruskal–Wallis approximate test demonstrated more statistical power than the parametric test on the untransformed scores (see Table 4). With a larger effect size of 0.5 standard deviation units, the rank-based test had slightly more power than the INT approach. For the extremely non-normal Weibull distribution, we reduced the effect size to 0.125 and found that the INT approach had a slight power advantage over the rank-based procedure. Whether this constitutes a consistent power advantage is debatable.

Another concern that occurs frequently is an outcome measure (or phenotype) for which many subjects have the same score. For example, when blood is drawn, it is very common for many subjects to have a zero level of some measure. This creates a problem because it violates the normality assumption of the *t*-test. An INT would result in many subjects having the same transformed score and would not remedy the issue of ties. In terms of rank-based tests, the permutation distribution of ranks is affected by ties. Thus, for smaller sample sizes we suggest using a permutation test or an exact test for ranks; performing the parametric *t*-test on INT scores does not seem appropriate. For larger samples, the Kruskal–Wallis test has a well-known correction for ties. The *t*-test is likely to be robust but have low power.

In a small-scale simulation, we generated normally distributed data (*Y*) with a mean of 0 and variance of 1 for both groups and set all values below zero equal to zero, thus creating a distribution for *Y* in which 50% of the values equal zero and the rest are positive (i.e., a positively skewed distribution with many ties). We generated data for two sample sizes that could make permutation tests very time consuming (two groups with equal sample sizes of *n*_{1} = *n*_{2} = 12 and 20), although one may not be convinced that the asymptotic properties of the *t*-test and the Wilcoxon–Mann–Whitney (WMW) test apply at these sample sizes. We found that the *t*-test, the WMW test, and the *t*-test applied to the INT scores maintained a valid Type 1 error rate at α = 0.05, 0.01, and 0.001. The WMW test and the *t*-test applied to INT scores had very similar statistical power, whereas the *t*-test performed on the original data had substantially less power (results not tabled).
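The ties situation is easy to reproduce; a sketch (our own illustration) of a censored-at-zero phenotype and its effect on ranks and INT scores:

```python
import numpy as np
from scipy.stats import rankdata, norm

rng = np.random.default_rng(5)
y = rng.normal(size=12)
y[y < 0] = 0.0                     # roughly half the values pile up at zero

r = rankdata(y)                    # tied zeros all receive the same mid-rank
z = norm.ppf((r - 3 / 8) / (len(y) + 1 / 4))  # Blom INT of the mid-ranks

# All tied zeros share a single transformed score, so the INT does not
# remove the ties; it only relabels them.
print(np.unique(z[y == 0]))
```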

The intricacies of the differences among the power functions of the *t*-test, WMW, and the *t*-test performed on INTs with different sample sizes, effect sizes, and error distributions need further investigation. However, this does suggest some circumstances in which performing the *t*-test on INTs could be useful (i.e., analysis of extremely non-normal data from simple research designs with sample sizes large enough to make permutation testing intractable, especially when smaller effect sizes are suspected).

How do rank-based INTs fare in complex models?

One might also wonder how the INT approach will perform in the more complex and sophisticated tests that geneticists often employ. The use of rank-transformed data in more complicated models raises many statistical issues. For example, when other blocking factors are present, Hodges and Lehmann (1962) illustrated the necessity of “aligning” the data before rank transformation. This is because in a factorial design the expected value of ranks for an observation in one cell has a non-linear dependence on the original means of the other cells (Headrick and Sawilowsky 2000; Thompson 1991). Consequently, interaction and main effect relationships are not maintained after rank transformations are performed (Blair et al. 1987). Parametric tests for interaction applied to ranks lack an invariance property, which produces distorted Type 1 and Type 2 error rates, and they have performed poorly compared with their normal theory counterparts (e.g., Salter and Fawcett 1993; Mansouri and Chang 1995; Toothaker and Newman 1994). Interaction tests for the rank transform (Conover and Iman 1981) have also performed poorly for a variety of other designs (Akritas 1990; Thompson 1993), including polynomial and response surface regression (Headrick and Rotou 2001), analysis of covariance (Headrick and Sawilowsky 2000; Headrick and Vineyard 2001), and repeated measures designs (Beasley 2002). These findings support Hora and Conover's (1984) warning that simply ranking the data does not result in an adequate test for non-additivity (i.e., interaction), which has implications for using rank-based INTs when evaluating epistasis and gene × environment interactions.

In a genome-wide association study of asthma-related phenotypes, Dixon et al. (2007) applied an INT to each trait and subsequently examined SNP interactions, some of which were on the same chromosome and thus potentially in linkage disequilibrium (LD) to some degree. In a QTL study of cold tolerance in sorghum, Knoll and Ejeta (2008) tested two- and three-way gene interactions after applying the van der Waerden (1952) INT to their dependent variables. In both of these studies the researchers applied an INT to the dependent variables, which would normalize the marginal distribution of the phenotypes, not the residuals.

Knoll and Ejeta (2008) cited Mansouri and Chang's (1995) suggestion that the use of normal scores allows for the analysis of interactions. However, Mansouri and Chang (1995) only showed that this practice is valid in balanced designs with normal error distributions, in which case the INT would not be needed. Furthermore, Mansouri and Chang showed the INT approach to be a conservative test for interaction.

To illustrate this issue in the context of genetic studies of epistasis, we conducted a simulation study for tests of gene × gene interactions in association studies. Salter and Fawcett (1993) showed that rank-based tests applied to testing interactions in the presence of main effects resulted in inflated Type 1 error rates. Therefore, we generated data for two models with genotypic main effects, but no interaction (epistasis):

*Y* = β_{0} + β_{1}*G*_{1} + β_{2}*G*_{2} + ε (Additive Main Effect Pattern, 2)

*Y* = β_{0} + β_{1}*G*_{1} + β_{3}*D*_{2} + ε (Dominant Main Effect Pattern, 3)

where *G*_{1} and *G*_{2} are variables representing diallelic markers coded as −1 for the homozygote with two copies of the minor allele, 0 for the heterozygote, and 1 for the homozygote with two copies of the major allele, and *D*_{2} denotes a dominance coding of the second marker (contrasting heterozygotes with homozygotes). Allele frequencies for *G*_{1} and *G*_{2} (*P*_{1} and *P*_{2}, respectively) were set at 0.5 and 0.25. LD between the *G*_{1} and *G*_{2} markers was defined by a correlation coefficient (*r*) and set at three levels: *r* = 0 (no LD), 0.3 (moderate LD), and 0.7 (strong LD). The coefficients β_{1}, β_{2}, and β_{3} were all set at one; β_{0} was set at zero. Thus, Model 2 has only additive genetic effects, whereas Model 3 has an additive effect for the first marker and a dominance effect for the second marker. Similar to our other simulations, we generated four error terms (ε): Normal, LaPlace, χ₁², and Weibull (λ = 1; *k* = 0.5).

The untransformed data and INT scores were analyzed with two statistical models. One model assumes that only additive effects exist, and the interaction is therefore analyzed as the cross-product of *G*_{1} and *G*_{2}:

*Y* = β_{0} + β_{1}*G*_{1} + β_{2}*G*_{2} + β_{3}*G*_{1}*G*_{2} + ε. (4)

It should be noted that even for data meeting the normality assumption, this model is misspecified for data generated with the Dominant Main Effect Pattern (3).

The second approach uses Cockerham's (1954) approach to model both the additive and non-additive effects of *G*_{1} and *G*_{2}:

*Y* = β_{0} + β_{1}*A*_{1} + β_{2}*D*_{1} + β_{3}*A*_{2} + β_{4}*D*_{2} + β_{5}*A*_{1}*A*_{2} + β_{6}*A*_{1}*D*_{2} + β_{7}*D*_{1}*A*_{2} + β_{8}*D*_{1}*D*_{2} + ε, (5)

where *A* and *D* denote the additive and dominance codings of each marker, yielding tests with 2 *df* for each main effect. The interaction is therefore analyzed as the cross-product of the *G*_{1} and *G*_{2} main effect terms, yielding a 4 *df* interaction test (*H*_{0}: β_{5} = β_{6} = β_{7} = β_{8} = 0). It should be noted that this model is appropriately specified for data generated with both the Additive (2) and Dominant Main Effect (3) Patterns.
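To make the 4-*df* interaction test concrete, here is a sketch of the two nested fits (our own illustration: the dominance coding is a simple heterozygote indicator, which need not match Cockerham's exact contrast coding, and the F statistic is computed from residual sums of squares):

```python
import numpy as np

def codes(g):
    """Additive and dominance codes from minor-allele counts g in {0, 1, 2}."""
    g = np.asarray(g)
    a = g - 1.0                      # -1, 0, 1 (additive)
    d = (g == 1).astype(float)       # heterozygote indicator (dominance)
    return a, d

def f_stat(y, x_full, x_reduced):
    """F statistic comparing nested OLS models via residual sums of squares."""
    def rss(x):
        beta, *_ = np.linalg.lstsq(x, y, rcond=None)
        return float(np.sum((y - x @ beta) ** 2))
    df_num = x_full.shape[1] - x_reduced.shape[1]
    df_den = len(y) - x_full.shape[1]
    return ((rss(x_reduced) - rss(x_full)) / df_num) / (rss(x_full) / df_den)

rng = np.random.default_rng(0)
n = 400
g1, g2 = rng.integers(0, 3, n), rng.integers(0, 3, n)
a1, d1 = codes(g1)
a2, d2 = codes(g2)
y = a1 + a2 + rng.normal(size=n)     # additive main effects, no epistasis

ones = np.ones(n)
x_main = np.column_stack([ones, a1, d1, a2, d2])               # 2 df per marker
x_epi = np.column_stack([x_main, a1*a2, a1*d2, d1*a2, d1*d2])  # + 4 df
F = f_stat(y, x_epi, x_main)         # 4-df test of H0: no epistasis
```

Because the data here contain no interaction, this F statistic should be unremarkable under its reference distribution; replacing *Y* with its INT scores before the fit is the practice whose Type 1 error behavior is examined below.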

Table 5 shows that when the INT is applied to data with additive genotypic main effects (2) and skewed (χ₁²; Weibull) error distributions, both tests for epistasis (gene–gene interaction) have drastically inflated Type 1 error rates. Table 6 reports the Type 1 error rates for data with additive and dominant genotypic main effects and no epistasis (3). These results show that even with normally distributed errors, using a misspecified model (i.e., using the Additive Effects Only Model 4 when Dominant Main Effects exist) will lead to severely inflated Type 1 error rates when the markers are in LD, and this inflation worsens as the degree of LD increases (results not tabled). Applying the INT to the data does not alleviate this problem, and as was the case in the condition presented in Table 5, the Type 1 error rates are inflated even when the properly specified model (5) is applied to INT-transformed data with skewed errors. More subtly, these results show that INTs applied to data with symmetric error distributions (Normal; LaPlace) tend to suppress the Type 1 error rate, which will lead to a reduction in statistical power to detect epistasis. Table 7 demonstrates that the Type 1 error rates for the INT-transformed data are problematic even with different allele frequencies. Consistent with Wang and Huang (2002), who showed that using INTs in QTL testing can inflate the Type 1 error rate with non-additive genetic models, these simulations show that after applying an INT, tests for epistasis (i.e., gene–gene interaction, which is also a non-additive effect) can have inflated Type 1 error rates when error distributions are skewed, and reduced power compared to the untransformed data when error distributions are symmetric.

| **Table 5** Type 1 error rates for tests of epistasis (Additive Main Effect Pattern, Eq. 2) |

| **Table 6** Type 1 error rates for tests of epistasis (Dominant Main Effect Pattern, Eq. 3) |

| **Table 7** Type 1 error rates for tests of epistasis (Additive Main Effect Pattern, Eq. 2) |

Several studies have shown that aligning the data before ranking yields valid tests of the interactions in factorial designs (Beasley 2002; Higgins and Tashtoush 1994). In the studies we reviewed, however, an INT was applied to the phenotype, which would affect the marginal distributions (main effects), not the residuals. The phenotypic data were not aligned before the application of the INT. Given the functional relationship of ranks and INTs and the issues we have noted with INTs, after the data are aligned and ranked, we do not anticipate any particular advantage to using an INT over the ranks themselves.

Hettmansperger and McKean (1977) and Jaeckel (1972) have developed approaches based on the ranks of residuals that are robust and efficient relative to parametric methods for linear models when the normality assumption is not met. Conover (1973) has shown that functions of ranks can be derived for any alternative hypothesis, resulting in locally most powerful tests.

Finally, if one wants a non-parametric test in more complex situations, a legitimate permutation test can be constructed in the overwhelming majority of cases (see Allison et al. 2006 and references therein for nuances and caveats). For example, one caveat is that the analysis of INTs, which have ranks as their basis, fundamentally changes the null hypothesis being tested, even if one performs a parametric *t*-test on INTs in a simple two-group design (Bradley 1968). The permutation distribution of ranks does not change due to factors such as within-group variance or distributional shape. Thus, without assuming that the error distributions have identical variance and shape, the *t*-test performed on INTs evaluates a null hypothesis of identical distributions, not necessarily identical location parameters (means). A statistically significant *t*-test performed on an INT therefore indicates that the groups are not from identical populations, but it does not necessarily indicate that the difference is strictly due to mean differences (Vargha and Delaney 1998). Further, in more complicated statistical models, the null hypothesis being tested may become even less obvious (Beasley and Zumbo 2003).