Many complex traits studied in genetics have markedly non-normal distributions, which often implies that the assumption of normally distributed residuals has been violated. Recently, inverse normal transformations (INTs) have gained popularity among genetics researchers and are implemented as an option in several software packages. Despite this increasing use, we are unaware of extensive simulations or mathematical proofs showing that INTs have desirable statistical properties in the context of genetic studies. We show that INTs do not necessarily maintain proper Type 1 error control and can also reduce statistical power in some circumstances. Many alternatives to INTs exist. Therefore, we contend that there is a lack of justification for performing parametric statistical procedures on INT scores, with the exception of simple designs in which moderate to large sample sizes make permutation testing computationally infeasible and maximum likelihood testing is used. Rigorous research evaluating the utility of INTs seems warranted.
The validity of many statistical tests depends on the assumption that residuals from a fitted model are normally distributed (Berry 1993). In contrast, many complex traits studied in genetics have markedly non-normal distributions (Micceri 1989; Allison et al. 1999), which in many cases implies non-normal residuals. Several approaches exist to respond to non-normality, including but not limited to reliance on asymptotic properties (Mehta et al. 2004), transformation of the data (Etzel et al. 2003; George and Elston 1987; Shete et al. 2004; Yang et al. 2006), and the use of nonparametric tests (Neave and Worthington 1989), which subsumes the analysis of rank data (e.g., Zak et al. 2007), permutation tests, and bootstrap approaches as special cases (Good 1999).
Recently, the rank-based inverse normal transformation (INT) has gained in popularity among genetic researchers. INTs have been applied in a variety of genetic research designs, which have included a wide range of species. For example, INTs have been applied to model the heritability of change in startle response after training among mice (Valdar et al. 2006), microarray data in a rodent model (Przybyla-Zawislak et al. 2005), SNP associations in Rheumatoid Arthritis (Kraja et al. 2007), the family transmission and heritability of externalizing disorders in humans (Hicks et al. 2004), linkage to human body mass index (BMI) (Wu et al. 2002), genome wide association studies with BMI (Scuteri et al. 2007), twin studies of psychopathic personality traits (Blonigen et al. 2003) and frontal brain functioning in humans (Anokhin et al. 2003), genome-wide linkage scans of musical aptitude (Pulli et al. 2008) and attention deficit disorder (Nanda et al. 2008), genome wide association and SNP interactions in asthma-related traits (Dixon et al. 2007), QTL analysis and gene interactions in the cold tolerance of sorghum (Knoll and Ejeta 2008), and in multiple other genetic studies (e.g., Ashton and Borecki 1987; Hicks et al. 2007; Martin and Crawford 1998; Silverman et al. 1990; Tzou et al. 1991). INTs are implemented as an option in at least three statistical genetic software packages (e.g., Analysis System 130 (2003); POLY (2003), see Chen and Abecasis (2006); and SOLAR (2008), see Almasy and Blangero (1998)). Despite this recent widespread use, we are unaware of thorough discussions of this issue. Therefore, we offer this commentary to consider arguments and evidence regarding the benefits and disadvantages of INTs in modern genetic research.
INTs are ways of transforming the sample distribution of a continuous variable to make it appear more normally distributed. There are several types of INTs. The first distinction we make is between rank-based and non-rank-based INTs. Non-rank-based INTs entail assuming a particular cumulative distribution function (CDF) for the observed data, estimating the parameters of that distribution, converting observed scores to estimated quantiles from the CDF, and then converting these quantiles to standard normal deviates using the inverse normal (or probit) function. Such non-rank-based INTs are usually referred to as copulas (Basrak et al. 2004; Li et al. 2006) and will not be considered further. It is worth noting, however, that the rank-based INTs can be expressed as a special case of the copula method in which the empirical CDF is used instead of restricting the CDF to some family of distributions. That is, every moment is in effect estimated from the data and the quantiles become simple functions of the ranks.
Rank-based INTs involve a preliminary step of converting a variable to ranks and can be further subdivided into two classes: those that involve a stochastic element and those that are deterministic. We are aware of only one INT that involves a stochastic element and this approach has been referred to as the use of “random normal deviates” (Conover 1980). One deterrent to this approach is that each investigator applying the same method to the same dataset will obtain a slightly different answer, which might be unsatisfying to some. This approach has the theoretical advantage of avoiding the granularity in the distribution of P values, an issue which often plagues many nonparametric tests. Nevertheless, the stochastic nature of these INTs seems to discourage researchers and they are rarely, if ever, used.
Deterministic rank-based INTs can be classified into those that use expected normal scores (Fisher and Yates 1938) versus those that use back transformation of sample quantiles (or fractional ranks) to approximate the expected normal scores. Using numerical integration, Harter (1961) has provided the most complete table of expected normal scores. INTs that involve back transformation of fractional ranks to approximate the expected normal scores of Fisher and Yates (Maritz 1982) appear to be the most commonly used in genetic research and will be the primary focus of attention. In back-transforming ranks, a fractional offset is needed to avoid having the minimum and maximum observations transformed to negative and positive infinity, respectively. Perhaps the most commonly used rank-based INT transformation entails creating a modified rank variable and then computing a new transformed value of the phenotype for the ith subject:
yi = Φ−1[(ri − c) / (N − 2c + 1)],  (Eq. 1)

where ri is the ordinary rank of the ith case among the N observations and Φ−1 denotes the standard normal quantile (or probit) function. Blom (1958) recommended the value c = 3/8. Other such INTs are minor variations, with c replaced by other values. The Blom transformation is available as an automated option in software such as SAS and SPSS, which also offer three other INTs: Tukey (1962; c = 1/3), Rankit (Bliss 1967; c = 1/2), and van der Waerden (1952; c = 0). Because the Blom transformation with c = 3/8 appears to be the most commonly used, we focus on it, but the other INTs are virtually linear transformations of the Blom scores and are so similar to the expected normal scores that it is unlikely to make any difference which one is used (Tukey 1962). Thus, our comments apply to all deterministic rank-based INTs.
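To make the transformation concrete, here is a minimal sketch in Python (using NumPy and SciPy; the function name is our own choice) that applies the Blom version of the rank-based INT to a skewed sample:

```python
import numpy as np
from scipy.stats import norm, rankdata

def rank_inverse_normal(y, c=3.0 / 8.0):
    """Rank-based inverse normal transform with offset c (Blom: c = 3/8).

    Each observation is replaced by Phi^{-1}((r_i - c) / (N - 2c + 1)),
    where r_i is its rank among the N values.
    """
    y = np.asarray(y, dtype=float)
    n = y.size
    ranks = rankdata(y)  # ties receive average ranks
    return norm.ppf((ranks - c) / (n - 2.0 * c + 1.0))

# A right-skewed sample becomes approximately standard normal
rng = np.random.default_rng(0)
x = rng.chisquare(df=1, size=1000)
z = rank_inverse_normal(x)
```

Substituting c = 1/3, 1/2, or 0 gives the Tukey, Rankit, and van der Waerden variants, respectively.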
As stated above, most parametric tests assume that residuals from a model are normally distributed. In situations where two or more groups (e.g., genotypic groups or identity by descent classes) are being compared, this implies that within-group phenotypic distributions (conditional on any covariates included in the model) are normally distributed. In contrast, use of an INT assures that the marginal distribution of the phenotype is nearly normal, which is only appropriate under a complete null hypothesis (i.e., all effects in the model are null). Thus, approximate normality is assured for the wrong distribution (i.e., for the overall phenotype, not necessarily the residuals). This is in contrast to some other transformations, such as the Hodges and Lehmann (1962) method, which “aligns” the data (i.e., removes the effects of other variables) before rank transformation, or the Box–Cox transformation (Box and Cox 1964), which can be implemented such that it maximizes the normality of the sample residuals.
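For example, the Box–Cox approach is available off the shelf; the sketch below (Python/SciPy) estimates the power parameter λ by maximum likelihood, which for this family amounts to choosing the transformation that makes the sample most nearly normal. Targeting the residuals rather than the marginal distribution would instead require estimating the transformation within the model fit.

```python
import numpy as np
from scipy import stats

# Box-Cox requires strictly positive data; a chi-square sample qualifies.
rng = np.random.default_rng(0)
y = rng.chisquare(df=1, size=500)

# boxcox with the default lmbda=None returns the transformed data and the
# maximum-likelihood estimate of the power parameter lambda.
y_bc, lam = stats.boxcox(y)

# The transformed sample is far less skewed than the original.
skew_before, skew_after = stats.skew(y), stats.skew(y_bc)
```

For right-skewed data such as this, the estimated λ falls well below 1 (the identity transformation).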
This limitation of INTs was recognized by Servin and Stephens (2007) who wrote: “Regarding the normality assumption, following a suggestion by Mathew Barber (personal communication), in practical applications, we are currently applying a normal quantile transform to phenotypes (replacing the rth biggest of N observations with the (r − 0.5)/Nth quantile of the standard normal distribution) before applying our methods. Imposing normality on our phenotype in this way is different from the normality assumption in our phenotype model, which states that the residuals are normally distributed. However, in this context, where effect sizes are expected to be generally rather small, normality of phenotype and normality of residuals are somewhat similar assumptions, suggesting that this transform may be effective.” [Emphasis added.] This also exposes a problem of INTs; they do not make any population distribution normal, they merely make particular sample distributions appear near-normal.
We have conducted a thorough search of the literature and contacted colleagues in the field, particularly those that use INTs in genetic studies, and have been unable to identify such theoretical work in the context of genetic studies. Few of the genetic studies using an INT that we read cited any evidence that it had any particular statistical properties or benefits. One of the few examples, Hicks et al. (2004) wrote “…we used a normalizing (Blom) transformation that has been shown to optimize model selection when analyzing psychiatric symptom count data” and cited van den Oord et al. (2000) in support. Yet a careful reading of van den Oord et al.'s paper and personal communication with van den Oord (8/23/2007) reveal that van den Oord et al. (2000) actually examined the performance of a simple rank transformation rather than an INT in comparison to other procedures for optimizing model selection and parameter estimation.
In the classic non-parametric statistics literature, however, there are many writings on the use of INTs and a few key points can be gleaned. First, it has been shown that among all possible rank-based INTs, those based on the Fisher and Yates expected normal scores are the most powerful (Barnard 1957; Bradley 1968). In practice, however, all deterministic rank-based INTs are so similar that it is unlikely to make any difference which one is used (Tukey 1962). Second, the original presentations of tests based on INTs stress that the inference should be conducted via permutation if the test is to be exact (Sprent and Smeeton 2001), although in practice this is often eschewed in favor of parametric tests applied to INT scores.
Lastly, compared to the most powerful parametric tests, tests of group differences in location (e.g., mean) involving INTs have asymptotic relative efficiency (ARE) ≥1.0 (Bradley 1968; Chernoff and Savage 1958; Conover 1980). Unfortunately, ARE does not necessarily imply relative efficiency in any particular situation, leaving open the question of how well INTs will perform in specific situations. Moreover, while parametric tests with deterministic INTs have AREs ≥1.0, so too do permutation tests with the untransformed scores (Stuart 1954).
We have been able to identify only a few simulation studies in the genetics context that address the utility of rank-based INTs. Specifically, Wang and Huang (2002) developed a likelihood-based methodology in which a rank-based INT increased power. Their score-statistic approach to QTL mapping is asymptotically equivalent to the corresponding likelihood-ratio test and assumes normally distributed errors. In a series of simulations, the authors showed that with skewed (χ2) error distributions, the rank-based INT approach leads to noticeably greater power; however, Type 1 error control was inconsistent, especially with non-additive genetic models.
Peng et al. (2007) conducted a series of simulations of right-skewed (i.e., non-normal) phenotypes within pedigrees, performed Amos' (1994) maximum likelihood (ML) variance components (VC) analysis on both untransformed and INT scores, specifically van der Waerden normal scores (Eq. 1 with c = 0), and compared them to a semiparametric QTL (SQTL) method (Diao and Lin 2005). VC analyses applied to INT scores maintained valid Type 1 error rates, whereas analyses of the untransformed data produced inflated Type 1 error rates. The SQTL method, on the other hand, appeared overly conservative at α = 0.05 and less conservative at α = 0.001 when compared with the INT-transformed results. In preliminary simulations, we found similar results in the context of testing VC in a twin study with an ACE model (i.e., a model with additive genetic, common environmental, and unique environmental effects). Kraja et al. (2007) also found that the INT performed well in a large-sample ML-based SNP association testing context. Therefore, the INT appears to perform nearly as well as if the true transformation to normality were known and applied, but it remains arguable whether the INT method applied to ML estimation of VC is uniformly preferable to the SQTL method.
Our review of the general literature on non-parametric statistics shows that use of INT scores in parametric tests sometimes yields superior performance compared to parametric tests with untransformed data or other non-parametric alternatives when the normality assumption is not met, but in other circumstances INTs can yield poor performance (e.g., Pratt 1964; Keselman et al. 1977), especially when compared to other non-parametric alternatives (e.g., Knoke 1991). Because relative performance will vary by condition studied, it is important to study INTs in conditions that are similar to those used in modern complex trait genetic studies. Unfortunately, the specific circumstances under which INTs are currently being used often differ substantially from the conditions under which INTs and other rank-based approaches were studied in the past, in terms of both the tests that were studied and the α levels used. This makes the empirical work (e.g., simulations) of bygone days of lesser utility and suggests the need to continue empirical evaluations in our present context.
Because the marginal distribution of the sample phenotype is almost certain to be nearly normally distributed after an INT, it is tempting to conclude that this will guarantee a correct Type 1 error rate when the assumption of normally distributed residuals is violated. Indeed, in a recent PLoS Genetics article, Scuteri et al. wrote “To ensure adequate control of Type 1 error rates, we applied an inverse normal transformation to each trait prior to analysis.” [Emphasis added.] Yet, INTs do not necessarily fulfill the fundamental assumption of normally distributed residuals. It therefore follows, on strictly theoretical grounds, that the use of INTs cannot guarantee that tests assuming normality will have correct Type 1 error rates. Nevertheless, it might be conjectured that, in practice, INTs do offer approximately correct Type 1 error rates.
To explore this issue further, we conducted a number of simulations to illustrate key points. First, we investigated testing of the equality of two means for a continuous outcome (Y) based on two independent random samples of n1 = n2 = 5 subjects. These groups could be, for example, treated and untreated subjects in a microarray study or wild-type and knock-out mice in a gene manipulation study, where such small sample sizes are quite common. We began by simulating the situation in which all assumptions of the t-test were met and the null hypothesis was true (i.e., there was no mean group difference in the two populations sampled). As can be seen in the first two rows of Table 1, when the normality assumption is met and no transformation (Y) is used, the correct Type 1 error rates are obtained as expected. In contrast, performing the parametric t-test on the INT (Blom) scores produced significant inflation of the Type 1 error rate (i.e., too many false positives) at the α = 0.05 and 0.01 levels, whereas the Type 1 error rate at a nominal α = 0.001 level was zero. These discrepancies from the expected Type 1 error rates arise because the t-test performed on INT scores produces P values that are smaller than the P values yielded by the permutation distribution of ranks. For α = 0.05, the problem of granularity of the permutation distribution of ranks does not subside until the samples are greater than 10 cases per group (Bradley 1968).
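This behavior can be checked without simulation error. Because deterministic INT scores depend only on the ranks, the null distribution of the t statistic computed on INT scores for two groups of five is determined entirely by which five of the ten ranks fall in the first group, so the exact Type 1 error rate follows from enumerating all 252 equally likely splits. A sketch in Python (the function name is ours):

```python
from itertools import combinations

import numpy as np
from scipy import stats

def exact_int_ttest_rejection_rate(n_per_group=5, alpha=0.05, c=3.0 / 8.0):
    """Exact Type 1 error rate of a two-sample t-test on Blom (INT) scores.

    Under the null, every assignment of ranks to groups is equally likely,
    so the rejection rate is the fraction of all C(2n, n) rank splits
    whose t-test P value falls below alpha."""
    n = 2 * n_per_group
    z = stats.norm.ppf((np.arange(1, n + 1) - c) / (n - 2.0 * c + 1.0))
    splits = list(combinations(range(n), n_per_group))
    rejections = 0
    for group1 in splits:
        mask = np.zeros(n, dtype=bool)
        mask[list(group1)] = True
        if stats.ttest_ind(z[mask], z[~mask]).pvalue < alpha:
            rejections += 1
    return rejections / len(splits)
```

Consistent with the pattern described above, this exact calculation gives a rejection rate above 0.05 at α = 0.05 and exactly zero at α = 0.001, since even the most extreme rank split yields a t-test P value of only about 0.003.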
The previous simulations clearly showed that INTs do not ensure adequate control of the Type 1 error rate in all situations. In response, one might argue that one would not use an INT when the normality assumption is met, so the aforementioned findings are irrelevant. However, one never knows unequivocally whether the normality assumption is met, and assessing the fit of the sample data to the normality assumption is not very useful in small samples (Farrell and Rogers-Stewart 2006). Nevertheless, it could be conjectured that INTs will help control Type 1 error rates when the normality assumption is violated. Therefore, we followed the approach of Wang and Huang (2002) and simulated data from three distributions that were markedly non-normal. We simulated a one degree-of-freedom (df) chi-square distribution, which has a skewness of 2.83 and kurtosis of 12, and another highly skewed heavy-tailed distribution, Weibull (λ = 1; k = 0.5), which has a skewness of 6.39 and kurtosis of 76. We also simulated a heavy-tailed (kurtosis of 3), but symmetric, LaPlace distribution. Table 1 shows that the t-test does not make excess Type 1 errors with departures from the normality assumption (under the conditions we have simulated), but rather tends to be conservative. The INT approach, as well as permutation tests performed on Y (Good 1999) and the ranks (Wilcoxon 1945; Mann and Whitney 1947), returns virtually the same Type 1 error rates as when the normality assumption was met. Again, these virtually identical results across distributions arise because the INT scores are a function of the ranks alone. If a permutation test were performed on the INT scores, it would have Type 1 error rates identical to the permutation test performed on ranks.
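For small samples like these, an exact permutation test is trivial to implement; the sketch below (Python; the helper function is our own) enumerates every split of the pooled data and illustrates the granularity issue: with n1 = n2 = 5, the smallest attainable two-sided P value is 2/252 ≈ 0.0079.

```python
from itertools import combinations

import numpy as np

def exact_permutation_test(g1, g2):
    """Exact two-sided permutation test for a difference in means.

    Enumerates every way of splitting the pooled data into groups of
    the observed sizes; the P value is the fraction of splits whose
    absolute mean difference is at least as large as the observed one."""
    pooled = np.concatenate([g1, g2])
    n, n1 = pooled.size, len(g1)
    observed = abs(np.mean(g1) - np.mean(g2))
    total = pooled.sum()
    splits = list(combinations(range(n), n1))
    count = 0
    for idx in splits:
        s1 = pooled[list(idx)].sum()
        diff = abs(s1 / n1 - (total - s1) / (n - n1))
        if diff >= observed - 1e-12:  # tolerance for float comparison
            count += 1
    return count / len(splits)

# Even with the most extreme separation possible, P cannot go below 2/252:
p_min = exact_permutation_test(np.arange(1.0, 6.0), np.arange(6.0, 11.0))
```

Both extreme splits (each group entirely above or below the other) match the observed difference, which is why the minimum is 2/252 rather than 1/252.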
Thus, even when the data are markedly non-normal, performing a parametric test on the INTs does not ensure correct Type 1 error rates, and thus provides no definitive advantage over the use of untransformed data or ranks in small samples.
Finally, one might argue that, while the conclusions above are relevant to small studies such as microarray or rodent knockout studies, an INT will perform well in larger studies. To address this question, we ran simulations with larger (though still modest) sample sizes of N = 50 and 100 (25 and 50 subjects in each of two groups). With larger sample sizes, however, permutation testing becomes more difficult, and one may assume that the t-test for the untransformed scores (Y) and approximate tests for ranks will yield valid Type 1 error rates. Table 1 shows that even with these larger sample sizes, the parametric t-test still has a suppressed Type 1 error rate for asymmetric error distributions. The Type 1 error inflation for the INT scores due to the granular nature of the permutation distribution is eliminated. Therefore, the t-test performed on INT scores has approximately the correct α level, but so too does the rank-based Kruskal and Wallis (1952) approximate test. Thus, while an INT may hold the Type 1 error rate under the nominal alpha asymptotically, so too will the use of untransformed and rank data, and there is no clear advantage to using the INT in terms of controlling the Type 1 error rate of OLS tests in one-way ANOVA-type designs.
To further examine this issue, we generalized our simulation to the one-way ANOVA with three groups. This analysis is extremely common in genetic association studies with a continuous phenotype (Y) as the dependent variable and genotype as the grouping variable. One notable issue in applying the one-way ANOVA to genetic association studies is that the sample sizes per genotype group are never expected to be equal in the population. So if one randomly samples from the population, the genotype groups will not be of equal size. For example, with a di-allelic marker with a minor allele frequency of P = 0.5, the heterozygote genotype will have twice the sample size of each of the two homozygote genotypes. This imbalance in the sample sizes worsens as the allele frequencies deviate from 0.5. Unequal sample sizes in the one-way ANOVA are of particular importance because of the Behrens–Fisher problem. Specifically, the standard one-way ANOVA assumes that each group has the same population variance. The standard ANOVA F-test is robust to heterogeneous variances as long as the sample sizes are equal (e.g., Kohr and Games 1974; Zimmerman 2004). It is well known that the Type 1 error rate of the ANOVA F-test is spuriously inflated or suppressed by unequal variances combined with unequal sample sizes. The Welch (1947) separate-variances version of the F-test, which does not use a pooled variance estimate and has modified dfs, usually eliminates these effects. In fact, Zimmerman (2004) concluded that optimum protection of the Type 1 error rate is assured by using the Welch test unconditionally whenever sample sizes are unequal. Although the exact distribution of the Welch statistic is known under normality (Ray and Pitman 1961), it remains an approximate solution to the Behrens–Fisher problem. Monte Carlo studies have shown that the Welch approximate solution is not robust to departures from normality (e.g., James 1959; Yuen 1974).
Furthermore, solutions based on nonparametric or nonparametric-like procedures have been unsuccessful. For example, Pratt (1964) showed that the Mann–Whitney U and the expected normal scores test (Hájek and Sidák 1967) resulted in nonrobust Type 1 error rates. Feir-Walsh and Toothaker (1974) and Keselman et al. (1977) found the Kruskal–Wallis test (Kruskal and Wallis 1952) and expected normal scores test (McSweeney and Penfield 1969) to be substantially affected by heterogeneity of variance.
To demonstrate these issues in the context of genetic association studies, suppose a continuous phenotype (Y) that, under the null hypothesis of no mean differences among the three genotypes, is sampled either from a normal error distribution or from one of three markedly non-normal error distributions: LaPlace; χ2 (1 df); or Weibull. There is one marker of interest, G. For this simulation, allele frequencies for G were set at P = 0.5 and 0.25. Thus, for these two scenarios, the expected sample sizes for each genotype differ substantially. For P = 0.5, the homozygote (gg and GG) groups would each be expected to have 25% of the total sample size (N), while the heterozygote (Gg) group would be twice as large with 50% of N. For P = 0.25, the homozygote (gg) group would be expected to have 56.25% of N, the heterozygote (Gg) group 37.5% of N, and the homozygote (GG) group 6.25% of N. We investigated a fairly large total sample size of N = 400 and set the sample sizes for each genotype at their expected values (e.g., with P = 0.25, the sample size for the GG genotype was set at 25). In keeping with other statistical research on the Behrens–Fisher problem, we set two basic patterns of variance. One pattern is referred to as a positive pairing, in which the larger groups have the larger variances. In the case of P = 0.5, the smaller homozygote groups have the smaller variance and the larger heterozygote group has the larger variance; with P = 0.25, the largest homozygote (gg) group has the largest variance, the heterozygote (Gg) group an intermediate variance, and the smallest homozygote (GG) group the smallest variance. The second pattern is referred to as a negative pairing, in which the larger groups have the smaller variances. In the case of P = 0.5, the smaller homozygote groups have the larger variance and the larger heterozygote group has the smaller variance; with P = 0.25, the largest homozygote (gg) group has the smallest variance, the heterozygote (Gg) group an intermediate variance, and the smallest homozygote (GG) group the largest variance.
When there is a positive pairing of sample sizes and variances (lower half of Tables 2, 3), the Type 1 error rate of the ANOVA F-test is generally suppressed if the error distributions are Normal or at least symmetric (LaPlace). Also, when there is a negative pairing of sample sizes and variances (upper half of Tables 2, 3) and the error distributions are symmetric (Normal and LaPlace), the Type 1 error rate of the ANOVA F-test is generally inflated. In these situations, the Welch test applied to either the untransformed scores (Y) or to the INT scores (Blom) helps correct the Type 1 error rate. Therefore, in these circumstances, the INT may lead to a valid Type 1 error rate, but it does not demonstrate a clear advantage over using the untransformed data. Consistent with other studies (e.g., James 1959; Yuen 1974), if the error distributions are skewed, the Welch test performed on the untransformed scores may not lead to a correct Type 1 error rate regardless of the sample size and variance pattern. What is most alarming, however, is that with skewed error distributions, the INT severely worsens the Type 1 error rate inflation of the ANOVA F-test and the Welch test. As can be seen in Tables 2 and 3, the Welch test applied to the INT (Blom) has rejection rates approaching 100% when the error distributions are skewed (χ2; Weibull), regardless of the sample size and variance pattern. This is to be expected to some extent because the INT is based on ranks, and rank-based tests are known to be sensitive to heterogeneity of variance (e.g., Keselman et al. 1977; Zimmerman 1996; Zumbo and Coulombe 1997). As compared to other simulation studies on the Behrens–Fisher problem, we used a very realistic disparity among the sample sizes based on the expected frequencies of genotypes, and we did not use extreme disparities in the variances. Other studies have used variance ratios up to tenfold. In our simulations, the smallest variance was 1.0 and the largest was 2.0.
In this situation, an applied researcher, even one familiar with the Behrens–Fisher problem, might not be alarmed if the within-genotype distributions are skewed and one genotype has a standard deviation of 1 while another has a standard deviation of 1.4. However, we show that in this case the Welch test may not maintain a valid Type 1 error rate and, furthermore, that applying the Welch test to INT scores severely worsens the problem.
Wilcox (1995) persuasively argued that in many cases the compelling rationale for the use of non-parametric tests over parametric tests is not necessarily preservation of the Type 1 error rate, but rather enhancement of power. To address this issue, we simulated data under an alternative hypothesis. Specifically, the variance of Y was 1.0 for each of the two groups and a between-group mean difference of 0.5 or 0.125 was added. Table 4 shows that for smaller sample sizes the non-parametric approach of performing permutation tests on the untransformed scores (Y) and on the ranks does indeed yield more power than a parametric approach, at least for the two larger α levels. At α = 0.05, the permutation test performed on the untransformed scores has the most power. At α = 0.01, permutation tests performed on Y and the ranks have identical power, because the tails of these permutation distributions are identical. At α = 0.001, both permutation tests have zero power because they both have a minimum P value of 0.0079. With smaller sample sizes, the t-test performed on INT (Blom) scores failed to hold the Type 1 error rate (Table 1) and thus yields spuriously higher power; however, if a permutation test were performed on the INT scores, it would have power identical to the permutation test performed on ranks with N = 5 per group.
With larger sample sizes, the Type 1 error rates for t-tests performed on Y and the Blom scores and the Kruskal–Wallis approximate rank test were not inflated, and thus, it is valid to compare their power rates, even though the parametric test with a suppressed false positive rate is at a disadvantage (Bradley 1978). With a larger sample size of 25 per group, the t-test performed on INT scores and the rank-based Kruskal–Wallis approximate test demonstrated more statistical power than the parametric test (see Table 4). With a larger effect size of 0.5 standard deviation units, the rank-based test had slightly more power than the INT approach. For the extremely non-normal Weibull distribution, we reduced the effect size to 0.125 and found that the INT approach had a slight power advantage over the rank-based procedure. Whether this constitutes a consistent power advantage is debatable.
Another concern that occurs frequently is an outcome measure (or phenotype) for which many subjects have the same score. For example, when blood is drawn, it is very common for many subjects to have a zero level of some measure. This creates a problem because it violates the normality assumption of the t-test. An INT would result in many subjects having the same transformed score and would not remedy the issue of ties. In terms of rank-based tests, the permutation distribution of ranks is affected by ties. Thus, for smaller sample sizes we suggest using a permutation test or an exact test for ranks; performing the parametric t-test on INT scores does not seem appropriate. For larger samples, the Kruskal–Wallis test has a well-known correction for ties. The t-test is likely to be robust but have low power.
In a small-scale simulation, we generated normally distributed data (Y) with a mean of 0 and variance of 1 for both groups and set all values below zero equal to zero, thus creating a distribution for Y in which 50% of the values equal zero and the rest are positive (i.e., a positively skewed distribution with many ties). We generated data for two sample sizes that could make permutation tests very time consuming (two groups with equal sample sizes of n1 = n2 = 12 and 20), although one may not be convinced that the asymptotic properties of the t-test and the WMW test apply at these sample sizes. We found that the t-test, the WMW test, and the t-test applied to the INT scores maintained a valid Type 1 error rate at α = 0.05, 0.01 and 0.001. The WMW test and the t-test applied to INT scores had very similar statistical power, whereas the t-test performed on the original data had substantially less power (results not tabled).
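The censored-normal setup just described is easy to reproduce; here is a minimal sketch in Python (the function and its defaults are our own, and SciPy's asymptotic Mann–Whitney U applies the usual tie correction):

```python
import numpy as np
from scipy import stats

def simulate_zero_inflated(n1=12, n2=12, delta=0.0, seed=3):
    """Draw N(0, 1) (group 1) and N(delta, 1) (group 2) samples and set
    all negative values to zero, giving a point mass at zero (about half
    the observations) plus a positive, right-skewed tail."""
    rng = np.random.default_rng(seed)
    g1 = np.maximum(rng.normal(0.0, 1.0, n1), 0.0)
    g2 = np.maximum(rng.normal(delta, 1.0, n2), 0.0)
    return g1, g2

g1, g2 = simulate_zero_inflated()
# Tie-corrected normal approximation for the WMW test, and the plain t-test
p_wmw = stats.mannwhitneyu(g1, g2, method="asymptotic").pvalue
p_t = stats.ttest_ind(g1, g2).pvalue
```

Setting `delta` above zero generates data under the alternative, which is how power comparisons like the one above can be run.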
The intricacies of the differences among the power functions of the t-test, WMW, and the t-test performed on INTs with different sample sizes, effect sizes, and error distributions need further investigation. However, this does suggest some circumstances in which performing the t-test on INTs could be useful (i.e., analysis of extremely non-normal data from simple research designs with sample sizes large enough to make permutation testing intractable, especially when smaller effect sizes are suspected).
One might also wonder how the INT approach will perform in the more complex and sophisticated tests that geneticists often employ. The use of rank transformed data in more complicated models has many statistical issues. For example, when other blocking factors are present, Hodges and Lehmann (1962) illustrated the necessity of “aligning” the data before rank transformation. This is because in a factorial design the expected value of ranks for an observation in one cell has a non-linear dependence on the original means of the other cells (Headrick and Sawilowsky 2000; Thompson 1991). Consequently, interaction and main effect relationships are not maintained after rank transformations are performed (Blair et al. 1987). Parametric tests for interaction applied to ranks lack an invariance property, which produces distorted Type 1 and Type 2 error rates and have performed poorly compared with their normal theory counterparts (e.g., Salter and Fawcett 1993; Mansouri and Chang 1995; Toothaker and Newman 1994). Interaction tests for the rank transform (Conover and Iman 1981) have also performed poorly for a variety of other designs (Akritas 1990; Thompson 1993) including polynomial and response surface regression (Headrick and Rotou 2001), analysis of covariance (Headrick and Sawilowsky 2000; Headrick and Vineyard 2001) and repeated measures designs (Beasley 2002). These findings support Hora and Conover's (1984) warning that simply ranking the data does not result in an adequate test for non-additivity (i.e., interaction), which has implications for using rank-based INTs when evaluating epistasis and gene × environment interactions.
In a genome wide association study of asthma-related phenotypes, Dixon et al. (2007) applied an INT to each trait and subsequently examined SNP interactions, some of which involved SNPs on the same chromosome and thus potentially in linkage disequilibrium (LD) to some degree. In a QTL study of cold tolerance in sorghum, Knoll and Ejeta (2008) tested two- and three-way gene interactions after applying the van der Waerden (1952) INT to their dependent variables. In both of these studies the researchers applied the INT to the dependent variables, which normalizes the marginal distribution of the phenotype, not the residuals. Knoll and Ejeta (2008) cited Mansouri and Chang's (1995) suggestion that the use of normal scores allows for the analysis of interactions. However, Mansouri and Chang (1995) only showed that this practice is valid in balanced designs with normal error distributions, in which case the INT would not be needed. Furthermore, Mansouri and Chang showed the INT approach to be a conservative test for interaction.
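For reference, the rank-based INT is a simple monotone map: each observation is replaced by the normal quantile of its fractional rank. A minimal sketch, assuming van der Waerden's (1952) convention Φ⁻¹(r/(n + 1)) with midranks for ties:

```python
# Rank-based inverse normal transformation (van der Waerden scores).
from statistics import NormalDist

def inverse_normal_transform(x, c=0.0):
    """INT with fractional offset c: Phi^{-1}((r - c) / (n - 2c + 1)).
    c = 0 gives van der Waerden scores; c = 3/8 gives Blom scores."""
    n = len(x)
    # midranks so tied observations receive the same transformed value
    order = sorted(range(n), key=lambda i: x[i])
    rank = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and x[order[j + 1]] == x[order[i]]:
            j += 1
        for k in range(i, j + 1):
            rank[order[k]] = (i + j) / 2 + 1
        i = j + 1
    inv = NormalDist().inv_cdf
    return [inv((r - c) / (n - 2 * c + 1)) for r in rank]
```

Note that the transform depends on the data only through the ranks, which is why results for INT-transformed data track the rank-transform literature discussed above.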
To illustrate this issue in the context of genetic studies of epistasis, we conducted a simulation study of tests for gene × gene interactions in association studies. Salter and Fawcett (1993) showed that rank-based tests applied to interactions in the presence of main effects resulted in inflated Type 1 error rates. Therefore, we generated data for two models with genotypic main effects but no interaction (epistasis):
where G1 and G2 are variables representing diallelic markers coded as: −1 for the homozygote with two copies of the minor allele; 0 for the heterozygote; and 1 for the homozygote with two copies of the major allele. Allele frequencies for G1 and G2 (P1 and P2, respectively) were set at 0.5 and 0.25. LD between the G1 and G2 markers was defined by a correlation coefficient (r) and set at three levels: r = 0 (no LD), 0.3 (moderate LD), and 0.7 (strong LD). The coefficients β1, β2, and β3 were all set at one; β0 was set at zero. Thus, Model 2 has only additive genetic effects, whereas Model 3 has an additive effect for the first marker and a dominance effect for the second marker. As in our other simulations, we generated four error terms (ε): Normal, LaPlace, χ², and Weibull (λ = 1; k = 0.5).
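One simple way to generate markers with these properties (our illustration of the setup described above; the exact generation scheme used in the simulations is not shown here) is to threshold correlated latent normal variables at the Hardy–Weinberg genotype quantiles:

```python
# Generate two diallelic markers, coded -1/0/1 as in the text, with given
# minor-allele frequencies and an approximate LD correlation r, by
# thresholding correlated latent normals at Hardy-Weinberg quantiles.
import random
from math import sqrt
from statistics import NormalDist

def genotype(z, p):
    """Map a latent N(0,1) value to -1/0/1 for minor-allele frequency p."""
    inv = NormalDist().inv_cdf
    if z < inv(p * p):                     # two minor alleles: prob p^2
        return -1
    if z < inv(p * p + 2 * p * (1 - p)):   # heterozygote: prob 2p(1-p)
        return 0
    return 1                               # two major alleles: prob (1-p)^2

def simulate_markers(n, p1, p2, r, seed=1):
    rng = random.Random(seed)
    g1, g2 = [], []
    for _ in range(n):
        z1 = rng.gauss(0, 1)
        z2 = r * z1 + sqrt(1 - r * r) * rng.gauss(0, 1)
        g1.append(genotype(z1, p1))
        g2.append(genotype(z2, p2))
    return g1, g2
```

A phenotype for the additive model would then be built as y = g1 + g2 + ε, with ε drawn from one of the four error distributions.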
The untransformed data and INT scores were analyzed with two statistical models. One model assumes that only additive effects exist; the interaction is therefore analyzed as the cross-product of G1 and G2:

Y = β0 + β1G1 + β2G2 + β3(G1 × G2) + ε  (4)
It should be noted that even for data meeting the normality assumption, this model is misspecified for data generated with the Dominant Main Effect Pattern (3).
The second approach uses Cockerham's (1954) method to model both the additive and non-additive effects of G1 and G2:

Y = β0 + β1A1 + β2D1 + β3A2 + β4D2 + β5(A1 × A2) + β6(A1 × D2) + β7(D1 × A2) + β8(D1 × D2) + ε  (5)

where A1 and A2 are the additive codes and D1 and D2 the dominance codes for the two markers,
yielding tests with 2 df for each main effect. The interaction is therefore analyzed as the cross-products of the G1 and G2 main-effect terms, yielding a 4-df interaction test (H0: β5 = β6 = β7 = β8 = 0). It should be noted that this model is appropriately specified for data generated with both the Additive (2) and Dominant Main Effect (3) Patterns.
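The design matrix for this 4-df test can be sketched as follows; the specific dominance coding shown (0.5 for heterozygotes, −0.5 for homozygotes) is one common convention and is our assumption here:

```python
# One row of a Cockerham-style design matrix: additive (A) and dominance
# (D) codes for each marker plus their four cross-products.  The null
# hypothesis of no epistasis (beta5 = beta6 = beta7 = beta8 = 0) is then
# tested with a 4-df F-test against the main-effects-only model.
def cockerham_row(g1, g2):
    a1, a2 = float(g1), float(g2)                  # additive codes
    d1 = 0.5 if g1 == 0 else -0.5                  # dominance codes
    d2 = 0.5 if g2 == 0 else -0.5
    return [1.0, a1, d1, a2, d2,                   # intercept + 2-df main effects
            a1 * a2, a1 * d2, d1 * a2, d1 * d2]    # 4-df interaction block
```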
Table 5 shows that when the INT is applied to data with additive genotypic main effects (2) and skewed (χ²; Weibull) error distributions, both tests for epistasis (gene–gene interaction) have drastically inflated Type 1 error rates. Table 6 reports the Type 1 error rates for data with additive and dominant genotypic main effects but no epistasis (3). These results show that even with normally distributed errors, using a misspecified model (i.e., using the Additive Effects Only Model (4) when Dominant Main Effects exist) will lead to severely inflated Type 1 error when the markers are in LD, and this inflation worsens as the degree of LD increases (results not tabled). Applying the INT to the data does not alleviate this problem, and, as in the condition presented in Table 5, the Type 1 error rates are inflated even when the properly specified model (5) is applied to INT-transformed data with skewed errors. More subtly, these results show that INTs applied to data with symmetric error distributions (Normal; LaPlace) tend to suppress the Type 1 error rate, which will reduce statistical power to detect epistasis. Table 7 demonstrates that the Type 1 error rates for INT-transformed data remain problematic under different allele frequencies. Consistent with Wang and Huang (2002), who showed that using INTs in QTL testing can inflate the Type 1 error rate under non-additive genetic models, these simulations show that after applying an INT, tests for epistasis (i.e., gene–gene interaction, which is also a non-additive effect) can have inflated Type 1 error when error distributions are skewed, and reduced power relative to the untransformed data when error distributions are symmetric.
Several studies have shown that aligning the data before ranking yields valid tests of interactions in factorial designs (Beasley 2002; Higgins and Tashtoush 1994). In the studies we reviewed, however, the INT was applied to the phenotype, which affects the marginal distributions (main effects), not the residuals; the phenotypic data were not aligned before the INT was applied. Given the functional relationship between ranks and INTs and the issues we have noted with INTs, once the data are aligned and ranked we do not anticipate any particular advantage to using an INT over the ranks themselves. Hettmansperger and McKean (1977) and Jaeckel (1972) developed approaches based on the ranks of residuals that are robust and efficient relative to parametric methods for linear models when the normality assumption is not met. Conover (1973) has shown that functions of ranks can be derived for any alternative hypothesis, resulting in locally most powerful tests.
Finally, if one wants a non-parametric test in more complex situations, a legitimate permutation test can be constructed in the overwhelming majority of cases (see Allison et al. 2006 and references therein for nuances and caveats). For example, one caveat is that the analysis of INTs, which have ranks as their basis, fundamentally changes the null hypothesis being tested, even if one performs a parametric t-test on INTs in a simple two-group design (Bradley 1968). The permutation distribution of ranks does not change due to factors such as within-group variance or distributional shape. Thus, without assuming that the error distributions have identical variance and shape, the t-test performed on INTs evaluates a null hypothesis of identical distributions, not necessarily identical location parameters (means). A statistically significant t-test performed on an INT therefore indicates that the groups are not drawn from identical populations, but it does not necessarily indicate that the difference is strictly due to mean differences (Vargha and Delaney 1998). Furthermore, in more complicated statistical models, the null hypothesis being tested may become even less obvious (Beasley and Zumbo 2003).
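By contrast, a legitimate permutation test in the simple two-group case is easy to construct and tests the exchangeability null directly on the original scale. A minimal sketch:

```python
# Two-group permutation test on the absolute difference in means.
import random
from statistics import mean

def permutation_pvalue(x, y, n_perm=10000, seed=0):
    rng = random.Random(seed)
    observed = abs(mean(x) - mean(y))
    pooled = list(x) + list(y)
    nx = len(x)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                 # relabel observations under the null
        if abs(mean(pooled[:nx]) - mean(pooled[nx:])) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)        # add-one correction
```

Because the groups are relabeled rather than transformed, the test statistic, and hence the hypothesis being tested, stays on the scale of the phenotype.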
We demonstrated that INTs are not necessarily helpful and can in some cases make things worse. Our results indicate that there may be a few circumstances in which INTs will improve performance of testing procedures that assume normality; however, to our knowledge, data to support this proposition are largely wanting. Furthermore, it seems implausible that use of INTs could be superior to certain types of non-parametric testing (Good 2004).
Contrary to statements in the genetics literature, INTs do not necessarily maintain proper control over Type 1 error rates relative to the use of untransformed data unless they are coupled with permutation testing. Alternatives exist in the form of monotonic non-rank-based transformations, classic rank-based non-parametric tests, and computer-intensive resampling methods. With larger sample sizes, the use of INTs offers the opportunity to conduct valid inference, even in the absence of permutation, using methods in which one must assume that a specified likelihood is correct. Our results also indicate that INTs may be useful for the analysis of extremely non-normal data from simple research designs with sample sizes large enough to make permutation testing intractable, especially when smaller effect sizes are suspected. However, the analysis of INTs does fundamentally change the null hypothesis being tested, an issue that becomes increasingly complicated as the statistical model grows more complex. Furthermore, in more complex linear models that include interaction terms, rank-based INTs can inflate the Type 1 error rate in some circumstances and reduce statistical power in others. Whether INTs are useful in other genetic research contexts, relative to other available testing approaches, remains to be demonstrated. We do not imply that such utility does not exist; however, rigorous research assessing the performance of INTs in situations germane to modern genetic research is warranted.
We thank Christian Dina for asking cogent questions that inspired this commentary and for making useful comments on an earlier draft and also thank Jay Conover, Roger Berger, Brian Hicks, Rui Feng, Michael C. Neale, Goncalo Abecasis, Bernard S. Gorman and Alfred A. Bartolucci for their helpful advice or comments on earlier drafts. This article is supported in part by NIH grants P30DK056336, U54CA100949, R01ES09912, and T32HL072757.