Permutation tests (also called randomization tests) are non-parametric procedures for determining statistical significance based on rearrangements of the labels of a dataset (Edgington, 1980). A test statistic, computed from the dataset, is compared with the distribution of permutation values. These permutation values are computed in the same way as the test statistic, but under random rearrangements (permutations) of the labels of the dataset.
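
As an illustrative sketch (our own construction, not code from this article), a two-sample permutation test with the difference in group means as the test statistic can be written as follows; the function name `perm_test` and the choice of statistic are assumptions made for illustration:

```python
import random

def perm_test(x, y, n_perm=10000, seed=0):
    """Two-sample permutation test using the difference in means as the
    test statistic (an illustrative choice; any statistic can be used)."""
    rng = random.Random(seed)
    stat = sum(x) / len(x) - sum(y) / len(y)
    pooled = list(x) + list(y)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                    # random rearrangement of the labels
        px, py = pooled[:len(x)], pooled[len(x):]
        perm_val = sum(px) / len(px) - sum(py) / len(py)
        if abs(perm_val) >= abs(stat):         # two-sided comparison
            count += 1
    return count / n_perm
```

The returned fraction of permutation values that are at least as extreme as the test statistic is the empirical *P*-value.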

Permutation tests have become a widely used technique in bioinformatics. The non-parametric nature of these tests rationalizes their usability and popularity, since in many bioinformatics applications there is no solid evidence or sufficient data to assume a particular model for the obtained measurements of the biological events under investigation.

For example, Significance Analysis of Microarrays (SAM) (Tusher *et al.*, 2001) and Gene Set Enrichment Analysis (GSEA) (Subramanian *et al.*, 2005), which detect differentially expressed genes and gene sets, respectively, are two well-known techniques that use permutation tests to compute statistical significance. In these approaches, the class labels of the samples from which gene expression measurements are taken are randomly rearranged to obtain the permutation values. Besides randomization over the set of samples, permutation tests have also been performed by randomizing over the set of genes (Breitling *et al.*, 2004; Smyth, 2004). In this case, the labels are binary indicator variables that indicate whether or not a gene belongs to a particular gene set. Efron and Tibshirani (2007) suggested employing both permutation types to test the significance of gene sets. Other examples of permutation tests in bioinformatics include, but are not limited to: QTL detection (Doerge and Churchill, 1996), allelic association analysis (Zhao *et al.*, 2000) and modeling ChIP sequencing (Zhang *et al.*, 2008). In the latter case, each permutation corresponds to the simulation of a complete ChIP-seq experiment.

As in all statistical hypothesis tests, the significance of a permutation test is represented by its *P*-value. The *P*-value is the probability of obtaining a result at least as extreme as the test statistic, given that the null hypothesis is true. In permutation tests, the null hypothesis is that the labels assigning samples to classes are interchangeable (Edgington, 1980). Significantly low *P*-values indicate that the labels are not interchangeable and that the original label configuration is relevant with respect to the data. The *P*-value is assessed by performing all possible permutations and computing the fraction of permutation values that are at least as extreme as the test statistic obtained from the unpermuted data.
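
For small datasets, this exhaustive assessment can be carried out literally by enumerating every possible label assignment. The following sketch (our own illustrative construction, again using the difference in means as the statistic) does so for two groups:

```python
from itertools import combinations

def exact_pvalue(x, y):
    """Exact permutation P-value: enumerate every way of splitting the
    pooled data into two groups of the original sizes and compute the
    fraction of splits whose statistic is at least as extreme as the
    observed one (two-sided difference in means)."""
    pooled = list(x) + list(y)
    n, m = len(x), len(pooled)
    stat = abs(sum(x) / n - sum(y) / len(y))
    count = total = 0
    for idx in combinations(range(m), n):
        group = [pooled[i] for i in idx]
        rest = [pooled[i] for i in range(m) if i not in idx]
        val = abs(sum(group) / n - sum(rest) / len(rest))
        if val >= stat - 1e-12:   # tolerance for floating-point comparison
            count += 1
        total += 1
    return count / total          # never zero: the identity split always counts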

However, in practical situations it is (by far) not feasible to perform all possible permutations. For example, class labels that represent two classes with 50 samples each can be permuted in 100!/(50!·50!) ≈ 10^{29} different ways. Therefore, the *P*-value is approximated by computing a limited number of permutations, say *N*, and then computing the fraction of the *N* permutation values that are at least as extreme as the test statistic. Usually, a pseudocount is added to avoid *P*-values of zero, which occur when the test statistic is never surpassed by the permutation values. Theoretically, a *P*-value of zero is not possible in the context of permutation tests: the minimum is 1/*N*_{all}, where *N*_{all} is the number of all possible permutations. This is because one of the permuted label configurations is identical to the original one, under which the test statistic is computed.
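
One common convention for the pseudocount (an assumption on our part; the text above only states that a pseudocount is added) is to add one to both the numerator and the denominator, i.e. (*b* + 1)/(*N* + 1), where *b* counts the permutation values at least as extreme as the test statistic:

```python
def empirical_pvalue(stat, perm_values):
    """Empirical P-value with a pseudocount: (b + 1) / (N + 1).
    b = number of permutation values at least as extreme as the observed
    statistic; the +1 guarantees P > 0, mirroring the fact that the
    identity permutation always reproduces the test statistic."""
    b = sum(1 for v in perm_values if v >= stat)
    n = len(perm_values)
    return (b + 1) / (n + 1)
```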

This empirical approximation of computing *P*-values has two direct consequences. First, the resolution of obtainable *P*-values is 1/*N*. Second, and more importantly, the smallest achievable *P*-value is 1/*N*. This means that a very large number of permutations is required to accurately estimate a small *P*-value. In general, >*N* permutations are required to reliably estimate a *P*-value of 1/*N*. (This is shown more extensively later in the manuscript.) Especially in bioinformatics, it is crucial to be able to accurately determine small *P*-values. This is due to the typically huge numbers of objects [e.g. all genes, gene sets or single nucleotide polymorphisms (SNPs)] that are tested simultaneously, which requires large multiple testing corrections to prevent large numbers of false positives. Clearly, other significance scores that are based on the *P*-values, such as the false discovery rate (FDR), will be meaningless when the *P*-values are not estimated correctly.
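
A small simulation (our own construction) makes the second consequence concrete: when the true *P*-value equals 1/*N*, an estimate based on *N* permutations is exactly zero in a large fraction of runs:

```python
import random

def zero_estimate_fraction(true_p=1e-3, n_perm=1000, n_repeats=2000, seed=1):
    """Simulate estimating a small P-value with N permutations.  Each
    permutation value exceeding the statistic is modeled as a Bernoulli
    event with probability true_p; return the fraction of simulated tests
    whose empirical estimate is exactly zero (no exceedance observed)."""
    rng = random.Random(seed)
    zero_runs = 0
    for _ in range(n_repeats):
        exceed = sum(1 for _ in range(n_perm) if rng.random() < true_p)
        if exceed == 0:
            zero_runs += 1
    return zero_runs / n_repeats
```

With *N* = 1000 and a true *P*-value of 0.001, roughly (1 − 0.001)^1000 ≈ e^{−1} ≈ 37% of runs never see an exceedance; only a much larger *N* makes a zero estimate rare.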

In this work, we propose to estimate small permutation test *P*-values using extreme value theory (Gumbel, 1958). The set of extreme (very large or very small) permutation values that forms the tail of the distribution of permutation values is modeled as a generalized Pareto distribution (GPD). Pickands (1975) demonstrated that the GPD approximates the distribution of the extreme values of a set of independent and identically distributed (i.i.d.) random variables, i.e. those values that exceed a particular (high) threshold. Applications of the GPD to model extreme values are traditionally found in climatology, to model extreme weather such as floods, and in financial risk management, to model extreme losses and insurance claims. In our case, the GPD, which is fitted on the extreme permutation values, is evaluated at the value of the test statistic to estimate the *P*-value of the permutation test. Both theoretical probability distributions and gene expression datasets are employed to demonstrate that the proposed tail approximation strategy leads to an accurate estimation of the correct *P*-value using far fewer permutations compared with the standard empirical approach.
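
A minimal sketch of this tail-approximation idea is given below, using `scipy.stats.genpareto`. The specific choices made here — 250 exceedances, a midpoint threshold, maximum-likelihood fitting, and falling back to the empirical estimate when ten or more permutation values reach the statistic — are illustrative assumptions, not necessarily the exact procedure developed later in this article:

```python
import numpy as np
from scipy.stats import genpareto

def gpd_tail_pvalue(stat, perm_values, n_exceed=250):
    """Estimate a small right-tail permutation P-value by fitting a
    generalized Pareto distribution (GPD) to the largest permutation
    values (the exceedances over a high threshold)."""
    perm_values = np.sort(np.asarray(perm_values, dtype=float))
    n = len(perm_values)
    # If enough permutation values reach the statistic, the plain
    # empirical estimate is already reliable.
    if (perm_values >= stat).sum() >= 10:
        return (perm_values >= stat).mean()
    # Threshold halfway between the n_exceed largest values and the rest.
    threshold = (perm_values[-n_exceed] + perm_values[-n_exceed - 1]) / 2.0
    exceedances = perm_values[perm_values > threshold] - threshold
    # Maximum-likelihood GPD fit with the location parameter fixed at zero.
    shape, _, scale = genpareto.fit(exceedances, floc=0)
    # P(value > stat) = P(value > threshold) * P(excess > stat - threshold)
    tail_prob = len(exceedances) / n
    return tail_prob * genpareto.sf(stat - threshold, shape, loc=0, scale=scale)
```

Because the fitted GPD is continuous, this estimate is not limited to multiples of 1/*N* and can resolve *P*-values far below 1/*N*.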