


Methods Mol Biol. Author manuscript; available in PMC 2010 October 27.


PMCID: PMC2965033

NIHMSID: NIHMS234771

The Rockefeller University, Center for Clinical and Translational Science, 1230 York Ave Box 322, New York, NY 10021, U.S.A.

Knut M. Wittkowski: kmw@rockefeller.edu

The publisher's final edited version of this article is available at Methods Mol Biol


In 2003, the completion of the Human Genome Project[1] together with advances in computational resources[2] were expected to launch an era where the genetic and genomic contributions to many common diseases would be found. In the years following, however, researchers became increasingly frustrated as most reported ‘findings’ could not be replicated in independent studies[3]. To improve the signal/noise ratio, it was suggested to increase the number of cases to be included to tens of thousands[4], a requirement that would dramatically restrict the scope of personalized medicine. Similarly, there was little success in elucidating the gene–gene interactions involved in complex diseases or even in developing criteria for assessing their phenotypes. As a partial solution to these enigmata, we here introduce a class of statistical methods as the ‘missing link’ between advances in genetics and informatics. As a first step, we provide a unifying view of a plethora of non-parametric tests developed mainly in the 1940s, all of which can be expressed as u-statistics. Then, we will extend this approach to reflect categorical and ordinal relationships between variables, resulting in a flexible and powerful approach to deal with the impact of (1) multi-allelic genetic loci, (2) poly-locus genetic regions, and (3) oligo-genetic and oligo-genomic collaborative interactions on complex phenotypes.

As functional genetics and genomics advance and prices for sequencing and expression profiling drop, new possibilities arise, but also new challenges. The initial successes in identifying the causes of rare, mono-causal diseases serve as a proof-of-concept that new diagnostics and therapies can be developed. Common diseases have even more impact on public health, but as they typically involve genetic epistasis, genomic pathways, and proteomic patterns, new requirements for database systems and statistical analysis tools are necessitated.

Biological systems are regulated by various, often unknown feedback loops, so that the functional form of the relationship between measurement and activity or efficacy is typically unknown, except within narrowly controlled experiments. Still, many statistical methods are based on the linear model[5], i.e., on the assumption that this (unknown) relationship is linear. The linear model has the advantage of computational efficiency. Moreover, assuming independence and additivity yields conveniently bell-shaped distributions and parameters of alluring simplicity. The prayer that biology be linear, independent, and additive, however, is rarely answered, and the Central Limit Theorem (CLT) neither applies to small samples nor rescues one from model misspecification.

When John Arbuthnot (1667–1735) argued that the chance of more male than female babies being born in London for the last 82 consecutive years was only 1/2^{82} “*if mere Chance govern’d*”, and, thus, “*it is Art, not Chance, that governs*”,[6] he was arguably the first to ever apply the concept of hypothesis testing to obtain what is now known as a ‘p-value’ and, interestingly, he applied it to a problem in molecular biology.

The test, now known as the sign or McNemar[7] test (see below), belongs to the class of ‘non-parametric’ tests, which differ from their ‘parametric’ counterparts in that the distribution of the data is not assumed to be known, except for a single parameter to be estimated. As they require fewer unjustifiable assumptions, non-parametric methods are, in general, more adequate for biological systems. Moreover, they tend to be easier to understand. The median, for instance, is easily explained as the cut-off with as many observations above as below, and the interquartile range as the range with 25% of the data below and 25% above. In contrast to the mean and standard deviation, these ‘quartiles’ do not change with (monotone) scale transformations (such as log and inverse), are robust to outliers, and often reflect more closely the goals of the investigator[8]: ‘Do people in this group *tend* to score higher than people in that?’ ‘Is the order on this variable *similar to the order* on that?’ If the questions are ordinal, it seems preferable to use ordinal methods to answer them[9].

The reason for the dominance of methods based on the linear model, ranging from the t-test(s) and analysis of variance (ANOVA) to stepwise linear regression and factor analysis, is not their biological plausibility, but merely their computational efficiency. While mean and standard deviation are easily computed with a pocket calculator, quartiles are not. Another disadvantage of ordinal methods is the relative scarcity of experimental designs that can be analyzed. Moreover, non-parametric methods are often presented as a confusing ‘hodgepodge’ of seemingly unrelated methods. On the one hand, equivalent methods can have different names, such as the Wilcoxon (rank-sum)[10], the Mann–Whitney (u)[11], and the Kruskal–Wallis[12] test (when applied to two groups). On the other hand, substantially different tests may be attributed to the same author, such as Wilcoxon’s rank-sum and signed-rank tests[10].

In the field of molecular biology, the need for computational (bioinformatics) approaches in response to a rapidly evolving technology producing large amounts of data from small samples has created a new field of “in silico” analyses, the term a hybrid of the Latin *in silice* (from *silex, silicis* m.: hard stone[13, 14]) and phrases such as *in vivo, in vitro*, and *in utero*, which seem to be better known than, for instance, *in situ, in perpetuum, in memoriam, in nubibus*, and *in capite*. Many in silico methods have added even more confusion to the field. For instance, a version of the t-test with a more conservative variance estimate is often referred to as SAM (significance analysis for microarrays)[15], and adding a similar “fudge factor” to the Wilcoxon signed rank test yields SAM-RS[16]. Referring to ‘new’ in silico approaches by names intended to highlight the particular application or manufacturer, rather than the underlying concepts, has often prevented these methods from being thoroughly evaluated. With sample sizes increasing and novel technologies (e.g., RNA-seq) replacing less robust technologies (e.g., microarrays)[17] that often relied on empirically motivated standardization and normalization, the focus can now shift from *ad hoc* bioinformatics approaches to well-founded biostatistical concepts. Below we will present a comprehensive theory for a wide range of non-parametric *in silice* methods based on recent advances in understanding the underlying fundamentals, novel methodological developments, and improved algorithms.

For single-sample designs, independence of the observations constituting the sample is a key requirement when applying any statistical test to observed data. All asymptotic (large-sample) versions of the tests discussed here are based on the CLT, which applies when (i) *many* (ii) *independent* observations contribute in an (iii) *additive* fashion to the test statistic. Of these three requirements, additivity is typically fulfilled by the way the test statistic is formed, which may, for instance, be based on the sum of the data, often after rank or log transformation. Independence, then, applies to how the data are aggregated into the test statistic. Often, one will allow only a single child per family to be included, or at least only one of a pair of identical twins. Finally, the rule that ‘more is better’ reflects the Law of Large Numbers; the accuracy of the estimates increases with the square root of the number of independent samples included.

Finally, one will try to minimize the effect of unwanted ‘confounders’. For instance, when one compares the effects of two interventions within the same subject, one would typically aim at a ‘cross-over’ design, where the two interventions are applied in a random order to minimize ‘carry-over’ effects, where the first intervention’s effects might affect the results observed under the subsequent intervention.

Another strategy is ‘stratification’, where subjects are analyzed as sub-samples forming ‘blocks’ of subjects that are comparable with respect to the confounding factor(s), such as genetic factors (e.g., race and ethnicity), exposure to environmental factors (smoking) or experimental conditions.

A third strategy would be to ‘model’ the confounding factor, e.g., by subtracting or dividing by its presumed effect on the outcome. As the focus of this chapter is on non-parametric (aka ‘model-free’) methods, this strategy will be used only sparingly, i.e., when the form of the functional relationship among the confounding and observed variables is, in fact, known with certainty.

The so-called sign test is the simplest of all statistical methods. It applies to a single sample of binary outcomes from independent observations, such as the sexes ‘M’ or ‘F’ in Arbuthnot’s case[6], or ‘correct’ vs. ‘false’ in R.A. Fisher’s famous (Lady Tasting) Tea test[7], where a lady claims to be able to decide whether milk was added to the tea, or *vice versa*. When applied to paired observations (‘increase’ vs. ‘decrease’), the sign test is often named after McNemar[18]. (Here, we assume that all sexes, answers, and changes can be unambiguously determined. The case of ‘ties’ will be discussed in Section 2.1.3 below.)

When the observations are independent, the number of positive ‘signs’ follows a binomial distribution, so that the probability of results deviating at least as much as the result observed from the result expected under the null hypothesis *H*_{0} (the ‘p-value’) is easily obtained.

Let *X* be the sum of positive signs among *n* replications, each having a ‘success’ probability of *p* (e.g., *p* = ½). Then, the probability of having *k* ‘successes’ (see Fig. 2.1) is

$$\mathrm{b}_{n,p}(k)=\mathrm{P}\left(X=k\,|\,n,p\right)=\binom{n}{k}\,p^{k}{(1-p)}^{n-k}=\frac{n!}{k!\,(n-k)!}\,p^{k}{(1-p)}^{n-k}.$$

The (one-sided) p-value is easily computed as $\sum_{x=k}^{n}\mathrm{b}_{n,p}(x)$. Let $x_{i}\in\{0,1\}$ denote the individual observations; then, for binomial random variables, the parameter estimate $\hat{p}=X/n$ and its standard deviation
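As a minimal sketch (in Python rather than the R code referenced later in this chapter; the function names are our own), the exact one-sided p-value can be computed directly from the binomial probabilities:

```python
from math import comb

def binom_pvalue_one_sided(k, n, p=0.5):
    """One-sided p-value: P(X >= k) = sum of b_{n,p}(x) for x = k, ..., n."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k, n + 1))

# Arbuthnot's argument: 82 years, each with more male than female births,
# has probability (1/2)^82 under mere chance.
p_arbuthnot = binom_pvalue_one_sided(82, 82)

# A small sign test: nine positive signs among ten observations.
p_small = binom_pvalue_one_sided(9, 10)   # 11/1024 ~ 0.011
```

For `p` = ½ the sum is a plain count of binomial coefficients over 2^n, so the ‘exact’ test indeed needs only elementary arithmetic.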

$$s_{n}(\hat{p})=\sqrt{\tfrac{1}{n}}\sqrt{\tfrac{1}{n}\sum\nolimits_{i=1}^{n}{(x_{i}-\bar{x})}^{2}}=\sqrt{\tfrac{1}{n}}\sqrt{\hat{p}\,{(1-\hat{p})}^{2}+(1-\hat{p})\,{(0-\hat{p})}^{2}}=\sqrt{\tfrac{1}{n}}\sqrt{\hat{p}\,(1-\hat{p})}$$

are also easily derived. More than 100 years after Arbuthnot[6], Gauss[19] proved the CLT, by which the binomial distribution can be approximated asymptotically (as.) with the familiar Gaussian ‘normal’ distribution

$$\sqrt{n}\,\frac{\hat{p}-p_{0}}{\sqrt{\hat{p}(1-\hat{p})}}\;\sim_{\text{as.}}\;\mathrm{N}(0,1),\quad\text{i.e.,}\quad \mathrm{P}\left(\sqrt{n}\,\frac{\hat{p}-p_{0}}{\sqrt{\hat{p}(1-\hat{p})}}\le z\right)\underset{n\to\infty}{\longrightarrow}\int_{-\infty}^{z}\frac{1}{\sqrt{2\pi}}\,e^{-u^{2}/2}\,du$$

The former (exact) form of the sign test can be derived and applied using only elementary mathematical operations, while the latter, like its parametric equivalent, requires theoretical results that most statisticians find challenging to derive, including an integral without a closed-form solution. For *p*_{0} = ½, McNemar[18] noted that this yields a (two-sided) test statistic of a particularly simple form

$$\frac{{(N_{+}-N_{-})}^{2}}{N_{+}+N_{-}}\;\sim_{\text{as.}}\;\chi_{1}^{2}$$
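A hedged sketch of this statistic (Python, our own function names; the χ²₁ tail probability uses the identity P(χ²₁ > x) = erfc(√(x/2)), so no statistics library is needed):

```python
from math import erfc, sqrt

def mcnemar_statistic(n_pos, n_neg):
    """(N+ - N-)^2 / (N+ + N-), asymptotically chi-square with 1 df."""
    return (n_pos - n_neg) ** 2 / (n_pos + n_neg)

def chi2_1_sf(x):
    """Upper-tail probability of chi^2_1: P(chi^2_1 > x) = erfc(sqrt(x / 2))."""
    return erfc(sqrt(x / 2))

stat = mcnemar_statistic(9, 1)   # (9 - 1)^2 / 10 = 6.4
p = chi2_1_sf(stat)              # ~ 0.011
```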

In reality, the situations where the sign test is to be applied are often more complicated, because the sign of an observation (or pair of observations) may be neither positive nor negative. When such ‘ties’ are present, the McNemar test[18] yields the same results for two distinctively different situations:

- either we may have nine positive and one negative observation out of a total of ten,
- or we may have the same nine positive and one negative observation, but also 9,990 tied observations. A ‘tie’ might be a subject being heterozygous or, in Arbuthnot’s case, having gonadal mosaicism, a rare condition where a subject has both XX and XY cells.

Ignoring ties is often referred to as ‘correction for ties’. Still, one may feel uncomfortable with a result being ‘significant’, although only nine observations ‘improved’ and one even ‘worsened’, while 9,990 were (more or less) ‘unchanged’.

This observation has long puzzled statisticians. In fact, several versions of ‘the’ sign test have been proposed[20]. Originally, Dixon, in 1946[21], suggested that ties be split equally between the positive and negative signs, but in 1951[22] followed McNemar[18] in suggesting that ties be dropped from the analysis.

To structure the discussion, we divide the estimated proportion of positive signs by its standard deviation to achieve a common asymptotic distribution. Let *N*_{−}, *N*_{+}, and *N*_{=} denote the number of negative, positive, and tied observations, respectively, among a total of *n*. The original sign test[21] can then be written as

$$T^{*}=\frac{(N_{+}+p_{0}N_{=})-p_{0}n}{\sqrt{p_{0}(1-p_{0})\,n}}=\frac{(N_{+}+p_{0}N_{=})/n-p_{0}}{\sqrt{p_{0}(1-p_{0})/n}}\underset{\{\text{if }p_{0}=1/2\}}{=}\frac{N_{+}-N_{-}}{\sqrt{n}}\;\sim_{\text{as.}}\;\mathrm{N}(0,1)$$

and the alternative[18, 22] as

$$T=\frac{(N_{+}+p_{0}N_{=})-p_{0}n}{\sqrt{p_{0}(1-p_{0})(n-N_{=})}}=\frac{N_{+}/(n-N_{=})-p_{0}}{\sqrt{p_{0}(1-p_{0})/(n-N_{=})}}\underset{\{\text{if }p_{0}=1/2\}}{=}\frac{N_{+}-N_{-}}{\sqrt{N_{+}+N_{-}}}\;\sim_{\text{as.}}\;\mathrm{N}(0,1)$$

The first and last terms in the above series of equations show that the ‘correction for ties’ increases the test statistic by a factor of $\sqrt{n/(n-{N}_{=})}$, thereby yielding more ‘significant’ results. The center term is helpful in understanding the nature of the difference. *T*^{*} distributes the ties as the numbers of observations expected under H_{0} and then uses the unconditional variance[21]. *T* excludes the ties from the analysis altogether, thereby ‘conditioning’ the test statistic.
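The difference is easy to demonstrate numerically; a small sketch (Python, function names of our choosing) of both versions applied to the two situations above:

```python
from math import sqrt

def sign_test_unconditional(n_pos, n_neg, n_tie, p0=0.5):
    """T*: ties distributed as expected under H0, unconditional variance."""
    n = n_pos + n_neg + n_tie
    return ((n_pos + p0 * n_tie) - p0 * n) / sqrt(p0 * (1 - p0) * n)

def sign_test_conditional(n_pos, n_neg, n_tie, p0=0.5):
    """T: ties dropped, variance conditional on the observed ties."""
    m = n_pos + n_neg
    return (n_pos - p0 * m) / sqrt(p0 * (1 - p0) * m)

# Nine positive, one negative, no ties: the two versions agree ...
t_star_a = sign_test_unconditional(9, 1, 0)     # 8/sqrt(10) ~ 2.53
t_a      = sign_test_conditional(9, 1, 0)
# ... but with 9990 additional ties, only T still calls the result 'significant'.
t_star_b = sign_test_unconditional(9, 1, 9990)  # 8/sqrt(10000) = 0.08
t_b      = sign_test_conditional(9, 1, 9990)    # unchanged: 8/sqrt(10)
```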

The sign test is often applied when the distribution of the data (or differences of data) cannot be assumed to be symmetric. (Otherwise, the paired t-test would be a robust alternative[5].) Assume, for simplicity of the argument, that the differences follow a triangular distribution with a median of zero and a symmetric discretization interval with (band-) width 2*b* around the origin (Fig. 2.2).

Then, with *T*, ‘significance’ increases with the discretization bandwidth, i.e., with the inaccuracy of the measurements; the test statistic increases from 0.00 to 2.34 (Table 2.1). With *T*^{*}, in contrast, the estimate for the probability of a positive sign remains within a narrow limit between 0.4 and 0.5 and the test statistic never exceeds the value of 1.0.

To resolve the seeming discrepancy between theoretical results suggesting a specific treatment of ties as ‘optimal’ and the counterintuitive consequences of using this strategy, one needs to consider the nature of ties. In genetics, for instance, tied observations can often be assumed to indicate identical phenomena (e.g., mutation present or absent[23]). Often, however, a thorough formulation of the problem refers to an underlying continuous or unmeasurable factor, rather than the observed discretized variable. For instance, ties may be due to rounding of continuous variables (temperature, Fig. 2.2) or to the use of discrete surrogate variables for continuous phenomena (parity for fertility), in which case the unconditional sign test should be used. Of course, when other assumptions can be reasonably made, such as the existence of a relevance threshold[24], or a linear model for the paired comparison preference profiles[25], ties could be treated in many other ways[26].

Family-based association tests (FBAT) control for spurious associations between disease and specific marker alleles due to population stratification[27]. Thus, the transmission/disequilibrium test (TDT) for biallelic markers, proposed in 1993 by Spielman *et al.*[28] has become one of the most frequently used statistical methods in genetics. Part of its appeal stems from its computationally simple form (*b* − *c*)^{2}/(*b* + *c*), which resembles the conditional sign test (Section 2.1.3). Here, *b* and *c* represent the number of transmitted wild-type (P) and risk (Q) alleles, respectively, whose parental origin can be identified in affected children.

Let *n*_{XY} denote the number of affected children with genotype XY, stratified below by parental mating type.

To compare “*the frequency with which [among heterozygous parents] the associated allele P or its alternate Q is transmitted to the affected offspring*”[28], the term *b* − *c* in the numerator of the TDT can be decomposed into the contributions from families stratified by the parental mating types

$$b-c={\left[{n}_{\text{PQ}}-{n}_{\text{QQ}}\right]}_{\text{PQ}\sim\text{QQ}}+{\left[2({n}_{\text{PP}}-{n}_{\text{QQ}})+({n}_{\text{PQ}}-{n}_{\text{PQ}})\right]}_{\text{PQ}\sim\text{PQ}}+{\left[{n}_{\text{PP}}-{n}_{\text{PQ}}\right]}_{\text{PP}\sim\text{PQ}}$$

Of course [*n*_{PQ}−*n*_{PQ}]_{PQ~PQ} = 0, so that this equation can be rewritten as[31]

$$b-c={\left[{n}_{\text{PQ}}-{n}_{\text{QQ}}\right]}_{\text{PQ}~\text{QQ}}+2{\left[{n}_{\text{PP}}-{n}_{\text{QQ}}\right]}_{\text{PQ}~\text{PQ}}+{\left[{n}_{\text{PP}}-{n}_{\text{PQ}}\right]}_{\text{PP}~\text{PQ}}.$$

(2.1)

As an ‘exact tie’[29, 30], the term [*n*_{PQ}−*n*_{PQ}]_{PQ~PQ} can be ignored when computing the variance of (2.1), because these children are as non-informative as the alleles transmitted from homozygous parents.

As noted above, independence of observations is a key concept in building test statistics. While “*the contributions from both parents are independent*”[28], the observations are not. Because the effects of the two alleles transmitted to the same child are subject to the same genetic and environmental confounders, one does not have a sample of independently observed alleles to build the test statistic on, but “*a sample of n affected children*”[28]. As a consequence, the factor ‘2’ in (2.1) does not increase the sample size[32], but is a weight indicating that the PP and the QQ children are ‘two alleles apart’, which implies a larger risk difference under the assumption of co-dominance. Each of the three components in (2.1) follows a binomial distribution with probability of success equal to ½, so that the variance of (2.1) follows from “*the standard approximation to a binomial test of the equality of two proportions*”[28]

$${\sigma}_{0}^{2}(b-c)={\left[{n}_{\text{PQ}}+{n}_{\text{QQ}}\right]}_{\text{PQ}\sim\text{QQ}}+4{\left[{n}_{\text{PP}}+{n}_{\text{QQ}}\right]}_{\text{PQ}\sim\text{PQ}}+{\left[{n}_{\text{PP}}+{n}_{\text{PQ}}\right]}_{\text{PP}\sim\text{PQ}}.$$

(2.2)

Dividing the square of estimate (2.1) by its variance (2.2) yields a stratified McNemar test (SMN), which combines the estimates and variances of three McNemar tests in a fashion typical for stratified tests[33].

$$\frac{{(b-c)}^{2}}{{\sigma}_{0}^{2}(b-c)}=\frac{{\left({\left[{n}_{\text{PQ}}-{n}_{\text{QQ}}\right]}_{\text{PQ}\sim\text{QQ}}+2\,{\left[{n}_{\text{PP}}-{n}_{\text{QQ}}\right]}_{\text{PQ}\sim\text{PQ}}+{\left[{n}_{\text{PP}}-{n}_{\text{PQ}}\right]}_{\text{PP}\sim\text{PQ}}\right)}^{2}}{{\left[{n}_{\text{PQ}}+{n}_{\text{QQ}}\right]}_{\text{PQ}\sim\text{QQ}}+4\,{\left[{n}_{\text{PP}}+{n}_{\text{QQ}}\right]}_{\text{PQ}\sim\text{PQ}}+{\left[{n}_{\text{PP}}+{n}_{\text{PQ}}\right]}_{\text{PP}\sim\text{PQ}}}\;\sim_{\text{as.}}\;{\chi}_{1}^{2}$$

(2.3)

(‘Stratification’, i.e., blocking children by parental mating type, here is merely a matter of computational convenience. Formally, one could treat each child as a separate block, with the same results.) The TDT, in contrast, divides the same numerator by a “*variance estimate*”[28]

$${\widehat{\sigma}}_{0}^{2}(b-c)=b+c={\left[{n}_{\text{PQ}}+{n}_{\text{QQ}}\right]}_{\text{PQ}\sim\text{QQ}}+\left\{2{\left[{n}_{\text{PP}}+{n}_{\text{QQ}}\right]}_{\text{PQ}\sim\text{PQ}}+2{\left[{n}_{\text{PQ}}\right]}_{\text{PQ}\sim\text{PQ}}\right\}+{\left[{n}_{\text{PP}}+{n}_{\text{PQ}}\right]}_{\text{PP}\sim\text{PQ}}$$

(2.4)

which relies on the counts of non-informative PQ children. Replacing the observed *n*_{PP}+*n*_{QQ} by its estimate *n*_{PQ} under *H*_{0} would require an adjustment, similar to the replacement of the Gaussian by the t-distribution when the empirical standard deviation is used with the t-test[34]. Hence the SMN has advantages over the TDT for finite samples[31], in part because it has a smaller variance (more, and thus lower, steps; see Fig. 2.3) under *H*_{0} (see Table 2.2).
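To make the comparison concrete, here is a minimal sketch (Python; the counts and function names are hypothetical illustrations, not data from this chapter) of both statistics computed from the same mating-type-stratified counts:

```python
def smn_statistic(a1, a2, b1, b2, c1, c2):
    """Stratified McNemar (SMN): squared numerator (2.1) over variance (2.2).
    a1, a2: n_PQ, n_QQ children of PQ~QQ matings;
    b1, b2: n_PP, n_QQ children of PQ~PQ matings;
    c1, c2: n_PP, n_PQ children of PP~PQ matings."""
    num = (a1 - a2) + 2 * (b1 - b2) + (c1 - c2)
    var = (a1 + a2) + 4 * (b1 + b2) + (c1 + c2)
    return num ** 2 / var

def tdt_statistic(a1, a2, b1, b2, c1, c2, b_pq):
    """TDT: the same numerator, divided by b + c as in (2.4);
    b_pq counts the (non-informative) PQ children of PQ~PQ matings."""
    num = (a1 - a2) + 2 * (b1 - b2) + (c1 - c2)
    var = (a1 + a2) + 2 * (b1 + b2) + 2 * b_pq + (c1 + c2)
    return num ** 2 / var

# hypothetical counts
smn = smn_statistic(30, 20, 12, 6, 25, 15)       # 32^2 / 162 ~ 6.32
tdt = tdt_statistic(30, 20, 12, 6, 25, 15, 20)   # 32^2 / 166 ~ 6.17
```

With these counts the two denominators differ only through the PQ~PQ stratum, where the SMN weights the informative children by 4 while the TDT counts the non-informative heterozygous children instead.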

Because the TDT estimates the variance under the assumption of co-dominance, it overestimates the variance (has lower power) for alleles with high dominance, i.e., when being heterozygous for the risk allele carries (almost) as much risk as being homozygous[31]. On the other hand, the TDT underestimates the variance (yielding higher power) for recessive alleles[31].

With an understanding of the role of parental mating types, the sensitivity of the SMN can easily be focused on either dominant or recessive alleles. When screening for recessive alleles, one excludes trios where one parent is homozygous for the wild type and assigns equal weights to the remaining trios[35]. Conversely, when screening for dominant alleles, one excludes trios where one parent is homozygous for the putative risk allele.

A better understanding of the statistical principles underlying family-based association studies not only leads to test statistics with better small-sample properties and better-targeted tests for alleles with low or high dominance, but also to extensions that open new areas of application.

For decades it has been known that the HLA-DRB1 gene is a major factor in determining the risk for multiple sclerosis[36]. As HLA-DRB1 is one of the few genes where more than two alleles per locus are routinely observed, the SMN is easily generalized to multi-allelic loci by increasing the number of parental mating-type strata and identifying within each stratum the number of informative children[35]. In early 2009, applying the extension of the SMN for multi-allelic loci (Table 2.3) allowed the narrowing down of risk determinants to amino acid 13 at the center of the HLA-DRB1 P4 binding pocket, while amino acid 60, which had earlier been postulated based on structural features[37], was seen as unlikely to play a major role [35, 38] (Fig. 2.4).

Another simple method based on u-statistics is the median, on which the Bioconductor package Harshlight, a program to identify and mask artifacts on Affymetrix microarrays, is based[35, 39, 40]. After extensive discussion[41, 42], this approach was recently adopted for Illumina bead arrays as “*BASH: a tool for managing BeadArray spatial artifacts*”[43] (Fig. 2.5).

Fig. 2.6 demonstrates the relationships between the different versions of the sign and McNemar tests and how easily even their exact versions, where all possible permutations of the data need to be considered, can be implemented in code compatible with R and S-PLUS. For dominant and recessive alleles, the parameters wP/wQ are set to .0/.5 and .5/.0, respectively. Harshlight and BASH (as part of the beadarray package) are available from http://bioconductor.org.

Our aim is to first develop a computationally efficient procedure for scoring data based on ordinal information only. We will not make any assumptions regarding the functional relationship between variable and the latent factor of interest, except that the variable has an orientation, i.e., that an increase in this variable is either ‘good’ or ‘bad’. Without loss of generality, we will assume that for each of the variables ‘more’ means ‘better’.

Here, the index *k* is used for subjects. Whenever this does not cause confusion, we will identify patients with their vector of *L* ≥ 1 observations to simplify the notation. The scoring mechanism is based on the principle that each patient *x*_{k} is compared with every other patient, counting the comparisons ‘won’ minus the comparisons ‘lost’:

$$\mathrm{u}({x}_{k})={\sum}_{k\prime}\mathrm{I}\left({x}_{k\prime}<{x}_{k}\right)-{\sum}_{k\prime}\mathrm{I}\left({x}_{k\prime}>{x}_{k}\right)$$

(2.5)

When Gustav Deuchler[44] in 1914 developed what more than 30 years later would become known as the Mann–Whitney test[11] (although he missed one term in the asymptotic variance), he proposed a computational scheme of striking simplicity, namely to create a square table with the data as both the row and the column headers (Fig. 2.7). With this, the row sums of the signs of the paired comparisons yield the u-scores.

Since the matrix is skew-symmetric, only one half (minus the main diagonal) needs to be filled. Kehoe and Cliff[46] pointed out even more redundancies. The interactive program Interord computes which additional entries are implied by transitivity from the entries already made.
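Deuchler's scheme translates directly into code; a brief sketch (Python, illustrative data) of eq. (2.5):

```python
def u_scores(xs):
    """u(x_k) = #{k': x_k' < x_k} - #{k': x_k' > x_k} (eq. 2.5):
    row sums of the signs of all paired comparisons."""
    def sign(a, b):
        return (b < a) - (b > a)   # +1, -1, or 0 for a tie
    return [sum(sign(x, y) for y in xs) for x in xs]

scores = u_scores([3, 1, 4, 1, 5])   # [0, -3, 2, -3, 4]
# The scores sum to zero, and u = 2r - (n + 1) holds for the midranks r.
```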

When several conditions are to be compared, the key principle to be observed is that of randomization. In most situations, randomization is the only practical strategy to ensure that the results seen can be causally linked to the interventions to be compared.

When observational units vary, it is often useful to reduce the effect of confounding variables through stratification. A special case of stratification is the sign test, when two interventions are to be compared within a stratum of two closely related subjects (e.g., twins). Here we will consider situations where more than two subjects form a stratum, e.g., where subjects are stratified by sex, race, education, or the presence of genetic or environmental risk factors. The fundamental concept underlying the sign test is easily generalized to situations where a stratum contains more than two observations.

Under *H*_{0}, the expectation of ${T}_{j}={\sum}_{k=1}^{{M}_{j}}\mathrm{u}\left({x}_{jk}\right)$ is zero. For the two-sample case, Mann and Whitney, in 1947[11], re-invented the ‘u-test’ (this time with the correct variance). As u-scores are a linear function of ranks (*u* = 2*r* − (*n* + 1)), their test is equivalent[47] to the rank-sum test proposed in 1945 by Wilcoxon[10]. The Wilcoxon–Mann/Whitney (WMW) test, in turn, is a special case of the 1952 Kruskal–Wallis (KW) rank test for >2 groups[12]. The notation used below will prove to be particularly useful for generalizing these results.

The test statistic *W*_{KW} is based on the sum of u-scores ${U}_{j}={\displaystyle {\sum}_{k=1}^{{M}_{j}}\mathrm{u}}\left({X}_{jk}\right)$ within each group *j* = 1, …, *p*. It can be computed as a ‘quadratic form’

$${W}_{\text{KW}}=\mathbf{U}'\,{\mathbf{V}}^{-}\mathbf{U}=\sum_{j,j'=1}^{p}{U}_{j}\,{v}_{jj'}^{-}\,{U}_{j'}\;\sim_{\text{as.}}\;{\chi}_{p-1}^{2}$$

(2.6)

where **V**^{−} is a generalized inverse of the variance–covariance matrix **V**. For the KW test, the matrix **V**, which describes the experimental design, and the variance ${s}_{0}^{2}$ among the scores are:

$${\mathbf{V}}_{\text{KW}}={s}_{0}^{2}\left(\begin{pmatrix}{M}_{1}&&\\&\ddots&\\&&{M}_{p}\end{pmatrix}-\begin{pmatrix}\frac{{M}_{1}{M}_{1}}{{M}_{+}}&\cdots&\frac{{M}_{1}{M}_{p}}{{M}_{+}}\\\vdots&\ddots&\vdots\\\frac{{M}_{p}{M}_{1}}{{M}_{+}}&\cdots&\frac{{M}_{p}{M}_{p}}{{M}_{+}}\end{pmatrix}\right),\quad{s}_{0}^{2}=\frac{1}{{M}_{+}-1}\begin{cases}\sum_{x=1}^{{M}_{+}}{\mathrm{u}}^{2}(x)&\text{unconditional}\\\sum_{j=1}^{p}\sum_{k=1}^{{M}_{j}}{\mathrm{u}}^{2}({X}_{jk})&\text{conditional on ties}\end{cases}$$

(2.7)

Traditionally, the WMW and KW tests have been presented with formulae designed to facilitate numerical computation at the expense of conceptual clarity:

$${\mathbf{V}}_{\text{KW}}^{-}=\frac{1}{{s}_{0}^{2}}\begin{pmatrix}\frac{1}{{M}_{1}}&&\\&\ddots&\\&&\frac{1}{{M}_{p}}\end{pmatrix},\quad{s}_{0}^{2}=\frac{{M}_{+}\left({M}_{+}+1\right)}{3}\begin{cases}1&\text{unconditional}\\1-\sum_{g=1}^{G}\left({W}_{g}^{3}-{W}_{g}\right)/\left({M}_{+}^{3}-{M}_{+}\right)&\text{conditional on ties}\end{cases}$$

where *W*_{g} indicates the size of the *g*-th group of tied observations, so that the test statistic becomes

$${W}_{\text{KW}}=\frac{3}{{M}_{+}({M}_{+}+1)}\sum_{j=1}^{p}\frac{{T}_{j}^{2}}{{M}_{j}}\;\sim_{\text{as.}}\;{\chi}_{p-1}^{2}.$$

With the now abundant computer power, one can shift the focus to the underlying design aspects and the features of the score function, represented in the left and right parts of (2.7), respectively. In particular, the form presented in (2.6) is more easily extended to other designs (below).
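The equivalence of the quadratic form (2.6) and the traditional computational formula is easily checked numerically; a sketch (Python, illustrative data; ties are absent here, so the unconditional variance applies):

```python
def kruskal_wallis(groups):
    """W_KW = 3 / (M+ (M+ + 1)) * sum_j T_j^2 / M_j, with T_j the sum of
    u-scores (computed over the pooled sample) within group j."""
    pooled = [x for g in groups for x in g]
    M = len(pooled)
    def u(x):
        return sum((y < x) - (y > x) for y in pooled)
    total = sum(sum(u(x) for x in g) ** 2 / len(g) for g in groups)
    return 3 * total / (M * (M + 1))

# Two completely separated groups of three observations each:
W = kruskal_wallis([[1, 2, 3], [4, 5, 6]])   # 27/7 ~ 3.86
```

Because *u* = 2*r* − (*M*₊ + 1), this agrees term by term with the textbook rank-based KW statistic.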

To extend u-tests to stratified designs, we add an index *i* = 1, …, *n* for the blocks to be analyzed. With this, the (unconditional) sign test can be formally rewritten as

$$\begin{array}{c}{\mathbf{V}}_{\text{ST}}=n{s}_{0}^{2}\left(\begin{pmatrix}1&0\\0&1\end{pmatrix}-\begin{pmatrix}1/2&1/2\\1/2&1/2\end{pmatrix}\right)=n{s}_{0}^{2}\left(\mathbf{I}-\tfrac{1}{2}\mathbf{J}\right),\quad{\mathbf{V}}_{\text{ST}}^{-}=\frac{1}{n{s}_{0}^{2}}\mathbf{I},\quad{s}_{0}^{2}=\sum_{j=1}^{2}{\mathrm{u}}^{2}(j)=2,\\{W}_{\text{ST}}=\frac{1}{2n}\sum_{j=1}^{2}{T}_{j}^{2}=\frac{1}{n}{\left({N}_{+}-{N}_{-}\right)}^{2}\;\sim_{\text{as.}}\;{\chi}_{1}^{2}\end{array}$$

Coming from a background in econometrics, Milton Friedman, winner of the 1976 Nobel prize in economics, had presented a similar test[48] in 1937, albeit for more than two conditions.

$$\begin{array}{c}{\mathbf{V}}_{\text{FM}}=n{s}_{0}^{2}\left(\begin{pmatrix}1&&\\&\ddots&\\&&1\end{pmatrix}-\begin{pmatrix}\frac{1}{p}&\cdots&\frac{1}{p}\\\vdots&\ddots&\vdots\\\frac{1}{p}&\cdots&\frac{1}{p}\end{pmatrix}\right)=n{s}_{0}^{2}\left(\mathbf{I}-\frac{1}{p}\mathbf{J}\right),\quad{\mathbf{V}}_{\text{FM}}^{-}=\frac{1}{n{s}_{0}^{2}}\mathbf{I},\quad{s}_{0}^{2}=\frac{1}{p-1}\sum_{j=1}^{p}{\mathrm{u}}^{2}(j)=\frac{p(p+1)}{3},\\{W}_{\text{FM}}=\frac{3}{np(p+1)}\sum_{j=1}^{p}{T}_{+j}^{2}\;\sim_{\text{as.}}\;{\chi}_{p-1}^{2}\end{array}$$

In either case, non-informative blocks are excluded with ‘conditioning on ties’.
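For tie-free data, the u-score form of the Friedman statistic agrees with the classic rank-based form. The following Python sketch (an illustration only, not the `muStat` implementation; the function names are our own) computes within-block u-scores and the resulting statistic directly:

```python
import numpy as np

def u_scores(row):
    # within-block u-score of each observation: (# smaller) - (# larger)
    row = np.asarray(row, float)
    return np.array([(row < x).sum() - (row > x).sum() for x in row])

def friedman_W(data):
    """W = 3/(n p (p+1)) * sum_j T_j^2, with T_j the column sums of
    within-block u-scores (no ties assumed); data is n blocks x p conditions."""
    data = np.asarray(data, float)
    n, p = data.shape
    T = sum(u_scores(row) for row in data)
    return 3.0 / (n * p * (p + 1)) * float(np.sum(T ** 2))
```

Since u(*j*) = 2*R*_{j} − (*p*+1) within a block, the column sums *T*_{+j} are shifted and scaled column rank sums, which is why the u-score and rank forms coincide.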

In the field of voting theory, the blocks represent voters and the u-scores are equivalent to the Borda counts, originally proposed in 1781[49], around the time of Arbuthnot’s contributions, although the concept of summing ranks across voters was already introduced by Ramon Llull (1232–1315)[50].

In the 1950s[51, 52], it was demonstrated that the Kruskal–Wallis[12] and Friedman[48] tests, and also the tests of Durbin[53] and Bradley–Terry[54], can be combined and extended by allowing for both several strata and several observations within each cell (combination of stratum and group).

However, when blocks represent populations of different size (proportionate to *m _{i}*, say) and/or have missing data, the within-block scores need to be weighted accordingly:

$$\mathrm{u}\left({x}_{ijk}\right)={\displaystyle {\sum}_{j\prime k\prime}\mathbf{I}}\left({x}_{ij\prime k\prime}<{x}_{ijk}\right)-{\displaystyle {\sum}_{j\prime k\prime}\mathbf{I}}\left({x}_{ij\prime k\prime}>{x}_{ijk}\right)\phantom{\rule{thinmathspace}{0ex}}\phantom{\rule{thinmathspace}{0ex}}{U}_{ij}=\frac{{m}_{i}+1}{{M}_{i}+1}{\displaystyle {\sum}_{k=1}^{{M}_{ij}}\mathrm{u}}\left({X}_{ijk}\right)$$

$$\mathbf{V}_i=s_{i0}^2\left(\begin{pmatrix}M_{i1}&&\\&\ddots&\\&&M_{ip}\end{pmatrix}-\begin{pmatrix}\frac{M_{i1}M_{i1}}{M_{i+}}&\cdots&\frac{M_{i1}M_{ip}}{M_{i+}}\\\vdots&\ddots&\vdots\\\frac{M_{ip}M_{i1}}{M_{i+}}&\cdots&\frac{M_{ip}M_{ip}}{M_{i+}}\end{pmatrix}\right),\quad s_{i0}^2=\frac{1}{M_{i+}-1}\left(\frac{m_{i+}+1}{M_{i+}+1}\right)^2\begin{cases}\sum_{x=1}^{M_{i+}}\mathrm{u}^2(x)&\text{unconditional}\\\sum_{j=1}^{p}\sum_{k=1}^{M_{ij}}\mathrm{u}^2(X_{ijk})&\text{conditional on ties}\end{cases}$$

(2.8)

In contrast to the above special cases, no generalized inverse with a closed form solution is known, so that one has to rely on a numerical solution:

$$W=\mathbf{U}_+'\,\mathbf{V}_+^-\,\mathbf{U}_+=\sum_{i=1}^{n}\sum_{j,j'=1}^{p}U_{ij}\,v_{jj'}^-\,U_{ij'}\ \sim_{as.}\ \chi_{p-1}^2$$

(2.9)
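A numerical evaluation of the quadratic form (2.9) can use the Moore–Penrose pseudoinverse; the helper below is a hypothetical sketch, not `muStat` code:

```python
import numpy as np

def w_statistic(U, V):
    """Quadratic form W = U' V^- U with the Moore-Penrose pseudoinverse;
    the rank of V gives the corresponding degrees of freedom."""
    U = np.asarray(U, float)
    V = np.asarray(V, float)
    return float(U @ np.linalg.pinv(V) @ U), np.linalg.matrix_rank(V)
```

For the special cases above, where $\mathbf{V}^- = \frac{1}{n s_0^2}\mathbf{I}$ acts on the contrast space, the pseudoinverse yields the same value of *W*.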

When the conditions to be compared are genotypes, a situation may arise, where the genotypes of some subjects are not known. Still, phenotype information from those subjects can be used when computing the scores by assigning these subjects to a pseudo group *j* = 0, which is used for the purpose of scoring only.

Genome-wide association studies (GWAS) often aim at finding a locus where the proportions of alleles differ between cases and controls. For most human single nucleotide polymorphisms (SNPs), only two alleles have been observed. Thus, the data can be organized in a 2×2 table,

|  | Control | Cases |  |
|---|---|---|---|
| Allele 0 | ${M}_{1}^{(0)}$ | ${M}_{2}^{(0)}$ | ${M}_{+}^{(0)}$ |
| Allele 1 | ${M}_{1}^{(1)}$ | ${M}_{2}^{(1)}$ | ${M}_{+}^{(1)}$ |
|  | ${M}_{1}$ | ${M}_{2}$ | ${M}_{+}$ |

While data for the sign test can also be arranged in a 2×2 table, the question here is not whether ${M}_{1}^{(1)}\ne {M}_{2}^{(0)}$, but whether ${M}_{1}^{(1)}/{M}_{1}\ne {M}_{2}^{(1)}/{M}_{2}$. Thus, the appropriate test here is not the sign test, but Fisher’s exact test or, asymptotically, the χ^{2} test for independence or homogeneity. This χ^{2} test is asymptotically equivalent to the WMW test for two response categories. As with the sign test, ties are considered ‘exact’ in genetics and, thus, the variance conditional on the ties is applied.

When the data are stratified, e.g., by sex, one obtains a 2×2 table for each block and the generalization of the χ^{2} test is known as the Cochran–Mantel–Haenszel (CMH) test, which can also be seen as a special case of the W-test

$$W_{\text{CMH}}=\frac{\left(\sum_{i=1}^{n}\frac{m_{i+}+1}{M_{i+}+1}M_{i1}M_{i2}\left(P_{i1}-P_{i2}\right)\right)^2}{\sum_{i=1}^{n}\left(\frac{m_{i+}+1}{M_{i+}+1}\right)^2\frac{M_{i+}^2}{M_{i+}-1}M_{i1}M_{i2}P_{i+}\left(1-P_{i+}\right)}=\frac{\left(\sum_{i=1}^{n}\frac{m_{i+}+1}{M_{i+}+1}\left(M_{i2}^{(1)}M_{i1}^{(0)}-M_{i1}^{(1)}M_{i2}^{(0)}\right)\right)^2}{\sum_{i=1}^{n}\left(\frac{m_{i+}+1}{M_{i+}+1}\right)^2\frac{1}{M_{i+}-1}M_{i+}^{(1)}M_{i+}^{(0)}M_{i1}M_{i2}}\ \sim_{as.}\ \chi_1^2$$
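Assuming complete data (so that *m*_{i+} = *M*_{i+} and all weights (*m*_{i+}+1)/(*M*_{i+}+1) equal 1), the second form of *W*_{CMH} can be computed directly from the stratum tables; an illustrative Python sketch, with function name of our own choosing:

```python
import numpy as np

def w_cmh(tables):
    """CMH-type W from 2x2 stratum tables [[M1(0), M2(0)], [M1(1), M2(1)]]
    (rows: allele 0/1, columns: controls/cases); assumes complete data,
    i.e., all weights (m+1)/(M+1) equal to 1."""
    num = den = 0.0
    for t in tables:
        (a, b), (c, d) = np.asarray(t, float)
        M = a + b + c + d
        num += d * a - c * b                                   # M2(1)M1(0) - M1(1)M2(0)
        den += (c + d) * (a + b) * (a + c) * (b + d) / (M - 1)
    return num ** 2 / den
```

For a single stratum this equals (*M*−1)/*M* times the Pearson χ^{2} for the 2×2 table.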

For most statistical applications, the degrees of freedom for χ^{2} tests are given by the rank of the variance-covariance matrix **V**. For genetic and genomic screening studies, however, the dimension of the variance-covariance matrix may vary depending on the number of groups present for a particular gene or locus. Clearly, it would be inappropriate in a screening study to decrease the degrees of freedom (df) for the χ^{2} distribution if one (or more) of the groups is (are) missing, but revert to the full df when a single observation for this group is available for another gene or locus.

Among the many obstacles that have prevented statistical methods based on u-scores from being used more widely is that they are traditionally presented as an often confusing hodgepodge of procedures, rather than as a unified approach backed by a comprehensive theory. For instance, the Wilcoxon rank sum and the Wilcoxon signed rank tests are easily confused. Both were published in the same paper[10], but they are not special cases of a more general approach. The Lam–Longnecker test[58], on the other hand, directly extends the WMW test to paired data. Finally, the Wilcoxon rank sum and the Mann–Whitney u test are equivalent[47], although they were developed independently from different theoretical approaches.

Above, we have demonstrated that several rank or u-tests can be presented as a quadratic form $W={\mathbf{U}}_{+}^{\prime}{\mathbf{V}}_{+}^{\phantom{\rule{thinmathspace}{0ex}}-}{\mathbf{U}}_{+}{~}_{as.}\text{}{\chi}_{p-1}^{2}$, where the variance-covariance matrix **V**_{+} reflects the various designs. The `muStat` package (available for both R and S-PLUS) takes advantage of this fact, both by providing the unified function `mu.test()` (in addition to replacement functions of the original R and S-PLUS functions, for compatibility) and by being based on a single copy of code (Fig. 2.8), although internally an optimized version may be used for special cases. (The case of censored data will be handled as a special case of multivariate data, see Section 4.2.1).

To ensure that p-values are comparable within a study (see Section 3.3.2), the function `mu.test()` allows the number of degrees of freedom to be fixed (`df>0`) or determined by the design matrix (`df=0`), in addition to being computed from the observed data (`df=-1`).

Few biological systems can be sufficiently characterized by a single variable only. A single measure often does not appropriately reflect the effect of all relevant genetic or environmental risk factors, clinical or epidemiological interventions, or personal preferences. Sometimes the definitive measure is not easily obtained, so that several surrogate measures need to be evaluated. At other times, e.g., when assessing a complex syndrome or a chronic disease, a definitive measure may not even exist.

Still, most statistical methods for multivariate data are based on the (generalized) linear model, either explicitly, as in regression, factor, discriminant, and cluster analysis, or implicitly, as in neural networks. One scores each variable individually on a comparable scale, either present/absent, low/intermediate/high, 1–10, or z-transformation, and then defines a global score as the linear combination (weighted average) of these scores. Thus, it is assumed that it is known how to transform each variable to a common scale, so that a weighted average of these transformed variables can be meaningfully interpreted and that these weights are constant.

The linear model became popular mainly because its mathematical elegance led to computational efficiency and parameters of alluring simplicity. When applied to real-world data, however, this approach may have shortcomings, because biological systems are typically regulated by various, often unknown, feedback loops, so that the functional form of relationships between measurement and activity or efficacy is typically unknown, except within narrowly controlled experiments. Since the relative importance of the variables, the correlation among them, and the functional relationship of each variable with the immeasurable latent factor ‘safety’, ‘activity’, or ‘effectiveness’ are typically unknown, construct validity[59] cannot be established on theoretical grounds, and one needs to resort to empirical ‘validation’, choosing weights and functions to provide a reasonable fit with a ‘gold standard’ when applied to a sample, a process of questionable validity by itself[60]. The Delphi oracle, where women intoxicated by fumes predicted the future, was often difficult to interpret. The ‘Delphi method’[61] approach to scoring systems, where weights and functions are agreed upon by a group of experts, may facilitate comparison between studies, yet comparability along a scale of questionable validity may still yield questionable results. The diversity of scoring systems in use attests to the subjective nature of this process.

Non-parametric methods, in general, are designed for ordinal data, where a one-unit difference may not carry the same ‘meaning’ across the range of possible values[62], and, thus, avoid artifacts created by making unrealistic assumptions. Consequently, non-parametric methods are particularly well suited for analyses in human phenomics[63]. The marginal likelihood principle (MrgL) provides a general framework to extend rank tests to missing data[33, 64–66]. In 1992, it was demonstrated that MrgL procedures for censored data can be generalized to cover multivariate ordinal observations[67], in general. While this approach proved eminently useful[68–72], the computational effort was prohibitive.

Drawing on the analogy of the Wilcoxon rank sum[10] (MrgL) with the Mann–Whitney test[73] (u-statistics), a 1914 algorithm[44] yielded a computationally more efficient approach, which can easily be extended to more complex partial orderings[62, 74] and designs[31, 33, 75]. Below, it will be demonstrated how µ-scores (u-scores for multivariate data) cover situations where a one-unit difference may carry a different ‘meaning’ across variables[76] and, thus, can integrate information even when the events counted are incomparable or the variables’ scales differ, as long as each variable has the same ‘orientation’.

We will now extend the notation introduced in Section 3.1 (using *i* = 1, …, *n* for blocks, *j* = 1, …, *p* for groups, and *k* = 1, …, *m _{ij}* for replications) by adding an index ℓ = 1, …, *L* for the variables. Each subject’s profile of observations

$$\left\{x_{jk}=\left(x_{jk1},\dots,x_{jkL}\right)'\right\}_{j=1,\dots,p;\;k=1,\dots,m_j}$$

is compared to every other subject in a pairwise manner. For stratified designs, these comparisons will be made within each stratum only.

When the observed outcomes can be assumed to be correlated with an unobservable latent factor, a partial ordering[67] among the patients can easily be defined. If the second of two patients has values higher in at least one variable and at least as high in all variables ℓ = 1, …, *L* (or, equivalently, lower in none), it will be called ‘superior’:

$$x_{jk}<x_{j'k'}\iff\left(\forall_{\ell=1,\dots,L}\;x_{jk\ell}\le x_{j'k'\ell}\right)\wedge\left(\exists_{\ell=1,\dots,L}\;x_{jk\ell}<x_{j'k'\ell}\right)$$

(3.1)

Many partial orderings can be defined, which, by definition, are transitive: (*a* < *b*) ∧ (*b* < *c*) ⇒ (*a* < *c*). Orderings like (3.1), which treat all variables equally, are called ‘regular’. The partial ordering for an (interval) censored variable is just one example. Several more examples will be given below.

Even though a partial ordering does not guarantee that all patients can be ordered on a pairwise basis, all patients can be scored. One can assign a score in exactly the same fashion as described in (2.5), using the partial order (3.1) instead of the simple univariate order. By definition, µ-scores are ‘intrinsically valid’, i.e., independent of the choice of (non-zero) weights and (monotone) transformations assigned to the variables.
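A brute-force sketch of µ-scoring under the weak partial order (3.1) may clarify the construction; this is an O(n²L) illustration with names of our own choosing, not the `muStat` implementation:

```python
import numpy as np

def weakly_less(a, b):
    # a < b under (3.1): no variable higher, at least one variable lower
    return bool(np.all(a <= b) and np.any(a < b))

def mu_scores(X):
    """mu-score of each subject: (# subjects it is superior to) minus
    (# subjects it is inferior to), cf. (2.5) with the order (3.1)."""
    X = np.asarray(X, float)
    return np.array([sum(weakly_less(y, x) for y in X)
                     - sum(weakly_less(x, y) for y in X) for x in X])
```

With a single variable this reduces to the univariate u-scores; with several variables, incomparable pairs simply do not contribute.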

Many phenotypes are ‘censored’. A typical case is ‘survival’, where some subjects may still be alive at the end of a study. Similarly, a subject may be surveyed as not yet having cancer. Other examples are events that are not directly observable, like the recurrence of cancer, where often only the last date with a negative and the first date with a positive result are known. With genetic studies, censored information arises, for instance, when the prevalence of a disease that develops over time is observed in subjects that vary in age, such as in late-onset diabetes, cardiovascular diseases, cancer, etc. For such studies, a negative observation merely means that this subject has not yet developed the disease, rather than that the subject is ‘immune’.

In 1965, Gehan[77, 78] demonstrated how u-scores can be applied to censored (including interval-censored) variables (see Fig. 2.9). Assuming that the variables represent the last date negative (LDN) and the first date positive (FDP), subject A experiences the event under investigation ‘later’ than subject B if LDN(A) > FDP(B).

In clinical or epidemiological research, such data may arise when the exact date of an event, e.g., infection or recurrence, is not known, but the event is known to have happened between the date of the last negative test *x*_{jk1} and the date of the first positive test *x*_{jk2}. Right censored data are a special case (*x*_{jk2} = *x*_{jk1}: event, *x*_{jk2} = ∞: censoring). Thus, pairs of intervals can be ordered, if they do not overlap. For left- and right-censored observations, LDN and FDP are −∞ and +∞, respectively.

Tests for censored data are often presented in an equivalent form, where the first variable is the time point, and the second is the ‘censoring indicator’. The above representation, however, is more easily generalized, e.g., to ‘interval censored’ data, where *x*_{jk1} < *x*_{jk2} < ∞.
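Under the (LDN, FDP) representation, the Gehan ordering and the resulting u-scores take only a few lines; a minimal sketch (illustrative helper names, not `muStat` code), with ±∞ encoding left/right censoring as in the text:

```python
import math

def later(a, b):
    # subject a = (LDN_a, FDP_a) had the event later than b iff LDN_a > FDP_b
    return a[0] > b[1]

def gehan_scores(subjects):
    # u-score: (# subjects with earlier events) - (# subjects with later events)
    return [sum(later(a, b) for b in subjects)
            - sum(later(b, a) for b in subjects) for a in subjects]
```

Overlapping intervals are simply left undecided and contribute zero to both counts.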

To further clarify the relation between censored and multivariate data, it is convenient to consider the most general case, interval-censored observations. As described above, pairs of intervals can be ordered if they do not overlap or, equivalently, if both time points in one subject are earlier than both time points in the other subject.

Schemper[79] generalized the Gehan test to more than two groups. Also, from the work of Hoeffding[73] and Lehmann[80], u-scores for censored data can be analyzed for all the stratified designs mentioned above.

If only a single variable needs to be considered (*L* = 1) and all observations are different, the order is complete (Fig. 2.10a). If identical observations (ties) are present, two cases need to be considered, as with the sign test, above: ties may be due to the underlying phenomena, or they may be caused by discretization or by observing a discrete surrogate variable for a continuous phenomenon. In both cases, there are three possibilities for each pair of patients: in the former case ‘<’, ‘>’, or ‘=’ (Fig. 2.10b); in the latter, where ties reflect some ambiguity[29], ‘<’, ‘>’, or ‘?’ (Fig. 2.10c). Intervals, however, can only be ordered if they are disjoint, so that some paired comparisons may be undetermined. In Fig. 2.10d, for instance, it is not known if the patient infected between the first and the third follow-up visit (1..3) was infected earlier than the patient infected between the second and the third visit (2..3). The same rationale applies to situations with several (*L* > 1) variables (Fig. 2.10e). For multivariate data, the order between two subjects is undetermined if some variables are higher in one subject while others are higher in the other.

From Fig. 2.10d and e, the Gehan test[77, 78] can be easily extended to multivariate data, using the approach originally suggested by Hoeffding[81], because it relies on the existence of a partial order only, irrespective of how that partial ordering was created.

As an alternative to the ‘weak’ partial ordering (3.1), one might require its ‘strong’ cousin [82]

$$x_{jk}\ll x_{j'k'}\iff\forall_{\ell=1,\dots,L}\;x_{jk\ell}<x_{j'k'\ell}$$

(3.2)

At first sight, the strong order (3.2) may seem more appropriate for discretized variables, because the true order of the observations discretized into a tie is unknown. If each variable is presumed to be a surrogate for the same latent factor, however, the strong partial order highlights a potential downside of µ-scores, namely that the number of ambiguous paired comparisons increases with the number of variables, unless the variables are highly correlated. As a result, a larger sample may be needed to achieve the desired power. Allowing for ties to be broken by observations in other variables again yields the weak partial ordering. Thus, the weak regular partial ordering (3.1) will be called ‘natural’ for applications where each variable can be assumed to be a surrogate for the same underlying latent factor.

Formally, we will define the proportion of paired comparisons with subject A that can be decided as the ‘information content’ of subject A’s score. Unless the variables are highly correlated, so that adding more variables merely breaks ties, information content often decreases with the number of variables and the rigidity of the partial ordering. On the other hand, the more information about the underlying models is available to justify combining several variables into a linear combination, the more paired comparisons can be decided.

Below, we will discuss two non-parametric approaches to increase information content by reflecting prior knowledge. The first is to transform the data using less stringent assumptions than those imposed by the linear model. Then, we will demonstrate how allowing for hierarchical structures among variables can increase information content when variables are known to belong to different factors (e.g., safety and efficacy) or sub-factors[45, 83]. In turn, identifying structures that maximize information content then leads to a non-parametric alternative to traditional factor analyses, including more efficient questionnaires.

Deuchler’s[44] univariate algorithm to depict a complete ordering as a symmetric matrix is easily extended to partial and imperfect (non-transitive) orderings for data profiles comprising several variables. Morales *et al.*[45] separated this process into two steps, as outlined in Fig. 2.11. First, Deuchler’s[44] univariate paired comparisons are represented as the matrices $U^{(\ell)}$ (middle row), with $u_{kk'}^{(\ell)}=0$ if $x_{k}^{(\ell)}$ or $x_{k'}^{(\ell)}$ are missing and $u_{kk'}^{(\ell)}=\mathrm{I}(x_{k}^{(\ell)}>x_{k'}^{(\ell)})-\mathrm{I}(x_{k}^{(\ell)}<x_{k'}^{(\ell)})$ otherwise[76], extending Deuchler’s[44] algorithm to imperfect (not necessarily partial) orderings allowing for missing data.

The matrix *U* obtained by the ‘AND’ operation

$$U=\bigwedge_{\ell}U^{(\ell)}=\left(\left(u_{kk'}\right)\right),\quad\text{where}\quad u_{kk'}=\begin{cases}1&\exists\ell: u_{kk'}^{(\ell)}=1\ \wedge\ \forall\ell: u_{kk'}^{(\ell)}\ne -1\\0&\forall\ell: u_{kk'}^{(\ell)}=0\\-1&\exists\ell: u_{kk'}^{(\ell)}=-1\ \wedge\ \forall\ell: u_{kk'}^{(\ell)}\ne 1\\?&\text{otherwise}\end{cases}$$

with ‘?’ indicating ambiguity, is the same as the matrix one would obtain by applying the imperfect ordering defined earlier[76] and, thus, the scores obtained from $U=\bigwedge_{\ell}U^{(\ell)}$ (bottom of Fig. 2.11) are the non-hierarchical µ-scores.
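The two-step construction (univariate paired-comparison matrices, then the ‘AND’ combination) can be sketched as follows; the helper names and the `AMBIGUOUS` sentinel are our own, illustrating the logic of `mu.PwO`/`mu.AND` rather than their actual code:

```python
import numpy as np

AMBIGUOUS = 9  # sentinel for '?'

def pwo(x):
    # univariate paired-comparison matrix: u_kk' = I(x_k > x_k') - I(x_k < x_k')
    x = np.asarray(x, float)
    return np.sign(x[:, None] - x[None, :]).astype(int)

def mu_and(mats):
    """'AND' combination: +1 if some +1 and no -1; -1 if some -1 and no +1;
    0 if all comparisons are 0; '?' (AMBIGUOUS) otherwise."""
    mats = np.asarray(mats)
    pos, neg = (mats == 1).any(axis=0), (mats == -1).any(axis=0)
    return np.where(pos & ~neg, 1,
           np.where(neg & ~pos, -1,
           np.where(~pos & ~neg, 0, AMBIGUOUS)))

def mu_sum(U):
    # mu-scores: row sums, counting ambiguous comparisons as 0
    return np.where(U == AMBIGUOUS, 0, U).sum(axis=1)
```

Subjects whose variables disagree in direction yield ambiguous entries, which do not contribute to the scores.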

Recently, Cherchye and Vermeulen[84] proposed to replace the matrix of paired comparisons, which has entries ‘+1’, ‘0’, ‘?’, and ‘−1’, by a computationally simpler ‘greater-equal’ (GE) matrix of binary entries $u_{kk'}=\mathrm{I}(x_{k}\ge x_{k'})$.

When comparing the severity of ‘damage’ seen in ultrasound or radiologic images[85], for instance, it may not be possible to define all relevant variables. In these situations, one might present a rater with pairs of images (A, B) to be judged as A < B, A > B, or A = B. As in the univariate case (see Section 3.1.2), an interactive system could then reduce the number of questions to be asked by inferring which paired comparisons can be implied from the answers already given. If the number of pairs each rater can be presented is to be limited, the unanswered questions would be considered ambiguous.

In the `muStat` package (for R and S-PLUS, available from http://cran.r-project.org and http://csan.insightful.com, respectively), these steps are implemented as the functions `mu.PwO` and `mu.AND`, followed by computation of scores and information content (Fig. 2.12).

Using the asymptotic results of Hoeffding[73], the resulting µ-scores can then be analyzed with `mu.test` (Fig. 2.8) or through bootstrapping [86].

Often, a simple transformation of the variables may suffice to reflect additional knowledge. For graded variables, where one unit impacts less in a ‘lower grade’ than in a ‘higher grade’ variable[74, 84], one can split each value of variable (ℓ) (sorted by grades) into the value of the lowest grade variable $\Delta_1$ and incremental values of the higher grade variables $\Delta_{\ell'}$, ℓ′ = 2, …, ℓ.

$$\begin{array}{l}x_{k,(1)}=x_{k,(1)}\Delta_1\\x_{k,(2)}=x_{k,(2)}\Delta_1+x_{k,(2)}\Delta_2\\\dots\\x_{k,(L)}=x_{k,(L)}\Delta_1+x_{k,(L)}\Delta_2+\dots+x_{k,(L)}\Delta_L\end{array}$$

Thus, the profile of counts sorted by grade (*x*_{k,(1)}, *x*_{k,(2)}, …, *x*_{k,(L)}) can be expressed through the cumulative column sums (*x*_{k,(≥1)}, *x*_{k,(≥2)}, …, *x*_{k,(≥L)}), where $x_{k,(\ge\ell)}=\sum_{\ell'\ge\ell}x_{k,(\ell')}$. The partial ordering for graded variables

$$\left(x_{k,(1)},\dots,x_{k,(L)}\right)<\left(x_{k',(1)},\dots,x_{k',(L)}\right)\iff\left(\forall_{\ell=1,\dots,L}\;x_{k,(\ge\ell)}\le x_{k',(\ge\ell)}\right)\wedge\left(\exists_{\ell}\;x_{k,(\ge\ell)}<x_{k',(\ge\ell)}\right)$$

(3.3)

is equivalent to the regular ordering (3.1) applied to the cumulative variables *x*_{k,(≥)}.
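The cumulation step itself is a reverse cumulative sum over grades; a minimal sketch (assuming counts are ordered from lowest to highest grade), after which the regular ordering (3.1) applies unchanged:

```python
import numpy as np

def cumulate_grades(X):
    # x_(>=l) = sum of counts over grades l' >= l (reverse cumulative sum),
    # cf. the cumulative variables in (3.3); X is subjects x grades
    X = np.asarray(X, float)
    return X[:, ::-1].cumsum(axis=1)[:, ::-1]
```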

Although each profile’s outcomes are decomposed into additive components Δ_{}, substantially weaker assumptions are made than with linear weight (lw) scores, because the additive components can be unknown and may even differ between pairs, noting that for subjects far apart (subject *k* is lower than subject *k*' for each of the variables) or incomparable (some variables higher for subject *k* and some variables higher for subject *k*') the weights are irrelevant. Thus, the weights only need to be ‘locally similar’, rather than ‘globally constant’.

If the variables are related to different ‘factors’ and the order between subjects A and B is ambiguous with respect to variables related to one factor (e.g., genetics), unambiguous results with respect to another factor (e.g., environment) can ‘overwrite’ this ambiguity. The advantage of creating the matrices reflecting the univariate orderings first (`mu.PwO`) and combining them in a separate step (`mu.AND`) before computing the scores (`mu.Sum`) is that incorporating knowledge about the sub-factor hierarchy through hierarchically combining the matrices typically reduces loss of information content[45] (the number of unambiguous paired comparisons contributing to a score).

If the variables X1 and X2 in Fig. 2.11 were related to one ‘factor’, while the variables Y1 and Y2 are related to another ‘factor’, one could replace $U_{\text{NH}}=\bigwedge_{\ell}U^{(\ell)}$ by $U_{\text{H}}=\bigwedge\left\{\bigwedge_{\ell\in\{X1,X2\}}U^{(\ell)},\ \bigwedge_{\ell\in\{Y1,Y2\}}U^{(\ell)}\right\}$. Fig. 2.13 demonstrates how reflecting this structure increases information content by resolving the ambiguity related to comparing A vs. D.

Fig. 2.13 also shows that reflecting more hierarchical information can never decrease and typically increases information content, because it reduces the effect of ‘noise’ that may have caused paired comparisons within a factor to be ambiguous. If all ambiguities are resolved, the µ-scores become ranks, which are uniformly spaced across the widest possible range.

Since Gehan, u-scores have been used to handle singly or interval-censored data, where only the last date the subject is known to have been negative (LDN) and the first date the subject is known to have been positive (FDP) are available; subject A experiences the event under investigation ‘later’ than subject B if LDN(A) > FDP(B), with LDN and FDP set to −∞ and +∞ for left- and right-censored observations, respectively. µ-Scores, consequently, can easily handle censored (including interval-censored) variables (Fig. 2.14). Adding a hierarchical ordering to interval-censored data also provides a solution to analyzing doubly interval-censored data, where, for instance, both the date of exposure and the date of disease manifestation are interval censored (Fig. 2.15).

With SNP arrays, which can have an even larger number of variables (currently >100K) than expression arrays, the known sequence of the SNPs on the chromosome (ordinal structure) can be utilized to reduce complexity. From Fig. 2.16, a disease locus in close proximity to a particular SNP (e.g., SNP A) is likely to be highly correlated (in linkage disequilibrium, LD) with this SNP.

A disease locus approximately equidistant to two adjacent loci (SNP C and D) may be in LD with both of the adjacent SNPs. Traditionally, one would look at the plot of the univariate −log_{10} (*p*) by locus and give more confidence to a locus, if several neighboring loci are also in LD. With µ-scores, one can now directly aggregate evidence from neighboring SNPs by assuming that each chromosome consists of intervals around and between SNPs. (Note that we will not require the demarcations between these intervals to be known.) In particular, we will treat each genetic interval as two variables, each contributing to the same disease locus as ‘factor’ (either side of Fig. 2.13).

With high density SNPs, several adjacent intervals may form diplotypes in LD with a single disease locus, the most informative diplotypes being determined by the LD/noise ratio.

Epistasis is defined as an interaction between diplotypes that is associated with a phenotype. Within muStat, epistasis is indicated as a hierarchical structure among diplotypes (Fig. 2.17).

While u-statistics for univariate data[11, 18] and censored data[77–79] are widely used, u-statistics for multivariate data[73, 80] are rarely applied, presumably because they were not presented in an easy-to-use form and, at the time, transistors had only just been put to practical use[87]. A more fundamental problem, however, may have been an even more important hindrance. As the number of variables increases, information content (the proportion of paired comparisons that can be decided) drops fast, until all µ-scores (u-scores for multivariate data) become NA, especially with a ‘strong’ order[82], as compared to its ‘weak’ counterpart[67]. As pointed out recently[88], averaging univariate u-scores[89] or using the lexicographical order avoids this problem, yet requires that the relative importance of the variables be constant and known or that less important variables contribute only by breaking ties.

Two recently developed extensions of u-statistics resolve this conundrum. First, the process of scoring multivariate data was separated into determining the univariate orders first[44] and then combining them[45] into an incomplete order. As the proposed combination of incomplete orders results in yet another incomplete order, a ‘tree’ of incomplete orders can be defined, wherein hierarchical (sub-)factor structures can reflect functional, topological, or temporal relationships. Conversely, ‘µ-factor analysis’ can identify hierarchical factor structures that maximize information content[90]. Second, cumulation of count variables[62] allows for ‘graded’ variables (e.g., ‘mild’ < ‘moderate’ < ‘severe’) but avoids the infinite relative weights implicitly assigned with a lexicographical order. µ-Scores can be used for various analyses, including testing differences between groups defined by simple genotypes with respect to complex phenotypes, correlating complex genotypes with complex phenotypes, or identifying genetic variables that explain (correlate best with) a complex phenotype.

Once GE matrices (or scores) are created, further strategies can be employed to incorporate design or model knowledge. For instance, if variables can be ordered, but the outcomes cannot be cumulated (as in Section 4.3.1), Kendall’s[91] correlation coefficient *r _{K}* or, equivalently, the Jonckheere–Terpstra test[92, 93] could be used.

For paired observations, the Wilcoxon signed rank test[10] (WSR) is often proposed as a ‘non-parametric’ alternative to the paired t-test. Since the WSR is based on the arithmetic difference between the paired observations, however, the results are not independent of scale transformations. The Lam–Longnecker test[58], instead, is directly based on the WMW test, except that the variance is reduced by a factor 1 − *r _{S}*, where *r _{S}* is Spearman’s rank correlation between the paired observations.

The focus of many traditional methods is on ‘correlation’ as an indicator of ‘coregulation’, i.e., of variables along the same ‘pathway’. Often, however, different subjects may respond along different pathways, i.e., several pathways may be able to “share the load”[94]. Then, including several variables (or pathways) may provide better discrimination between categories than using individual variables, until the variables included begin to model noise, in which case the information content of the resulting scores would begin to decline (see Section 4.2.3).

µ-Scores enable a paradigm shift by introducing a novel concept of interaction. With the linear model, finding the best discriminating function was often futile, because the choice of the best discriminating set depended on the linearizing transformations and the relative weights chosen. Selecting one among the many ‘best solutions’ generated by the various combinations of subjective transformations and weights was difficult at best. Moreover, the mere assumption that the relative importance of two variables would not depend on the magnitude of these and other variables could easily bias the results. µ-Scores are independent of (monotone) transformations and (positive) weights, so that solutions are less dependent on the assumptions made and, thus, allow, for the first time, a new type of question to be approached. Instead of focusing on ‘coregulated’ variables within a pathway, one could search for variables representing ‘collaborating’ pathways.

This paradigm shift has direct implications for applications in systems biology. In Spangler *et al.*[94], who compared activity of dopaminergic receptors between rats addicted to sugar and control rats, the response in the CPU (Fig. 2.18, bottom) was clearly more ‘coordinated’ than in the NAC (Fig. 2.18, top). In the NAC, bi-variate profiles including D3 and either of D1, D2, pD, or pT discriminated better than their components, and tri-variate profiles containing both D3 and pT discriminated even better (p < .0001). In the CPU, in contrast, D2 (p = 0.008) discriminated better than any combination.

A recent study comparing patients with Fanconi anemia type C (FA-C) to normal controls[96] identified in a first step 200 differentially expressed genes in uni-variate comparisons (significance indicated by the size of the nodes in Fig. 2.19). These genes showed some clustering, albeit without hinting at any suggestive interpretation. In multivariate analyses, the genes AURKA and RRM2 emerged as members of many of the most significant pairs and triplets, respectively. Both genes are targets of drugs approved for cancer treatment, suggesting that these drugs might become the first treatments for FA-C patients.

Thus, adding cooperation as a novel concept in systems biology opens an additional dimension for the interpretation of functional relationships between molecular variables. By integrating the `muStat` output with systems such as Cytoscape (http://www.cytoscape.org/), collaborative relationships are easily visualized.

After the human genome was decoded in 2003, it was widely expected that medicine would enter a new era of personalized medicine, in which common diseases, such as hypertension, obesity, and cancer, could be treated successfully. Many of the early results, however, could not be confirmed in subsequent studies.

Fig. 2.20 exemplifies the problems experienced. Traditionally, each SNP is analyzed using the Cochran–Armitage χ^{2} test for trend[97] with weights (1,1,0), (1,2,3), and (0,1,1) for dominant, additive, and recessive effects, respectively (the 2×2 ‘allelic’ test based on counts of alleles is less appropriate[32] for the same reason as the TDT, see Section 2.2.1, and is thus omitted). To the dismay of investigators, even the most significant loci often turned out to be false positives, while important loci were often overlooked.
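As a sketch of this traditional per-SNP analysis, the 1-df Cochran–Armitage trend statistic can be computed as N·r², where r is the Pearson correlation between the weighted genotype score and case status. The example below uses simulated genotypes; it is a minimal illustration, not the analysis pipeline behind Fig. 2.20.

```python
import numpy as np
from scipy.stats import chi2

def ca_trend_pvalue(genotype, case, weights):
    """Cochran-Armitage test for trend: the 1-df chi-square statistic
    equals N * r^2, with r the Pearson correlation between the weighted
    genotype score and the binary case indicator."""
    x = np.asarray(weights, dtype=float)[np.asarray(genotype)]
    y = np.asarray(case, dtype=float)
    r = np.corrcoef(x, y)[0, 1]
    return chi2.sf(len(x) * r**2, df=1)

# hypothetical data: genotype = one of three genotype classes (0, 1, 2)
rng = np.random.default_rng(1)
genotype = rng.integers(0, 3, size=200)
case = (genotype + rng.normal(0, 1.5, size=200) > 1.5).astype(int)

for name, w in [("dominant", (1, 1, 0)), ("additive", (1, 2, 3)),
                ("recessive", (0, 1, 1))]:
    print(name, ca_trend_pvalue(genotype, case, w))
```

Because the statistic depends on the weights only through the squared correlation, affinely rescaled weight vectors (e.g., (1,2,3) vs (0,1,2)) give identical p-values.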

In Fig. 2.20, for instance, 177 children with Childhood Absence Epilepsy (CAE) are compared against three sets of 354 controls, matched against different subsets of Ancestry Informative Markers (AIMs). Not surprisingly, the traditional and the µ-score analyses yield similar results. Only one SNP has at least one max(χ^{2}) / μ p-value below 10^{−3}. Hence, this locus would not have drawn any attention.

With µ-scores for diplotypes, in contrast, a median p-value below 10^{−5} (rare for such a small sample size) and the regular triangular pattern of shading seen in Fig. 2.20 clearly stand out. One locus implicated (10^{−5.0}–10^{−6.4}) is close to the SNP that would have been ignored with traditional methods; the other is even more “significant” (10^{−5.2}–10^{−6.5}). The former “significance triangle” points at a single exon and the latter at the promoter region of a splice variant that is expressed in embryonal brain only and is involved in neuronal development, making it a strong candidate gene.

GWAS is primarily a screening strategy to generate hypotheses, which then need to be studied further. As a criterion in a selection procedure[98] (balancing not the risk of false positive results, but the number of loci selected against the risk of overlooking the most relevant loci), rather than as an indicator of confidence in a particular result, p-values must not be judged by their absolute value (which is mainly a function of sample size). Instead, one may give high priority for further investigation to those hypotheses that have both low p-values and high biological plausibility.

Given the high “false positive rate” of the early GWAS, sample size requirements have since been increased to thousands or even tens of thousands of cases. As an alternative, µ-scores make it possible to increase the signal/noise ratio by increasing the number of neighboring SNPs that are analyzed comprehensively. As demonstrated in Fig. 2.20, investing more computational effort into better statistical methods can reduce the sample size requirements for GWAS substantially, even for common, complex diseases, such as CAE.

Fig. 2.21 shows how representing genetic structure increases sensitivity for detecting association, in general, and epistasis, in particular. Among the 13 chromosomes selected, univariate analysis (bottom border of both diagrams) points to a short range at the beginning of Chr 3 and, possibly, Chr 12 as a whole. Moving from SNPs to diplotypes by itself (top border of right diagram) merely shifts the focus on Chr 3 slightly higher. Epistasis between SNPs, as indicated in the square areas, suggests that the range identified on Chr 3 might interact with Chr 11 as a whole, and also suggests epistasis between Chr 10 and Chr 11. Allowing for epistasis between diplotypes confirms these results and, in addition, points to ranges on Chr 5 as interacting with ranges on Chr 3, 4, 11, and 12. Moreover, one now sees that the SNPs on several chromosomes separate into ‘clusters’ interspersed with SNPs that seem to be unrelated.

The above examples tried to explain relatively simple phenotypes, such as addiction to sugar (Fig. 2.18), presence vs absence of a disease (Fig. 2.19), or degree of atherosclerosis in mice (Fig. 2.21). Even with some ‘monogenic’ diseases, such as Fanconi anemia, however, the phenotype involved can be anything but simple. With FA, the phenotype contains various congenital malformations (often binary), life span, time of cancer manifestation or hematological failure (each of them often censored), and laboratory measurements of chromosomal stability (mostly quantitative) (Fig. 2.22 top).

Being able to increase information content by introducing a hierarchical structure offers the opportunity to reverse the goals, i.e., to find the hierarchical structure for a given set of variables that maximizes information content. Diana *et al.*[90], for instance, compared three different models (hierarchical structures) with respect to the information content of the resulting scores.

Clearly, an exhaustive search among all possible hierarchical factor structures is not computationally feasible, even for relatively small numbers of variables, unless a limited number (up to a few thousand) is preselected. In Diana *et al.*[90], this approach was used to decide whether commuters categorize transportation modes by:

- **C1: Constraint:** bicycle < motorcycle < car (driver, pax) < transit (bus, tram, metro)
- **C2: Typology:** two-wheeled (bicycle, motorcycle), car (driver, pax), transit (bus, tram, metro)
- **C3: Autonomy:** passenger (car pax, bus, tram, metro) < driver (bicycle, motorcycle, car driver).

In this case, the data suggested autonomy as the concept most relevant for individual decision making. In molecular biology, one might easily use the same approach as an alternative to traditional factor (or ‘cluster’) analysis.

Fig. 2.23 shows another possible use of this method, namely to classify objects with respect to how much they gain from the use of the hierarchy. With RDM, in particular, it is clear that the hierarchical and non-hierarchical µ-scores are essentially the same, but there exists a sub-population whose scores are improved by using the hierarchy.

A frequent problem in molecular biology is to find a molecular ‘signature’ that discriminates between diseases that present with similar phenotypes, but require different therapeutic interventions. Better targeted diagnostics could improve patient health by avoiding a trial-and-error approach.

In a study of 33 cellular surface markers and six plasma biomarkers as diagnostic markers for two hemophagocytic syndromes[99], univariate u-tests discriminated poorly, with 30.0%, 21.0%, 30.2%, 37.7%, 38.5%, and 31.8% misclassifications, respectively. With µ-scores, the subset of biomarkers that discriminated best in this population (p=10^{−15}, 9.4% misclassifications) consisted of two plasma and two surface markers.

In contrast to traditional methods, discrimination based on µ-scores does not necessarily improve when more variables are added. In fact, less significance with more variables may indicate that noise is being modeled. Moreover, µ-scores are independent of (monotone) transformations and (positive) weights. Thus, neither the selection of the variables nor the choice of transformations or weights needs to be ‘validated’.

For complex diseases, finding signatures that can be applied to the population as a whole has proven difficult, if not impossible[100], in part because different pathways tend to be involved in different sub-populations. As a signature based on µ-scores is ‘intrinsically valid’ for the population in which it is developed (requiring validation for neither the number of variables chosen nor the transformations applied to them), µ-scores may provide a solution to this conundrum. Traditionally, when the same ‘signature’ is used for a wide range of patients, the representativity of the population of cases for the subjects to be diagnosed needs to be ‘validated’. When, however, a signature is developed from a sub-population of cases specifically selected to match the subject on criteria considered relevant by the treating physician (a process that could itself involve µ-scores for similarity), the need to ‘validate’ representativity is substantially lessened, so that a personalized signature can be generated *ad hoc*. Most importantly, such a sub-population of cases is likely to be more homogeneous with respect to the risk factors involved, so that highly predictive signatures will be easier to identify.

In the muStat package for R and S-PLUS, the function `mu.PwO` generates Deuchler’s univariate orderings, `mu.AND` combines them into an incomplete ordering, and `mu.Sums` computes scores and weights from an incomplete ordering (Fig. 2.12). The pseudo-code below demonstrates the calculation of the hierarchical (UH) and the non-hierarchical (UNH) µ-scores of Fig. 2.22. The parentheses indicate the hierarchy, and variables separated by a colon indicate the bounds of interval-censored observations.

When using the elementary functions of Fig. 2.12, the statements `mu.Score(x,frml)` and `mu.Sums(mu.AND(mu.PwO(x,frml), frml))$score` are equivalent to the following pseudo-code:
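The three-step pipeline can also be sketched in Python. The sketch below is a simplified illustration of the concepts (pairwise orderings per variable, an AND-combination into a partial ordering, and row/column sums giving scores and weights), not the muStat API; the function names, the bivariate layout, and the data are hypothetical.

```python
import numpy as np

def pwo(lo, hi):
    """Pairwise ordering for one interval-censored variable; exactly observed
    values have lo == hi. ge[i, j]: obs i is known to be >= obs j;
    gt[i, j]: obs i is known to be strictly greater."""
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    return lo[:, None] >= hi[None, :], lo[:, None] > hi[None, :]

def mu_and(ge_list, gt_list):
    """Combine univariate orderings into a partial ordering: i dominates j
    when i >= j is known in every variable and i > j in at least one."""
    return np.logical_and.reduce(ge_list) & np.logical_or.reduce(gt_list)

def mu_sums(dom):
    """µ-score: (# dominated) - (# dominating); weight: one common choice is
    the number of subjects comparable to each subject."""
    return dom.sum(axis=1) - dom.sum(axis=0), dom.sum(axis=1) + dom.sum(axis=0)

# hypothetical bivariate phenotype: x1 observed exactly,
# x2 interval-censored for the second subject (bounds [1, 3])
x1  = np.array([1.0, 2.0, 3.0, 4.0])
lo2 = np.array([0.0, 1.0, 2.5, 5.0])
hi2 = np.array([0.0, 3.0, 2.5, 5.0])

ge1, gt1 = pwo(x1, x1)          # exact data: lower bound == upper bound
ge2, gt2 = pwo(lo2, hi2)
score, weight = mu_sums(mu_and([ge1, ge2], [gt1, gt2]))
print(score)    # → [-3  0  0  3]
print(weight)   # → [3 2 2 3]
```

Note how the censored second subject remains incomparable to the third (its interval [1, 3] overlaps the exact value 2.5), which lowers its weight without biasing its score.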

Non-parametric tests are often considered the second choice, to be used primarily, as Friedman implied, “*to avoid the assumption of normality*”[48]. Being based on least-squares estimates, rather than maximum likelihood, ANOVA is asymptotically distribution-free. As Scheffé demonstrated in Chapter 10 of *The Analysis of Variance*[5], “*Nonnormality has little effect on inferences about means*”, in general, and in ANOVA in particular, because the CLT holds even for moderately sized samples. Hence, when methods based on both the linear model and u-statistics are available, the choice between these approaches should rely primarily on the hypothesis to be tested, rather than on the empirical distribution of the residuals[101]. If differences of one unit have the same ‘meaning’ across the scale, the linear model is more appropriate. If, as in many biological applications, no truly ‘linearizing’ transformation exists, tests based on u-statistics often give more meaningful results[9].

Another reason for using methods based on the linear model is that, with the exception of some special cases[102, 103], non-parametric methods have not yet been generalized to complex factorial designs. On the other hand, factor and sub-factor structures for graded and censored variables often require restrictive assumptions for the linear model to apply. Hence, a new distinction between parametric and non-parametric methods emerges: the former are more appropriate to reflect controlled structures among independent variables, while the latter may be better suited to reflect uncontrolled structures among dependent variables.

When Fisher introduced structured experimental designs[7], he did so with an emphasis on agricultural problems, where the assumption of linearity between water, sunshine, and nutrients and the single outcome (yield) is often easily justified. In molecular biology, in contrast, experiments are often simple, but the confounding variables and outcomes can be complex, with a structure that is only partially known. Thus, molecular biology may eventually prove to be a field where the recent advances in non-parametric statistics are especially useful. In particular, the recent generalizations to structured multi-variate data may, together with sequencing methods and grid or cloud computing, allow personalized medicine to become a successful strategy for improving health while reducing overall health care cost.

The work was supported in part by Grant Number UL1RR024143 from the U.S. National Center for Research Resources (NCRR). Of the many colleagues who have contributed to this chapter through discussions and suggestions, I would like to thank, in particular, Jose F. Morales, Ephraim Sehayek, Sreeram Ramagopalan, and Martina Durner for their input on the biological background, Sreeram Ramagopalan, Bill Raynor, and Norman Cliff for their helpful comments, an anonymous reviewer for an inspiring discussion, and Daniel Eckardt for help with Latin grammar.

1. Collins FS, Green ED, Guttmacher AE, Guyer MS. A vision for the future of genomics research. Nature. 2003;422:835–847. [PubMed]

2. Butler D. The Grid: Tomorrow's computing today. Nature. 2003;422:799–800. [PubMed]

3. Pearson TA, Manolio TA. How to interpret a genome-wide association study. JAMA. 2008;299:1335–1344. [PubMed]

4. Psychiatric GWAS Consortium Coordinating Committee. Genomewide association studies: history, rationale, and prospects for psychiatric disorders. Am J Psychiatry. 2009;166:540–556. [PMC free article] [PubMed]

5. Scheffé H. The Analysis of Variance. New York, NY: Wiley; 1959.

6. Arbuthnot J. An argument for divine providence taken from the constant regularity observ'd in the births of both sexes. Philos Trans R Soc London. 1710;27:186–190.

7. Fisher RA. The Design of Experiments. Edinburgh: Oliver & Boyd; 1935.

8. Cliff N. Answering ordinal questions with ordinal data using ordinal statistics. Multivariate Behav Res. 1996;31:331–350.

9. Cliff N. Ordinal Methods for Behavioral Data Analysis. Mahwah, NJ: Lawrence Erlbaum; 1996.

10. Wilcoxon F. Individual comparisons by ranking methods. Biometrics. 1954;1:80–83.

11. Mann HB, Whitney DR. On a test of whether one of two random variables is stochastically larger than the other. Ann Math Stat. 1947;18:50–60.

12. Kruskal WH, Wallis WA. Use of ranks in one-criterion variance analysis. J Am Stat Assoc. 1952;47:583–631.

13. Lewis CT, Short C. A Latin Dictionary. Oxford: Clarendon; 1879.

14. Georges KE. Ausführliches lateinisch-deutsches Handwörterbuch. Hannover: Hahn; 1918.

15. Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response (vol 98, pg 5116, 2001) Proc Natl Acad Sci U S A. 2001;98:10515–10515. [PubMed]

16. van de Wiel MA. Significance Analysis of Microarrays using Rank Scores. Kwantitatieve Methoden. 2004;71:25–37.

17. Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009;10:57–63. [PMC free article] [PubMed]

18. McNemar Q. Note on the sampling error of the differences between correlated proportions or percentages. Psychometrika. 1947;12:153–157. [PubMed]

19. Gauss CF. Theoria combinationis observationum erroribus minimis obnoxiae. Goettingen: Dieterich; 1823.

20. Coakley CW, Heise MA. Versions of the sign test in the presence of ties. Biometrics. 1996;52:1242–1251.

21. Dixon WJ, Mood AM. The statistical sign test. J Am Stat Assoc. 1946;41:557–566. [PubMed]

22. Dixon WJ, Massey FJJ. An Introduction to Statistical Analysis. New York, NY: McGraw-Hill; 1951.

23. Rayner JCW, Best DJ. Modelling Ties in the Sign Test. Biometrics. 1999;55:663–665. [PubMed]

24. Rao PV, Kupper LL. Ties in paired-comparison experiments: a generalization of the Bradley-Terry model. J Am Stat Assoc. 1967;62:194–204.

25. David HA. The Method of Paired Comparisons. 2nd ed. London: Griffin; 1988.

26. Stern HAL. A continuum of paired comparisons models. Biometrika. 1990;77:265–273.

27. Yan T, Yang YN, Cheng X, DeAngelis MM, Hoh J, Zhang H. Genotypic association analysis using discordant-relative-pairs. Ann Hum Genet. 2009;73:84–94. [PubMed]

28. Spielman RS, McGinnis RE, Ewens WJ. Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM) Am J Hum Genet. 1993;52:506–516. [PubMed]

29. Wittkowski KM. Versions of the sign test in the presence of ties. Biometrics. 1998;54:789–791.

30. Wittkowski KM. An asymptotic UMP sign test for discretized data. The Statistician. 1989;38:93–96.

31. Wittkowski KM, Liu X. A statistically valid alternative to the TDT. Hum Hered. 2002;54:157–164. [PubMed]

32. Sasieni PD. From genotypes to genes: doubling the sample size. Biometrics. 1997;53:1253–1261. [PubMed]

33. Wittkowski KM. Friedman-type statistics and consistent multiple comparisons for unbalanced designs. J Am Stat Assoc. 1988;83:1163–1170.

34. Student. On the probable error of a mean. Biometrika. 1908;6:1–25.

35. Ramagopalan SV, McMahon R, Dyment DA, Sadovnick AD, Ebers GC, Wittkowski KM. An extension to a statistical approach for family based association studies provides insights into genetic risk factors for multiple sclerosis in the HLA-DRB1 gene. BMC Medical Genetics. 2009;10:10. [PMC free article] [PubMed]

36. Hafler DA, Compston A, Sawcer S, Lander ES, Daly MJ, De Jager PL, de Bakker PIW, Gabriel SB, Mirel DB, Ivinson AJ, Pericak-Vance MA, Gregory SG, Rioux JD, McCauley JL, Haines JL, Barcellos LF, Cree B, Oksenberg JR, Hauser SL. Risk alleles for multiple sclerosis identified by a genomewide study. N Engl J Med. 2007;357:851–862. [PubMed]

37. Barcellos LF, Sawcer S, Ramsay PP, Baranzini SE, Thomson G, Briggs F, Cree BC, Begovich AB, Villoslada P, Montalban X, Uccelli A, Savettieri G, Lincoln RR, DeLoa C, Haines JL, Pericak-Vance MA, Compston A, Hauser SL, Oksenberg JR. Heterogeneity at the HLA-DRB1 locus and risk for multiple sclerosis. Hum Mol Genet. 2006;15:2813–2824. [PubMed]

38. Ramagopalan SV, Ebers GC. Multiple sclerosis: major histocompatibility complexity and antigen presentation. Genome Med. 2009;1:105. [PMC free article] [PubMed]

39. Suárez-Fariñas M, Haider A, Wittkowski KM. “Harshlighting” small blemishes on microarrays. BMC Bioinformatics. 2005;6:65. [PMC free article] [PubMed]

40. Suarez-Farinas M, Pellegrino M, Wittkowski KM, Magnasco MO. Harshlight: a “corrective make-up” program for microarray chips. BMC Bioinformatics. 2005;6:294. [PMC free article] [PubMed]

41. Arteaga-Salas JM, Harrison AP, Upton GJG. Reducing spatial flaws in oligonucleotide arrays by using neighborhood information. Stat Appl Genet Mol Biol. 2008;7:19. [PubMed]

42. Arteaga-Salas JM, Zuzan H, Langdon WB, Upton GJG, Harrison AP. An overview of image-processing methods for Affymetrix GeneChips. Brief Bioinform. 2008;9:25–33. [PubMed]

43. Cairns JM, Dunning MJ, Ritchie ME, Russell R, Lynch AG. BASH: a tool for managing BeadArray spatial artefacts. Bioinformatics. 2008;24:2921–2922. [PMC free article] [PubMed]

44. Deuchler G. Über die Methoden der Korrelationsrechnung in der Pädagogik und Psychologie. Z pädagog Psychol. 1914;15:114–131. 145–159, 229–242.

45. Morales JF, Song T, Auerbach AD, Wittkowski KM. Phenotyping genetic diseases using an extension of μ-scores for multivariate data. Stat Appl Genet Mol Biol. 2008;7:19. [PubMed]

46. Kehoe JF, Cliff N. Interord: a computer-interactive Fortran IV program for developing simple orders. Educ Psychol Meas. 1975;35:675–678.

47. Kruskal WH. Historical notes on the Wilcoxon unpaired two-sample test. J Am Stat Assoc. 1957;52:356–360.

48. Friedman M. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J Am Stat Assoc. 1937;32:675–701.

49. Iain M, Urken AB. On elections by ballot. In: Iain M, Urken AB, editors. Classics of Social Choice. Ann Arbor, MI: University of Michigan Press; 1995. pp. 83–89.

50. Hägele G, Pukelsheim F. Llull's writings on electoral systems. Stud Lulliana. 2001;41:3–38.

51. Benard A, Van Elteren PH. A Generalization of the Method of m Rankings. Indagationes Mathematicae. 1953;15:358–369.

52. van Elteren P, Noether GE. The asymptotic efficiency of the χ_r^{2}-test for a balanced incomplete block design. Biometrika. 1959;46:475–477.

53. Durbin J. Incomplete blocks in ranking experiments. Br J Psychol. 1951;4:85–90.

54. Bradley RA, Milton ET. Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons. Biometrika. 1952;39:324–345.

55. Prentice MJ. On the problem of m incomplete rankings. Biometrika. 1979;66:167–170.

56. Alvo M, Cabilio P. General scores statistics on ranks in the analysis of unbalanced designs. Can J Stat. 2005;33:115–129.

57. Gao X, Alvo M. A unified nonparametric approach for unbalanced factorial designs. Journal of the American Statistical Association. 2005;100:926–941.

58. Lam FC, Longnecker MT. A modified Wilcoxon rank sum test for paired data. Biometrika. 1983;70:510–513.

59. Cronbach LJ, Meehl PE. Construct validity in psychological tests. Psychol Bull. 1955;52:281–302. [PubMed]

60. Popper KR. Logik der Forschung. Wien: Julius Springer; 1937.

61. Delbecq A. Group techniques for program planning. Glenview, IL: Scott Foresman; 1975.

62. Wittkowski KM, Song T, Anderson K, Daniels JE. U-Scores for Multivariate Data in Sports. J Quant Anal Sports. 2008;4:7. [PMC free article] [PubMed]

63. Freimer N, Sabatti C. The human phenome project. Nat Genet. 2003;34:15–21. [PubMed]

64. Wittkowski KM. Ein nichtparametrischer Test im Stufenblockplan [A nonparametric test for the step-down design]. Göttingen, D.: Georg-August-Universität, Institut für Medizinische Statistik; 1980. p. 87.

65. Wittkowski KM. Semiquantitative Merkmale in der nichtparametrischen Statistik. In: Köhler CO, Wagner E, Tautu P, editors. Der Beitrag der Informationsverarbeitung zum Fortschritt der Medizin. Berlin, D.: Springer; 1984. pp. 100–105.

66. Wittkowski KM. Small sample properties of rank tests for incomplete unbalanced designs. Biom J. 1988;30:799–808.

67. Wittkowski KM. An extension to Wittkowski. J Am Stat Assoc. 1992;87:258.

68. Einsele H, Ehninger G, Hebart H, Wittkowski KM, Schuler U, Jahn G, Mackes P, Herter M, Klingebiel T, Löffler J, et al. Polymerase chain reaction monitoring reduces the incidence of cytomegalovirus disease and the duration and side effects of antiviral therapy after bone marrow transplantation. Blood. 1995;86:2815–2820. [PubMed]

69. Talaat M, Wittkowski KM, Husein MH, Barakat R. A new procedure to access individual risk of exposure to cercariae from multivariate questionnaire data. In: Barlow R, Brown JW, editors. Reproductive Health and Infectious Diseases in the Middle East. Aldershot, U.K.: Ashgate; 1998. pp. 167–174.

70. Susser E, Desvarieux M, Wittkowski KM. Reporting sexual risk behavior for HIV: a practical risk index and a method for improving risk indices. Am J Public Health. 1998;88:671–674. [PubMed]

71. Wittkowski KM, Susser E, Dietz K. The protective effect of condoms and nonoxynol-9 against HIV infection. Am J Public Health. 1998;88:590–596. 972. [PubMed]

72. Banchereau J, Palucka AK, Dhodapkar M, Kurkeholder S, Taquet N, Rolland A, Taquet S, Coquery S, Wittkowski KM, Bhardwj N, Pineiro L, Steinman R, Fay J. Immune and clinical responses after vaccination of patients with metastatic melanoma with CD34+ hematopoietic progenitor-derived dendritic cells. Cancer Res. 2001;61:6451–6458. [PubMed]

73. Hoeffding W. A class of statistics with asymptotically normal distribution. Ann Math Stat. 1948;19:293–325.

74. Wittkowski KM. Novel Methods for Multivariate Ordinal Data applied to Genetic Diplotypes, Genomic Pathways, Risk Profiles, and Pattern Similarity. Comp Sci Stat. 2003;35:626–646.

75. Wittkowski KM, Liu X. Beyond the TDT: Rejoinder to Ewens and Spielman. Hum Hered. 2004;58:60–61.

76. Wittkowski KM, Lee E, Nussbaum R, Chamian FN, Krueger JG. Combining several ordinal measures in clinical studies. Stat Med. 2004;23:1579–1592. [PubMed]

77. Gehan EA. A generalised two-sample Wilcoxon test for doubly censored samples. Biometrika. 1965;52:650–653. [PubMed]

78. Gehan EA. A generalised Wilcoxon test for comparing arbitrarily singly censored samples. Biometrika. 1965;52:203–223. [PubMed]

79. Schemper M. A nonparametric k-sample test for data defined by intervals. Statistica Neerlandica. 1983;37:69–71.

80. Lehmann EL. Consistency and unbiasedness of certain nonparametric tests. Ann Math Stat. 1951;22:165–179.

81. Hoeffding W. The Collected Works of Wassily Hoeffding. New York, NY: Springer; 1994.

82. Rosenbaum PG. Coherence in observational studies. Biometrics. 1994;50:368–374. [PubMed]

83. Song T, Coffran C, Wittkowski KM. Screening for gene expression profiles and epistasis between diplotypes with S-Plus on a grid. Stat Comput Graph. 2007;18:20–25.

84. Cherchye L, Vermeulen F. Robust rankings of multidimensional performances: an application to Tour de France racing cyclists. J Sports Econ. 2006;7:359–373.

85. Quaia E, D'Onofrio M, Cabassa P, Vecchiato F, Caffarri S, Pittiani F, Wittkowski KM, Cova MA. Diagnostic value of hepatocellular nodule vascularity after microbubble injection for characterizing malignancy in patients with cirrhosis. Am J Roentgenol. 2007;189:1474–1483. [PMC free article] [PubMed]

86. Ramamoorthi RV, Rossano MG, Paneth N, Gardiner JC, Diamond MP, Puscheck E, Daly DC, Potter RC, Wirth JJ. An application of multivariate ranks to assess effects from combining factors: Metal exposures and semen analysis outcomes. Stat Med. 2008;27:3503–3514. [PubMed]

87. Shockley W, Bardeen J, Brattain WH. The electronic theory of the transistor. Science. 1948;108:678–679.

88. Häberle L, Pfahlberg A, Gefeller O. Assessment of Multiple Ordinal Endpoints. Biom J. 2009;51:217–226. [PubMed]

89. O'Brien PC. Procedures for comparing samples with multiple endpoints. Biometrics. 1984;40:1079–1087. [PubMed]

90. Diana M, Song T, Wittkowski K. Studying travel-related individual assessments and desires by combining hierarchically structured ordinal variables. Transportation. 2009;36:187–206. [PMC free article] [PubMed]

91. Kendall MG. A new measure of rank correlation. Biometrika. 1938;30:81–93.

92. Jonckheere AR. A distribution-free k-sample test against ordered alternatives. Biometrika. 1954;41:133–145.

93. Terpstra TJ. The asymptotic normality and consistency of Kendall's test against trend when ties are present in one ranking. Indagationes Mathematicae. 1952;14:327–333.

94. Spangler R, Wittkowski KM, Goddard NL, Avena NM, Hoebel BG, Leibowitz SF. Opiate-like Effects of Sugar on Gene Expression in Reward Areas of the Rat Brain. Mol Brain Res. 2004;124:134–142. [PubMed]

95. Wittkowski KM, Seybold MP, Schneider EM. Neue nicht-parametrische Methoden und Tools für die Auswertung multivariater Daten in der Klinischen Forschung und Diagnostik. Dtsch Z Klin Forsch. 2008;12:22–27.

96. Morales JF, Song T, Wittkowski KM, Auerbach AD. A statistical systems biology approach to FANCC gene expression suggests drug targets for Fanconi anemia. (submitted)

97. Armitage P. Tests for linear trends in proportions and frequencies. Biometrics. 1955;11:375–386.

98. Lehmann EL. Some model I problems of selection. Ann Math Stat. 1961;32:990–1012.

99. Seybold MP, Wittkowski KM, Schneider EM. Biomarker analysis using a non-parametric selection procedure to discriminate the phagocytic syndromes HLH (hemophagocytic lymphohistiocytosis) and mas (macrophage activation syndrome) Shock. 2008;29:90–90.

100. Kraft P, Hunter DJ. Genetic risk prediction -- Are we there yet? N Engl J Med. 2009;360:1701–1703. [PubMed]

101. Wittkowski KM. Statistical knowledge-based systems — critical remarks and requirements for approval. Computer Methods and Programs in Biomedicine. 1990;33:255–259. [PubMed]

102. Akritas MG, Arnold SF, Brunner E. Nonparametric hypotheses and rank statistics for unbalanced factorial designs, Part I. J Am Stat Assoc. 1997;92:258–265.

103. Brunner E, Munzel U, Puri ML. Rank-score tests in factorial designs with repeated measures. Journal of Multivariate Analysis. 1999;70:286–317.
