Ann Appl Stat. Author manuscript; available in PMC 2010 June 3.
Published in final edited form as:
Ann Appl Stat. 2009; 3(4): 1616–1633.
PMCID: PMC2880335
NIHMSID: NIHMS199316

# APPROXIMATE NULL DISTRIBUTION OF THE LARGEST ROOT IN MULTIVARIATE ANALYSIS1

## Abstract

The greatest root distribution occurs everywhere in classical multivariate analysis, but even under the null hypothesis the exact distribution has required extensive tables or special purpose software. We describe a simple approximation, based on the Tracy–Widom distribution, that in many cases can be used instead of tables or software, at least for initial screening. The quality of approximation is studied, and its use illustrated in a variety of setttings.

Key words and phrases: Canonical correlation, characteristic root, equality of covariance matrices, greatest root statistic, largest eigenvalue, MANOVA, multivariate linear model, Tracy-Widom distribution

## 1. Introduction

The greatest root distribution is found everywhere in classical multivariate analysis. It describes the null hypothesis distribution for the union intersection test for any number of classical problems, including multiple response linear regression, MANOVA, canonical correlations, equality of covariance matrices and so on. However, the exact null distribution is difficult to calculate and work with, and so the use of extensive tables or special purpose software has always been necessary.

This paper describes a simple asymptotic approximation, based on the Tracy Widom distribution. The approximation is not solely asymptotic; we argue that it is reasonably accurate over the entire range of the parameters. “Reasonably accurate” means, for example, less than ten percent relative error in the 95th percentile, even when working with two variables and any combination of error and hypothesis degrees of freedom.

This paper focuses on the approximation, its accuracy and its applicability to a range of problems in multivariate analysis. A companion paper [Johnstone (2008)] contains all proofs and additional discussion.

Our main claim is that for many applied purposes, the Tracy–Widom approximation can often, if not quite always, substitute for the elaborate tables and computational procedures that have until now been needed. Our hope is that this paper might facilitate the use of the approximation in applications in conjunction with appropriate software.

### 1.1. A textbook example

To briefly illustrate the Tracy–Widom approximation in action, we revisit the rootstock data, as discussed in Rencher (2002), pages 170–173. In a classical experiment carried out from 1918–1934, apple trees of different rootstocks were compared (Andrews and Herzberg [(1985), pages 357–360] has more detail). Rencher (2002) gives data for eight trees from each of six rootstocks. Four variables are measured for each tree: Girth4= trunk girth at 4 years in mm, Growth4= extension growth at 4 years in m, Girth15= trunk girth at 15 years in mm, and Wt15= weight of tree above ground at 15 years in lb.

 Stock Girth4 Growth4 Girth15 Wt15 1 I 111 2.569 358 760 2 I 119 2.928 375 821 … 47 VI 113 3.064 363 707 48 VI 111 2.469 395 952

A one-way multivariate analysis of variance can be used to examine the hypothesis of equality of the four-dimensional vectors of mean values corresponding to each of the six groups (rootstocks). The standard tests are based on the eigenvalues of (W + B)−1B, where W and B are the sums of squares and products matrices within and between groups respectively. We focus here on the largest eigenvalue, with observed value θobs = 0.652. Critical values of the null distribution depend on parameters, here s = 4, m = 0, n = 18.5 [using (8) below, along with the conventions of Section 5.1 and Definition θ]. Traditionally these are found by reference to tables or charts. Here, the 0.05 critical value is found—after manual interpolation in those tables—to be θ0.05 = 0.377. The approximation (6) of this paper yields the approximate 0.05 critical value $θ0.05TW=0.384$, which clearly serves just as well for rejection of the null hypothesis.

It is more difficult in standard packages to obtain p-values corresponding to θobs. The default is to use a lower bound based on the F distribution [see (12)], here pF (θobs) = 1.7× 10−8, which is anti-conservative and several orders of magnitude below the Tracy–Widom approximation given in this paper at (11), pTW (θobs) = 5.6 × 10−5. The latter is much closer to the formally correct value,2 p(θobs) = 3.7 × 10−6. When p-values are very small, typically only the order of magnitude is of interest. We suggest in Section 2.2 that the Tracy–Widom approximation generally comes close to the correct order of magnitude, whereas the default F bound is often off by several orders.

### 1.2. Organization of paper

The rest of this introduction provides enough background to state the main Tracy–Widom approximation result. Section 2 focuses on the quality of the approximation by looking both at conventional percentiles and at very small p-values. The remaining Sections 3–6 describe some of the classical uses of the largest root test in multivariate analysis, in each case in enough detail to identify the parameters used. Some extra attention is paid in Section 6 to the multivariate linear model, in view of the wide variety of null hypotheses that can be considered.

### 1.3. Background

Our setting is the distribution theory associated with sample draws from the multivariate normal distribution. For definiteness, we use the notation of Mardia, Kent and Bibby (1979), to which we also refer for much standard background material. Thus, if x1, …, xn denotes a random sample from Np(μ, Σ), a p-variate Gaussian distribution with mean μ and covariance matrix Σ, then we call the n × p matrix X = (x1, …, xn)′, whose ith row contains the ith sample p-vector, a normal data matrix.

A p × p matrix A that can be written A = XX in terms of such a normal data matrix is said to have a Wishart distribution with scale matrix Σ and degrees of freedom parameter n, A ~ Wp(Σ, n). When p = 1, this reduces to a scaled chi-squared law $σ2χ(n)2$.

We consider analogs of the F and Beta distributions of multivariate analysis, which are based on two independent chi-squared variates. Thus, let A ~ Wp(Σ, m) be independent of B ~ Wp(Σ, n). If mp, then A−1 exists and the nonzero eigenvalues of A−1B are quantities of interest that generalize the univariate F ratio. We remark that the scale matrix Σ has no effect on the distribution of these eigenvalues, and so, without loss of generality, we can suppose that Σ = I.

The matrix analog of a Beta variate is based on the eigenvalues of (A + B)−1B, and leads to the following:

#### Definition θ [Mardia, Kent and Bibby (1979), page 84]

Let A ~ Wp(I, m) be independent of B ~ Wp(I, n), where mp. Then the largest eigenvalue θ of (A + B)−1B is called the greatest root statistic and its distribution is denoted θ (p, m, n).

Since A is positive definite, we have 0 < θ < 1. Clearly θ (p, m, n) can also be defined as the largest root of the determinantal equation

$det[B−θ(A+B)]=0.$

Specific examples will be given below, but in general the parameter p refers to dimension, m to the “error” degrees of freedom and n to the “hypothesis” degrees of freedom. Thus, m + n represents the “total” degrees of freedom.

There are min(n, p) nonzero eigenvalues of A−1B or, equivalently, min(n, p) nonzero roots θ = (θi) of the determinantal equation above. The joint density function of these roots is given by

$p(θ)=C∏i=1min(n,p)θi(∣n−p∣−1)/2(1−θi)(m−p−1)/2Δ(θ),$
(1)

where Δ(θ) = Πij |θiθj| (see, e.g., Muirhead [(1982), page 112], or Anderson [(2003), pages 536–537]). We shall not need the explicit form of the density in this paper; it is, however, useful sometimes in matching up the various parameter choices used in different references and packages.

The greatest root distribution has the property

$θ(p,m,n)=Dθ(n,m+n−p,p),$

useful, in particular, in the case when n < p [e.g. Mardia, Kent and Bibby (1979), page 84].

### 1.4. Main result

Empirical and theoretical investigation has shown that it is useful to develop the approximation in terms of the logit transform of θ; thus, we define

$W(p,m,n)=logitθ(p,m,n)=log(θ(p,m,n)1−θ(p,m,n)).$
(2)

Our main result, stated more formally below, is that with appropriate centering and scaling, W is approximately Tracy–Widom distributed:

$W(p,m,n)−μ(p,m,n)σ(p,m,n)⇒DF1.$
(3)

The centering and scaling parameters are defined by

$μ(p,m,n)=2logtan(φ+γ2),$
(4)

$σ3(p,m,n)=16(m+n−1)21sin2(φ+γ)sinφsinγ,$
(5)

where the angle parameters γ, are defined by

$sin2(γ2)=min(p,n)−1/2m+n−1,sin2(φ2)=max(p,n)−1/2m+n−1.$

### 1.5. More on the Tracy–Widom law

The F1 distribution, due to Tracy and Widom (1996) and plotted in Figure 1, has its origins in mathematical physics—see Tracy and Widom (1996); Johnstone (2001) for further details. The density is asymmetric, with mean −1.21 and SD 1.27. Both tails have exponential decay, the left tail like e−|s|3/24 and the right tail like e−(2/3)s3/2.

Density of the Tracy–Widom distribution F1.

For the present paper, what is important is that the F1 distribution does not depend on any parameters, and the distribution itself, along with its inverse and percentiles, can be tabulated as univariate special functions. These functions play the same role in this paper as the standard normal distribution Φ, its inverse Φ−1 and percentiles zα play in traditional statistical application.

#### Software

An R package RMTstat is available at CRAN (cran.r-project.org). It facilitates computation of the distributional approximations and largest root tests described in this paper, and the use of percentiles and random draws from the F1 distribution. Its scope and use is described in more detail in an accompanying report Johnstone et al. (2010). A parallel MATLAB package is in development; it will also contain code to reproduce the figures and table in this paper.

#### Percentiles

Let fα denote the αth percentile of F1. For example,

$f0.90=0.4501,f0.95=0.9793,f0.99=2.0234.$

Then the αth percentile of θ (p, m, n) is given approximately by

$θα=eμ+fασ/(1+eμ+fασ),$
(6)

where μ = μ(p, m, n), σ = σ (p, m, n) are given by (4) and (5).

The more formal statement of (3) goes as follows. Assume p, m and n → ∞ together in such a way that

$limp∧nm+n>0,limmp>1.$
(7)

For each s0 , there exist c, C > 0 such that for ss0,

$∣P{W(p,m,n)≤μ(p,m,n)+σ(p,m,n)s}−F1(s)∣≤Cp−2/3e−cs.$

For the full proof and much more discussion and detail, see the companion paper [Johnstone (2008)].

##### Smallest eigenvalue

If A and B are as in the definition of θ (p, m, n), then let (p, m, n) denote the smallest eigenvalue of (A + B)−1B. Its distribution is given by

$θ∼(p,m,n)=D1−θ(p,n,m),$

(note the reversal of m and n!). In particular, the Tracy–Widom distribution will give a generally useful approximation to the lower tail of (p, m, n).

##### Complex-valued data

There is an entirely analagous result when A and B follow complex Wishart distributions, with a modified limit distribution F2. Details are given in Johnstone (2008).

## 2. Quality of approximation

### 2.1. Comparison with percentiles

There is a substantial literature computing percentage points of the greatest root distribution for selected parameter values, partially reviewed below. The standard paramaterization used in these tables arises from writing the joint density of the roots θi as

$p(θ)=C∏i=1sθim(1−θi)nΔ(θ).$

From this and (1) it is apparent that our “MKB” parameters (p, m, n) are related to the “Table” parameters (s, m, n) via

Comparison of the Tracy–Widom approximation and F bound for cases with s = 2 and s = 6 variables
$s=min(n,p),p=s,m=(∣n−p∣−1)/2,m=s+2n+1,n=(m−p−1)/2,n=s+2m+1.$
(8)

In terms of the table parameters and N = 2(s + m + n) + 1, the centering and scaling constants of the Tracy–Widom approximation are given by

$sin2(γ2)=(s−12)/N,sin2(φ2)=(s+2m+12)/N$

and

$μ=2logtan(φ+γ2),σ3=16N21sin2(φ+γ)sinφsinγ.$
(9)

We turn to the comparison of percentage points $θαTW$ from the Tracy–Widom approximation (6) with the exact values θα for small values of the table parameters (s, m, n). The most extensive tabulations of θα (s, m, n) have been made by William Chen; he has graciously provided the author with the complete version of the tables excerpted in Chen (2002, 2003, 2004b, 2004a).

Figures 2 and and33 plot $θαTW$ against θα at 95th and 90th percentiles for s = 2. This is the smallest relevant value of s—otherwise we are in the univariate case covered by F distributions. The bottom panels, in particular, focus on the relative error

Comparison of exact and approximate 95th percentiles for s = 2. Top panel: solid line is the Tracy–Widom approximation $θαTW(2,m,n)$ plotted as a function of m for values of n shown. Dashed lines are the exact percentiles θ ...
Comparison of exact and approximate 90th percentiles for s = 4. Top panel: solid line is the Tracy–Widom approximation $θαTW(4,m,n)$ plotted as a function of m for values of n shown. Dashed lines are the exact percentiles θ ...
$r=(θαTW/θα)−1.$

Figure 2 shows that even for s = 2, the 95th percentile of the TW approximation has a relative error of less than 1 in 20 except in the zone where both m ≤ 2 and n ≥ 10, where the relative error is still less than 1 in 10. Note that the relative error is always positive in sign, implying that the approximate critical points yield a conservative test. More extensive contour plots covering s = 2(1)6 and 90th, 95th and 99th percentiles may be found in Johnstone and Chen (2007).

#### Work on tables

There has been a large amount of work to prepare tables or charts for the null distribution of the largest root, much of which is reviewed in Chen (2003). We mention contributions by the following: Nanda (1948, 1951); Foster and Rees (1957); Foster (1957, 1958); Pillai (1955, 1956b, 1956a, 1957a, 1965a, 1967); Pillai and Bantegui (1959); Heck (1960); Krishnaiah (1980); Pillai and Flury (1984); Chen (2002, 2003, 2004b, 2004a).

Because of the dependence on the three parameters, these tables can run up to 25 pages in typical textbooks, such as those of Johnson and Wichern (2002) and Morrison (2005).

#### Code

Constantine (1963) expresses the c.d.f. of the largest root distribution in terms of a matrix hypergeometric function. Koev and Edelman (2006) have developed efficient algorithms (and a MATLAB package available at http://www-math.mit.edu/~plamen) for the evaluation of such matrix hypergeometric functions using recursion formulas from group representation theory.

Koev (2010) collects useful formulas and explains how to use them and mhg to compute the exact c.d.f. and percentiles for the largest root distribution over a range of values of the “MKB” parameters corresponding to m, n, p ≤ 17, and m, n, p ≤ 40 when np is odd.

SAS/STAT 9.0 made available an option for computing exact p-values using Davis (1972); Pillai and Flury (1984). There is also some stand-alone software described by Lutz (1992, 2000).

### 2.2. Accuracy of p-values

#### The univariate F bound

We recall the hypothesis that A ~ Wp(I, m) be distributed independently of B ~ Wp(I, n), and the characterization of the largest eigenvalue given by

$λmax(A−1B)=max∣u∣=1u′Buu′Au.$
(10)

For fixed u of unit length, the numerator and denominator are distributed as independent $χ(n)2$ and $χ(m)2$ respectively, and so, again for fixed u, the ratio has an Fn,m distribution. Consequently, we have the simple bound

$mnλmax(A−1B)>F∼Fn,m.$

Using the Fn,m distribution in place of the actual greatest root law yields a lower bound for the significance level, or p-value. We shall see that this bound can be anti-conservative by several orders of magnitude, leading to overstatements of the empirical evidence against the null hypothesis. And furthermore, one can expect that the higher the dimension p of the search space in (10), the worse the bound provided by the F distribution.

The default p-value provided in both SAS and R (through package car) uses this unsatisfactory distribution bound.

Table 1 attempts to capture a variety of scenarios within the computational range of Koev’s software.

Column Exact shows a range of significance levels α covering several orders of magnitude. Column Largest Root shows the corresponding quantiles θα of the largest root distribution, for the given values of (s, m, n)—these are computed using Koev’s MATLAB routine qmaxeigjacobi. Thus, an observed value of θ (s, m, n) = θα would correspond to an exact p-value α.

The remaining columns compare the Tracy–Widom approximation and the F bound. The p-value obtained from the Tracy–Widom approximation is given by

$PTW(θα)=1−F1((logit(θα)−μ)/σ),$
(11)

where μ and σ are computed from (9).

The F bound on the p-value is given by

$P(θ(s,m,n)>θα)>PF(θα)=1−Fν1,ν2(ν2θα/(ν1(1−θα))),$
(12)

where ν1 = s+ 2m+ 1 and ν2 = s+ 2n+ 1 denote the hypothesis and error degrees of freedom respectively.

The two tables consider s = 2 and 6 variables respectively. The values of m = −0.5 and 5 correspond to s and s + 11 hypothesis degrees of freedom, while the values of n = 2 and 10 translate to s + 5 and s + 21 error degrees of freedom respectively.

At the 10% and 5% levels, the Tracy–Widom approximation is within 20% of the true p-value at s = 6, and within 35% of truth at s = 2. The F-value is wrong by a factor of four or more at s = 2, and by three orders of magnitude at s = 6. At smaller significance levels, the Tracy–Widom approximation generally stays within one order of magnitude of the correct p-value—except at (s, m, n) = (2, −0.5, 10). The F approximation is off by many orders of magnitude when s = 6.

In addition, we note that the Tracy–Widom approximation is conservative in nearly all cases, the exception being for θ ≥ 0.985 in the case (s, m, n) = (6, −0.5, 2). In contrast, the F approximation is always [cf. (12)] anti-conservative, often badly so.

In applications one is often concerned only with the general order of magnitude of the p-values associated with tests of the various hypotheses that are entertained—not least because the assumptions of the underlying model are at best approximately true. For this purpose, then, it may be argued that the TW approximate p-value is often quite adequate over the range of (s, m, n) values. Of course, if (s, m, n) is not too large and greater precision is required, then exact p-values can be sought, using, for example, SAS or Koev’s software.

## 3. Testing for independence of two sets of variables

Let x1, …, xn be a random sample from Np(μ, Σ). Partition the variables into two sets with dimensions p1 and p2 respectively, p1 + p2 = p. Suppose that Σ and the sample covariance matrix S are partitioned correspondingly. We consider testing the null hypothesis of independence of the two sets of variables: Σ12 = 0. The union-intersection test is based on the largest eigenvalue λ1 of $S22−1S21S11−1S12$ (Mardia, Kent and Bibby (1979), page 136) and under H0, λ1 has the greatest root distribution θ (p2, n − 1 − p1, p1). Mardia, Kent and Bibby (1979) consider an example test of independence of n = 25 head length and breadth measurements between first sons and second sons, so that p1 = p2 = 2. The observed value $λ1obs=0.622$ exceeds the critical value θ0.05 = 0.330 found by interpolation from the tables. The Tracy–Widom approximation $θ0.05TW=0.356$ is found from (6) and serves equally well for rejection of H0 in this case.

## 4. Canonical correlation analysis

Again we have two sets of variables, an x-set with q variables and a y-set with p variables. The goal is to find maximally correlated linear combinations η = ax and = by. We suppose that (X, Y) is a data matrix of n samples (rows) on q + p variables (columns) such that each row is an independent draw from Np+q (μ, Σ). Again let S be the sample covariance matrix, assumed partitioned $S=(S11S12S21S22)$. The sample squared canonical correlations ( $ri2$) for i = 1, …, k = min(p, q) are found as the eigen-values of $M2=S22−1S21S11−1S12$ [Mardia, Kent and Bibby (1979), Sections 10.2.1 and 10.2.2]. The population squared canonical correlations $ρi2$ are, in turn, the eigenvalues of $∑22−1∑21∑11−1∑12$. In both cases, we assume that the correlations are arranged in decreasing order. The test of the null hypothesis of zero correlation, H0: ρ1 = ··· = ρk = 0, is based on the largest eigenvalue $r12$ of M2. Under H0, it is known that $r12$ has the θ (p, nq −1, q) distribution, so that the Tracy–Widom approximation can be applied.

### Nonnull cases—a conservative test

Often it may be apparent that the first k canonical correlations are nonzero and the main interest focuses on the significance of $rk+12,rk+22$, etc. We let Hs denote the null hypothesis that ρs+1 = ··· = ρp = 0, and write (rk|p, q, n; Σ) for the distribution of the rth c.c. under population covariance matrix Σ. When the covariance matrix Σ Hs, the (s + 1)st canonical correlation is stochastically smaller than the largest canonical correlation in a related null model:

#### Lemma 1

If Σ Hs, then

$L(rs+1∣p,q,n;∑)

This nonasymptotic result follows from interlacing properties of the singular value decomposition (Appendix). Since $L(r12∣p,q−s,n;I)$ is given by the null distribution θ (p, n + sq − 1, qs), we may use the latter to provide a conservative p-value for testing Hs. In turn, the p-value for θ (p, n + sq − 1, qs) can be numerically approximated as in (6) using the Tracy–Widom distribution.

##### Example

Waugh (1942) gave perhaps the first significant illustration of CCA using data on n = 138 samples of Canadian Hard Red Spring wheat and the flour made from each of these samples. The aim was to seek highly correlated indices ax of wheat quality and by of flour quality, since a well correlated grading of raw (wheat) and finished (flour) products was believed to promote fair pricing of each. In all, q = 5 wheat characteristics—kernel texture, test weight, damaged kernels, foreign material, crude protein in wheat—and p = 4 flour characteristics—wheat per bushel of flour, ash in flour, crude protein in flour, gluten quality index—were measured. The resulting squared canonical correlations were $(r12,r22,r32,r42)=(0.923,0.554,0.056,0.008)$. The leading correlation would seem clearly significant and, indeed, from our approximate formula (6), $θ0.99TW=0.184$.

To assess the second correlation r2, we appeal to the conservative test discussed above based on the null distribution with q −1 = 4, p = 4 and n = 138. The Tracy–Widom approximation $θ0.99TW≈μ+2.023σ≐0.166≪0.554$, which strongly suggests that this second correlation is significant as well.

Marginal histograms naturally reveal some departures from symmetric Gaussian tails, but they do not seem extreme enough to invalidate the conclusions, which are also confirmed by permutation tests.

## 5. Tests of common means or variances

### 5.1. Equality of means for common covariance

Suppose that we have k populations with independent data matrices Xi consisting of ni observations drawn from an Np(μi, Σi) and put n = Σni. This is the one-way multivariate analysis of variance illustrated in Example 1.1. For testing the null hypothesis of equality of means H0: μ1 = ··· = μk, we form, for each population, the sample mean i and covariance matrix Si, normalized so that ni Si ~ Wp(Σ, ni − 1). The basic quantities are the within groups sum of squares W = Σ ni Si ~ Wp(Σ, nk) and the between group sum of squares B = Σ ni (i)( i)′ ~ Wp(Σ, k − 1) under H0, independently of W. The union-intersection test of H0 uses the largest root of W−1B or, equivalently, that of (W + B)−1B, and the latter has, under H0, the θ (p, nk, k − 1) distribution.

### 5.2. Equality of covariance matrices

Suppose that independent samples from two normal distributions Np(μ1, Σ1) and Np(μ2, Σ2) lead to covariance estimates Si which are independent and Wishart distributed on ni degrees of freedom: ni Si ~ Wp(ni, Σi) for i = 1, 2. Then the largest root test of the null hypothesis H0: Σ1 = Σ2 is based on the largest eigenvalue θ of (n1S1 + n2S2)−1n2S2, which under H0 has the θ (p, n1, n2) distribution Muirhead (1982), page 332.

## 6. Multivariate linear model

The multivariate linear model blends ideas well known from the univariate setting with new elements introduced by correlated multiple responses. In view of the breadth of models covered, and the variety of notation in the literature and in the software, we review the setting in a little more detail, beginning with the familiar model for a single response

$y=Xβ+u.$

Here y is an n × 1 column vector of observations on a response variable, X is an n × q model matrix, and u is an n × 1 column vector of errors, assumed here to be independent and identically distributed as N (0, σ2). The q × 1 vector β of unknown parameters has the least squares estimate—when X has full rank—given by

$β^=(X′X)−1X′y.$

The error sum of squares SSE = (yX)′(yX) = yPy, where P denotes orthogonal projection onto the subspace orthogonal to the columns of X, it has rank nq, and so $SSE∼χ(n−q)2$.

Consider the linear hypothesis H0: C1β = 0, where C1 is a g × q matrix of rank g. In the simplest example, C1 = [Ig 0] extracts the first g elements of β; more generally, the rows of C1 are often contrasts among the components of β. To describe the standard F-test of H0, let C2 be any (qg) × q matrix such that $C=(C1C2)$ becomes an invertible q × q matrix. We may then write

$Xβ=[XC(1)XC(2)](C1βC2β),$

where we have partitioned C−1 = [C(1) C(2)] into blocks with g and qg columns respectively.

Let P1 denote the orthogonal projection onto the subspace orthogonal to the columns of XC(2). We have the sum of squares decomposition

$y′P1y=y′Py+y′(P1−P)y$

and the hypothesis sum of squares for testing H0: C1β = 0 is given by SSH = yP2y, with P2 = P1P. The projection P2 has rank g and so under H0, $SSH∼χ(g)2$. The projections P and P2 are orthogonal and so the sums of squares have independent chi-squared distributions, and under H0 the traditional F-statistic

$F=SSH/gSSE/(n−q)∼Fg,n−q.$

Explicit expressions for the sums of squares are given by

$SSE=y′(I−X(X′X)−1X′)y,SSH=(C1β^)′[C1(X′X)−1C1′]−1C1β^.$

In the multivariate linear model,

$Y=XB+U,$

the single response y is replaced by p response vectors, organized as columns of the n × p matrix Y. The model (or design) matrix X remains the same for each response; however, there are separate vectors of unknown coefficients and errors for each response; these are organized into a q × p matrix B of regression coefficients and an n × p matrix E of errors. The multivariate aspect of the model is the assumption that the rows of U are indepedent, with multivariate normal distribution having mean 0 and common covariance matrix Σ. Thus, U is a normal data matrix of n samples from Np(0, Σ). Assuming for now that the model matrix X has full rank, the least squares estimator

$B^=(X′X)−1X′Y.$

The linear hypothesis becomes

$H0:C1B=0.$

The sums of squares of the univariate case are replaced by hypothesis and error sums of squares and products matrices:

$E=Y′PY=Y′(I−X(X′X)−1X′)Y,H=Y′P2Y=(C1B^)′[C1(X′X)−1C1′]−1C1B^,$
(13)

in which the univariate vectors y and are simply replaced by their multivariate analogs Y and . It follows that E ~ Wp(Σ, nq) and that under H0, H ~ Wp(Σ, g); furthermore, E and H are independent. Generalizations of the F-test are obtained from the eigenvalues (λi) of the matrix E−1H or, equivalently, the eigenvalues θi of (H + E)−1H.

Thus, under the null hypothesis C1B = 0, Roy’s maximum root statistic θ1 has null distribution

$θ1∼θ(p,n−q,g),wherep=dimension,g=rank(C1),q=rank(X),n=samplesize.$
(14)

### Two extensions

#### (a) X not of full rank

This situation routinely occurs when redundant parameterizations are used, for example, when dealing with factors in analysis of variance models. One approach (e.g., MKB, Section 6.4) is to rearrange the columns of X and partition X = [X1 X2] so that X1 has full rank. We must also assume that the matrix C1 is testable in the sense that, as a function of B, XB = 0 implies C1B = 0. In such cases, if we partition C1 = [C11 C12] conformally with X, then $C12=C11(X1′X1)−1X1′X2$ is determined from C11.

With these assumptions, we use X1 and C11 in (13) and (14) in place of X and C1.

#### (b) Intra-subject hypotheses

A straightforward extension is possible in order to test null hypotheses of the form

$C1BM1=0,$

where M1 is p × r of rank r. The columns of M1 capture particular linear combinations of the dependent variables—for an example, see, e.g., Morrison (2005), Chapter 3.6.

We simply consider a modified linear model

$YM1=XBM1+UM1.$

An important point is that the rows of UM1 are still independent, now distributed as $Np(0,M1′,∑M)$. So we may simply apply the above analysis, replacing Y, U and B by YM1, UM1 and BM1 respectively. In particular, the greatest root statistic now has null distribution given by

$θ1∼θ(r,n−q,g).$

##### Linear hypotheses in SAS

Analyses involving the four multivariate tests are provided in a number of SAS routines, such as GLM and CANCORR. The parameterization used here can be translated into that used in SAS by means of the documentation given in the SAS/STAT Users Guide—we refer to the section on Multivariate Tests in version 9.1, page 48 ff. The linear hypotheses correspond to MKB notation via

while the parameters of the greatest root distribution are given by

MKBSAS

dimensionrrank(M1)rank(M)p
hypothesisgrank(C1)rank(L)q
errornqv

(Note: we use sans serif font for the SAS parameters!) Finally, the SAS printouts use the following parameters:

$s=p∧q,m=(∣p−q∣−1)/2,n=(v−p−1)/2.$

## 7. Concluding discussion

We have described the Tracy–Widom approximation to the null distribution for the largest root test for a variety of classical multivariate procedures. These procedures exhibit varying degrees of sensitivity to the assumption of normality, independence etc. Documenting the sensitivity/robustness of the T–W approximation is clearly an important issue for further work. Two brief remarks can be made. In the corresponding single Wishart setting [e.g., Johnstone (2001)], the largest eigenvalue can be shown, under the null distribution, to still have the T–W limit if the original data have “light tails” (i.e., sub-Gaussian) [see Soshnikov (2002); Péché (2009)]. In the double Wishart settings, simulations for canonical correlation analysis with n = 100 samples on q = 20 and p = 10 variables, each following i.i.d. t(5) or i.i.d. random sign distributions, showed that the T–W distribution for the leading correlation $r12$ still holds in the central 99% of the distribution.

## Acknowledgments

William Chen graciously provided an electronic copy of his tables for the distribution of the largest root.

## APPENDIX: PROOF OF LEMMA

If Σ Hs, there are at most s nonzero canonical correlations, and we may suppose without loss of generality that the q x-variables have been transformed so that only the last s of them have any correlation with Y. We employ the singular value decomposition (SVD) description of CCA, cf. Golub and Van Loan (1996), Section 12.4.3. Using QR decompositions, write

$X=QXRX,Y=QYRY.$

Let $C=QXTQY$ and form the SVD C = U RVT. Then the diagonal elements r1r2 ≥ ··· ≥ rmin(p,q) of R contain the sample canonical correlations.

Now consider the reduced n × (qs) matrix X obtained by dropping the last s columns from X. Form the QR decomposition X = QX RX. From the nature of the decomposition, we have QX = [QX Q+], that is, QX represents the first qs columns of QX. Consequently, $C−=QX−TQY$ forms the first qs rows of C. Our lemma now follows from the interlacing property of singular values [e.g., Golub and Van Loan (1996), Corollary 8.6.3].

$σs+1(C)≤σ1(C−).$

Indeed, our earlier discussion implies that X and Y are independent, and so σ1 (C) has the null distribution (r1|p, qs, n; I).

## Footnotes

1Supported in part by NSF DMS-05-05303, 09-06812 and NIH RO1 EB 001988.

2This (actually approximate) value is obtained by interpolation from Koev’s function pmaxeigjacobi which only handles integer values of n.

## References

• Anderson TW. An Introduction to Multivariate Statistical Analysis. 3. Wiley; Hoboken, NJ: 2003. MR1990662.
• Andrews DF, Herzberg AM. Data. Springer; New York: 1985.
• Chen WW. Some new tables of the largest root of a matrix in multivariate analysis: A computer approach from 2 to 6. Presented at the 2002 American Statistical Association.2002.
• Chen WW. Table for upper percentage points of the largest root of a determinantal equation with five roots. InterStat. 2003. Available at interstat.statjournals.net.
• Chen WW. The new table for upper percentage points of the largest root of a determinantal equation with seven roots. InterStat. 2004a. Available at interstat.statjournals.net.
• Chen WW. Some new tables for the upper probability points of the largest root of a determinantal equation with seven and eight roots. In: Dalton J, Kilss B, editors. Special Studies in Federal Tax Statistics. Statistics of Income Division, Internal Revenue Service; 2004b. pp. 113–116.
• Constantine AG. Some non-central distribution problems in multivariate analysis. Ann Math Statist. 1963;34:1270–1285. MR0181056.
• Davis AW. On the marginal distributions of the latent roots of the multivariate beta matrix. Ann Math Statist. 1972;43:1664–1670. MR0343465.
• Foster FG. Upper percentage points of the generalized Beta distribution. II. Biometrika. 1957;44:441–453. MR0090199.
• Foster FG. Upper percentage points of the generalized Beta distribution. III. Biometrika. 1958;45:492–503. MR0100366.
• Foster FG, Rees DH. Upper percentage points of the generalized Beta distribution. I. Biometrika. 1957;44:237–247. MR0086462.
• Golub GH, Van Loan CF. Matrix Computations. 3. Johns Hopkins Univ. Press; Baltimore: 1996. MR1417720.
• Heck DL. Charts of some upper percentage points of the distribution of the largest characteristic root. Ann Math Statist. 1960;31:625–642. MR0119301.
• Johnson RA, Wichern DW. Applied Multivariate Statistical Analysis. 6. Pearson Prentice Hall; Upper Saddle River, NJ: 2002.
• Johnstone IM. On the distribution of the largest eigenvalue in principal components analysis. Ann Statist. 2001;29:295–327. MR1863961.
• Johnstone IM. Multivariate analysis and Jacobi ensembles: Largest eigenvalue, Tracy–Widom limits and rates of convergence. Ann Statist. 2008;36:2638–2716. MR2485010. [PubMed]
• Johnstone IM, Chen WW. Finite sample accuracy of Tracy–Widom approximation for multivariate analysis. 2007 JSM Proceedings; Alexandria, VA: Amer. Statist. Assoc; 2007. pp. 1161–1166.
• Johnstone IM, Ma Z, Perry PO, Shahram M. RMTstat: Distributions, statistics and tests derived from random matrix theory. 2010 Manuscript in preparation.
• Koev P. Computing multivariate statistics. 2010 Manuscript in preparation.
• Koev P, Edelman A. The efficient evaluation of the hypergeometric function of a matrix argument. Math Comp. 2006;75:833–846. (electronic). MR2196994.
• Krishnaiah PR. Computations of some multivariate distributions. In: Krishnaiah PR, editor. Handbook of Statistics, Volume 1—Analysis of Variance. North-Holland; Amsterdam: 1980. pp. 745–971.
• Lutz JG. A Turbo Pascal unit for approximating the cumulative distribution function of Roy’s largest root criterion. Educational and Psychological Measurement. 1992;52:899–904.
• Lutz JG. Roy table: A program for generating tables of critical values for Roy’s largest root criterion. Educational and Psychological Measurement. 2000;60:644–647.
• Mardia KV, Kent JT, Bibby JM. Multivariate Analysis. Academic Press; London: 1979. MR0560319.
• Morrison DF. Multivariate Statistical Methods. 4. Thomson; Belmont, CA: 2005.
• Muirhead RJ. Aspects of Multivariate Statistical Theory. Wiley; New York: 1982. MR0652932.
• Nanda DN. Distribution of a root of a determinantal equation. Ann Math Statist. 1948;19:47–57. MR0024106.
• Nanda DN. Probability distribution tables of the largest root of a determinantal equation with two roots. J Indian Soc Agricultural Statist. 1951;3:175–177. MR0044788.
• Péché S. Universality results for largest eigenvalues of some sample covariance matrices ensembles. Probab Theory Related Fields. 2009;143:481–516.
• Pillai KCS. Some new test criteria in multivariate analysis. Ann Math Statist. 1955;26:117–121. MR0067429.
• Pillai KCS. On the distribution of the largest or smallest root of a matrix in multivariate analysis. Biometrika. 1956a;43:122–127. MR0077052.
• Pillai KCS. Some results useful in multivariate analysis. Ann Math Statist. 1956b;27:1106–1114. MR0081852.
• Pillai KCS. Concise Tables for Statisticians. The Statistical Center, Univ. of the Philippines; Manila: 1957.
• Pillai KCS. On the distribution of the largest characteristic root of a matrix in multivariate analysis. Biometrika. 1965;52:405–414. MR0205391.
• Pillai KCS. Upper percentage points of the largest root of a matrix in multivariate analysis. Biometrika. 1967;54:189–194. MR0215433. [PubMed]
• Pillai KCS, Bantegui CG. On the distribution of the largest of six roots of a matrix in multivariate analysis. Biometrika. 1959;46:237–240. MR0102152.
• Pillai KCS, Flury BN. Percentage points of the largest characteristic root of the multivariate beta matrix. Commun Statist Part A. 1984;13:2199–2237. MR0754832.
• Rencher AC. Methods of Multivariate Analysis. 2. Wiley; New York: 2002. MR1885894.
• Soshnikov A. A note on universality of the distribution of the largest eigenvalues in certain classes of sample covariance matrices. J Statist Phys. 2002;108:1033–1056. MR1933444.
• Tracy CA, Widom H. On orthogonal and symplectic matrix ensembles. Commun Math Phys. 1996;177:727–754. MR1385083.
• Waugh FV. Regressions between sets of variables. Econometrica. 1942;10:290–310.

 PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers.