The greatest root distribution occurs everywhere in classical multivariate analysis, but even under the null hypothesis the exact distribution has required extensive tables or special purpose software. We describe a simple approximation, based on the Tracy–Widom distribution, that in many cases can be used instead of tables or software, at least for initial screening. The quality of approximation is studied, and its use illustrated in a variety of settings.
The greatest root distribution is found everywhere in classical multivariate analysis. It describes the null hypothesis distribution for the union intersection test for any number of classical problems, including multiple response linear regression, MANOVA, canonical correlations, equality of covariance matrices and so on. However, the exact null distribution is difficult to calculate and work with, and so the use of extensive tables or special purpose software has always been necessary.
This paper describes a simple asymptotic approximation, based on the Tracy–Widom distribution. The approximation is not solely asymptotic; we argue that it is reasonably accurate over the entire range of the parameters. “Reasonably accurate” means, for example, less than ten percent relative error in the 95th percentile, even when working with two variables and any combination of error and hypothesis degrees of freedom.
This paper focuses on the approximation, its accuracy and its applicability to a range of problems in multivariate analysis. A companion paper [Johnstone (2008)] contains all proofs and additional discussion.
Our main claim is that for many applied purposes, the Tracy–Widom approximation can often, if not quite always, substitute for the elaborate tables and computational procedures that have until now been needed. Our hope is that this paper might facilitate the use of the approximation in applications in conjunction with appropriate software.
To briefly illustrate the Tracy–Widom approximation in action, we revisit the rootstock data, as discussed in Rencher (2002), pages 170–173. In a classical experiment carried out from 1918–1934, apple trees of different rootstocks were compared (Andrews and Herzberg [(1985), pages 357–360] has more detail). Rencher (2002) gives data for eight trees from each of six rootstocks. Four variables are measured for each tree: Girth4 = trunk girth at 4 years (mm), Growth4 = extension growth at 4 years (m), Girth15 = trunk girth at 15 years (mm), and Wt15 = weight of tree above ground at 15 years (lb).
A one-way multivariate analysis of variance can be used to examine the hypothesis of equality of the four-dimensional vectors of mean values corresponding to each of the six groups (rootstocks). The standard tests are based on the eigenvalues of (W + B)⁻¹B, where W and B are the sums of squares and products matrices within and between groups respectively. We focus here on the largest eigenvalue, with observed value θobs = 0.652. Critical values of the null distribution depend on parameters, here s = 4, m = 0, n = 18.5 [using (8) below, along with the conventions of Section 5.1 and the definition of θ]. Traditionally these are found by reference to tables or charts. Here, the 0.05 critical value is found, after manual interpolation in those tables, to be θ0.05 = 0.377. The approximation (6) of this paper yields an approximate 0.05 critical value that clearly serves just as well for rejection of the null hypothesis.
It is more difficult in standard packages to obtain p-values corresponding to θobs. The default is to use a lower bound based on the F distribution [see (12)], here pF(θobs) = 1.7 × 10⁻⁸, which is anti-conservative and several orders of magnitude below the Tracy–Widom approximation given in this paper at (11), pTW(θobs) = 5.6 × 10⁻⁵. The latter is much closer to the formally correct value,² p(θobs) = 3.7 × 10⁻⁶. When p-values are very small, typically only the order of magnitude is of interest. We suggest in Section 2.2 that the Tracy–Widom approximation generally comes close to the correct order of magnitude, whereas the default F bound is often off by several orders.
The rest of this introduction provides enough background to state the main Tracy–Widom approximation result. Section 2 focuses on the quality of the approximation by looking both at conventional percentiles and at very small p-values. The remaining Sections 3–6 describe some of the classical uses of the largest root test in multivariate analysis, in each case in enough detail to identify the parameters used. Some extra attention is paid in Section 6 to the multivariate linear model, in view of the wide variety of null hypotheses that can be considered.
Our setting is the distribution theory associated with sample draws from the multivariate normal distribution. For definiteness, we use the notation of Mardia, Kent and Bibby (1979), to which we also refer for much standard background material. Thus, if x1, …, xn denotes a random sample from Np(μ, Σ), a p-variate Gaussian distribution with mean μ and covariance matrix Σ, then we call the n × p matrix X = (x1, …, xn)′, whose ith row contains the ith sample p-vector, a normal data matrix.
A p × p matrix A that can be written A = X′X in terms of such a normal data matrix is said to have a Wishart distribution with scale matrix Σ and degrees of freedom parameter n, A ~ Wp(Σ, n). When p = 1, this reduces to a scaled chi-squared law, σ²χ²_n.
We consider analogs of the F and Beta distributions of multivariate analysis, which are based on two independent chi-squared variates. Thus, let A ~ Wp(Σ, m) be independent of B ~ Wp(Σ, n). If m ≥ p, then A−1 exists and the nonzero eigenvalues of A−1B are quantities of interest that generalize the univariate F ratio. We remark that the scale matrix Σ has no effect on the distribution of these eigenvalues, and so, without loss of generality, we can suppose that Σ = I.
The matrix analog of a Beta variate is based on the eigenvalues of (A + B)−1B, and leads to the following:
Let A ~ Wp(I, m) be independent of B ~ Wp(I, n), where m ≥ p. Then the largest eigenvalue θ of (A + B)−1B is called the greatest root statistic and its distribution is denoted θ (p, m, n).
Since A is positive definite, we have 0 < θ < 1. Clearly θ(p, m, n) can also be defined as the largest root of the determinantal equation det[B − θ(A + B)] = 0.
Specific examples will be given below, but in general the parameter p refers to dimension, m to the “error” degrees of freedom and n to the “hypothesis” degrees of freedom. Thus, m + n represents the “total” degrees of freedom.
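For intuition, θ(p, m, n) is easy to sample by direct Monte Carlo. The sketch below (function names are ours, not from any package mentioned in this paper) draws A and B as in the definition and solves the determinantal equation det[B − θ(A + B)] = 0 as a generalized symmetric eigenproblem.

```python
import numpy as np
from scipy.linalg import eigh

def wishart(p, df, rng):
    """A = X'X with X a df x p matrix of iid N(0,1) entries, so A ~ W_p(I, df)."""
    X = rng.standard_normal((df, p))
    return X.T @ X

def greatest_root(p, m, n, rng):
    """Largest root theta of det[B - theta (A + B)] = 0, A ~ W_p(I, m), B ~ W_p(I, n)."""
    A = wishart(p, m, rng)
    B = wishart(p, n, rng)
    # Generalized symmetric eigenproblem B v = theta (A + B) v; eigenvalues ascending.
    return eigh(B, A + B, eigvals_only=True)[-1]

# Monte Carlo draws from theta(4, 18, 5); all roots lie strictly in (0, 1).
rng = np.random.default_rng(0)
thetas = [greatest_root(4, 18, 5, rng) for _ in range(2000)]
```

Such draws are useful for spot-checking tabulated percentiles, though far too slow for the extreme tail probabilities discussed in Section 2.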
There are min(n, p) nonzero eigenvalues of A−1B or, equivalently, min(n, p) nonzero roots θ = (θi) of the determinantal equation above. The joint density function of these roots is given by
where Δ(θ) = Πi≠j |θi − θj| (see, e.g., Muirhead [(1982), page 112], or Anderson [(2003), pages 536–537]). We shall not need the explicit form of the density in this paper; it is, however, useful sometimes in matching up the various parameter choices used in different references and packages.
The greatest root distribution has the property that θ(p, m, n) is equal in distribution to θ(n, m + n − p, p), useful, in particular, in the case when n < p [e.g., Mardia, Kent and Bibby (1979), page 84].
Empirical and theoretical investigation has shown that it is useful to develop the approximation in terms of the logit transform of θ; thus, we define W = logit θ = log[θ/(1 − θ)].
Our main result, stated more formally below, is that with appropriate centering and scaling, W is approximately Tracy–Widom distributed: [W(p, m, n) − μ(p, m, n)]/σ(p, m, n) ≈ F1.
The centering and scaling parameters are defined by
where the angle parameters γ and φ are defined by
The F1 distribution, due to Tracy and Widom (1996) and plotted in Figure 1, has its origins in mathematical physics; see Tracy and Widom (1996) and Johnstone (2001) for further details. The density is asymmetric, with mean −1.21 and SD 1.27. Both tails have exponential decay, the left tail like exp(−|s|³/24) and the right tail like exp(−(2/3)s^{3/2}).
For the present paper, what is important is that the F1 distribution does not depend on any parameters, and the distribution itself, along with its inverse and percentiles, can be tabulated as univariate special functions. These functions play the same role in this paper as the standard normal distribution Φ, its inverse Φ−1 and percentiles zα play in traditional statistical application.
An R package RMTstat is available at CRAN (cran.r-project.org). It facilitates computation of the distributional approximations and largest root tests described in this paper, and the use of percentiles and random draws from the F1 distribution. Its scope and use are described in more detail in an accompanying report, Johnstone et al. (2010). A parallel MATLAB package is in development; it will also contain code to reproduce the figures and table in this paper.
Let fα denote the αth percentile of F1.
Then the αth percentile of θ(p, m, n) is given approximately by inverting the logit transform at W = μ + σfα, that is, by e^{μ+σfα}/(1 + e^{μ+σfα}).
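In code, once μ and σ from (9) and a percentile fα of F1 are in hand (for example via the RMTstat package), the approximate percentile of θ(p, m, n) follows by inverting the logit transform. A minimal sketch, with made-up illustrative values of μ, σ and fα (the helper name is ours):

```python
import math

def approx_theta_percentile(mu, sigma, f_alpha):
    """Invert W = logit(theta) at W = mu + sigma * f_alpha."""
    w = mu + sigma * f_alpha
    # Logistic function maps the logit scale back to (0, 1).
    return 1.0 / (1.0 + math.exp(-w))

# Illustrative values only; mu and sigma must come from (9), f_alpha from F1 tables.
theta = approx_theta_percentile(mu=-2.0, sigma=0.5, f_alpha=0.98)
```

The result always lies in (0, 1), as a percentile of θ must.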
The more formal statement of (3) goes as follows. Assume p, m and n → ∞ together in such a way that
For each s0, there exist c, C > 0 such that for s ≥ s0,
For the full proof and much more discussion and detail, see the companion paper [Johnstone (2008)].
If A and B are as in the definition of θ(p, m, n), then let θmin(p, m, n) denote the smallest eigenvalue of (A + B)⁻¹B. Its distribution is given by that of 1 − θ(p, n, m) (note the reversal of m and n!). In particular, the Tracy–Widom distribution will give a generally useful approximation to the lower tail of θmin(p, m, n).
There is an entirely analogous result when A and B follow complex Wishart distributions, with a modified limit distribution F2. Details are given in Johnstone (2008).
There is a substantial literature computing percentage points of the greatest root distribution for selected parameter values, partially reviewed below. The standard parameterization used in these tables arises from writing the joint density of the roots θi as
In terms of the table parameters and N = 2(s + m + n) + 1, the centering and scaling constants of the Tracy–Widom approximation are given by
We turn to the comparison of percentage points from the Tracy–Widom approximation (6) with the exact values θα for small values of the table parameters (s, m, n). The most extensive tabulations of θα (s, m, n) have been made by William Chen; he has graciously provided the author with the complete version of the tables excerpted in Chen (2002, 2003, 2004b, 2004a).
Figures 2 and 3 plot the Tracy–Widom approximate percentiles against the exact values θα at the 95th and 90th percentiles for s = 2. This is the smallest relevant value of s—otherwise we are in the univariate case covered by F distributions. The bottom panels, in particular, focus on the relative error of the approximation.
Figure 2 shows that even for s = 2, the 95th percentile of the TW approximation has a relative error of less than 1 in 20 except in the zone where both m ≤ 2 and n ≥ 10, where the relative error is still less than 1 in 10. Note that the relative error is always positive in sign, implying that the approximate critical points yield a conservative test. More extensive contour plots covering s = 2(1)6 and 90th, 95th and 99th percentiles may be found in Johnstone and Chen (2007).
There has been a large amount of work to prepare tables or charts for the null distribution of the largest root, much of which is reviewed in Chen (2003). We mention contributions by the following: Nanda (1948, 1951); Foster and Rees (1957); Foster (1957, 1958); Pillai (1955, 1956b, 1956a, 1957a, 1965a, 1967); Pillai and Bantegui (1959); Heck (1960); Krishnaiah (1980); Pillai and Flury (1984); Chen (2002, 2003, 2004b, 2004a).
Constantine (1963) expresses the c.d.f. of the largest root distribution in terms of a matrix hypergeometric function. Koev and Edelman (2006) have developed efficient algorithms (and a MATLAB package available at http://www-math.mit.edu/~plamen) for the evaluation of such matrix hypergeometric functions using recursion formulas from group representation theory.
Koev (2010) collects useful formulas and explains how to use them and mhg to compute the exact c.d.f. and percentiles for the largest root distribution over a range of values of the “MKB” parameters corresponding to m, n, p ≤ 17, and m, n, p ≤ 40 when n − p is odd.
We recall the assumption that A ~ Wp(I, m) is distributed independently of B ~ Wp(I, n), and the characterization of the largest eigenvalue λ1 of A⁻¹B given by λ1 = max over u ≠ 0 of u′Bu/u′Au.
For fixed u of unit length, the numerator and denominator are distributed as independent χ²_n and χ²_m variates respectively, and so, again for fixed u, the ratio (u′Bu/n)/(u′Au/m) has an Fn,m distribution. Consequently, we have the simple bound P(λ1 ≥ x) ≥ P(Fn,m ≥ (m/n)x).
Using the Fn,m distribution in place of the actual greatest root law yields a lower bound for the significance level, or p-value. We shall see that this bound can be anti-conservative by several orders of magnitude, leading to overstatements of the empirical evidence against the null hypothesis. And furthermore, one can expect that the higher the dimension p of the search space in (10), the worse the bound provided by the F distribution.
The default p-value provided in both SAS and R (through package car) uses this unsatisfactory distribution bound.
Table 1 attempts to capture a variety of scenarios within the computational range of Koev’s software.
Column Exact shows a range of significance levels α covering several orders of magnitude. Column Largest Root shows the corresponding quantiles θα of the largest root distribution, for the given values of (s, m, n)—these are computed using Koev’s MATLAB routine qmaxeigjacobi. Thus, an observed value of θ (s, m, n) = θα would correspond to an exact p-value α.
The remaining columns compare the Tracy–Widom approximation and the F bound. The p-value obtained from the Tracy–Widom approximation is given by pTW(θ) = 1 − F1([W(θ) − μ]/σ), where W(θ) = logit θ and μ and σ are computed from (9).
The F bound on the p-value is given by pF(θ) = P(Fν1,ν2 ≥ (ν2/ν1)·θ/(1 − θ)), where ν1 = s + 2m + 1 and ν2 = s + 2n + 1 denote the hypothesis and error degrees of freedom respectively.
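As a sketch (the function name is ours), the F bound can be computed with standard software by rescaling θ/(1 − θ) by the degrees of freedom, under the usual correspondence between the Beta-type variable θ and an F ratio:

```python
from scipy.stats import f

def f_bound_pvalue(theta, s, m, n):
    """F lower bound on the greatest root p-value, in table parameters (s, m, n).
    nu1 = s + 2m + 1 hypothesis df, nu2 = s + 2n + 1 error df; theta/(1 - theta)
    plays the role of SSH/SSE and is rescaled by the df ratio to an F variate."""
    nu1 = s + 2 * m + 1
    nu2 = s + 2 * n + 1
    F = (nu2 / nu1) * theta / (1 - theta)
    return f.sf(F, nu1, nu2)          # survival function: P(F_{nu1,nu2} >= F)

# Rootstock example from the introduction: theta_obs = 0.652, (s, m, n) = (4, 0, 18.5).
p_F = f_bound_pvalue(0.652, s=4, m=0, n=18.5)
# Anti-conservative: several orders of magnitude below the exact p-value.
```

The non-integer n = 18.5 poses no difficulty here, since the F distribution accepts fractional degrees of freedom.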
The two tables consider s = 2 and 6 variables respectively. The values of m = −0.5 and 5 correspond to s and s + 11 hypothesis degrees of freedom, while the values of n = 2 and 10 translate to s + 5 and s + 21 error degrees of freedom respectively.
At the 10% and 5% levels, the Tracy–Widom approximation is within 20% of the true p-value at s = 6, and within 35% of truth at s = 2. The F-value is wrong by a factor of four or more at s = 2, and by three orders of magnitude at s = 6. At smaller significance levels, the Tracy–Widom approximation generally stays within one order of magnitude of the correct p-value—except at (s, m, n) = (2, −0.5, 10). The F approximation is off by many orders of magnitude when s = 6.
In addition, we note that the Tracy–Widom approximation is conservative in nearly all cases, the exception being for θ ≥ 0.985 in the case (s, m, n) = (6, −0.5, 2). In contrast, the F approximation is always [cf. (12)] anti-conservative, often badly so.
In applications one is often concerned only with the general order of magnitude of the p-values associated with tests of the various hypotheses that are entertained—not least because the assumptions of the underlying model are at best approximately true. For this purpose, then, it may be argued that the TW approximate p-value is often quite adequate over the range of (s, m, n) values. Of course, if (s, m, n) is not too large and greater precision is required, then exact p-values can be sought, using, for example, SAS or Koev’s software.
Let x1, …, xn be a random sample from Np(μ, Σ). Partition the variables into two sets with dimensions p1 and p2 respectively, p1 + p2 = p. Suppose that Σ and the sample covariance matrix S are partitioned correspondingly. We consider testing the null hypothesis of independence of the two sets of variables: Σ12 = 0. The union-intersection test is based on the largest eigenvalue λ1 of S22⁻¹S21S11⁻¹S12 [Mardia, Kent and Bibby (1979), page 136], and under H0, λ1 has the greatest root distribution θ(p2, n − 1 − p1, p1). Mardia, Kent and Bibby (1979) consider an example testing the independence of head length and breadth measurements between first sons and second sons, with n = 25 and p1 = p2 = 2. The observed value exceeds the critical value θ0.05 = 0.330 found by interpolation from the tables. The Tracy–Widom approximation found from (6) serves equally well for rejection of H0 in this case.
Again we have two sets of variables, an x-set with q variables and a y-set with p variables. The goal is to find maximally correlated linear combinations η = a′x and φ = b′y. We suppose that (X, Y) is a data matrix of n samples (rows) on q + p variables (columns) such that each row is an independent draw from Np+q(μ, Σ). Again let S be the sample covariance matrix, partitioned into blocks S11, S12, S21, S22 according to the two sets. The sample squared canonical correlations ri², for i = 1, …, k = min(p, q), are found as the eigenvalues of M2 = S22⁻¹S21S11⁻¹S12 [Mardia, Kent and Bibby (1979), Sections 10.2.1 and 10.2.2]. The population squared canonical correlations ρi² are, in turn, the eigenvalues of Σ22⁻¹Σ21Σ11⁻¹Σ12. In both cases, we assume that the correlations are arranged in decreasing order. The test of the null hypothesis of zero correlation, H0: ρ1 = ··· = ρk = 0, is based on the largest eigenvalue r1² of M2. Under H0, it is known that r1² has the θ(p, n − q − 1, q) distribution, so that the Tracy–Widom approximation can be applied.
Often it may be apparent that the first s canonical correlations are nonzero and the main interest focuses on the significance of rs+1, etc. We let Hs denote the null hypothesis that ρs+1 = ··· = ρp = 0, and write L(rk|p, q, n; Σ) for the distribution of the kth canonical correlation under population covariance matrix Σ. When the covariance matrix Σ ∈ Hs, the (s + 1)st canonical correlation is stochastically smaller than the largest canonical correlation in a related null model:
If Σ ∈ Hs, then rs+1 is stochastically smaller than L(r1|p, q − s, n; I).
This nonasymptotic result follows from interlacing properties of the singular value decomposition (Appendix). Since the square of the latter is given by the null distribution θ(p, n + s − q − 1, q − s), we may use it to provide a conservative p-value for testing Hs. In turn, the p-value for θ(p, n + s − q − 1, q − s) can be numerically approximated as in (6) using the Tracy–Widom distribution.
Waugh (1942) gave perhaps the first significant illustration of CCA, using data on n = 138 samples of Canadian Hard Red Spring wheat and the flour made from each of these samples. The aim was to seek highly correlated indices a′x of wheat quality and b′y of flour quality, since a well correlated grading of raw (wheat) and finished (flour) products was believed to promote fair pricing of each. In all, q = 5 wheat characteristics—kernel texture, test weight, damaged kernels, foreign material, crude protein in wheat—and p = 4 flour characteristics—wheat per bushel of flour, ash in flour, crude protein in flour, gluten quality index—were measured. The leading squared canonical correlation would seem clearly significant, and indeed this is confirmed by our approximate formula (6).
To assess the second correlation r2, we appeal to the conservative test discussed above, based on the null distribution with q − 1 = 4, p = 4 and n = 138. The Tracy–Widom approximation again yields a very small p-value, which strongly suggests that this second correlation is significant as well.
Marginal histograms naturally reveal some departures from symmetric Gaussian tails, but they do not seem extreme enough to invalidate the conclusions, which are also confirmed by permutation tests.
Suppose that we have k populations with independent data matrices Xi consisting of ni observations drawn from Np(μi, Σ), and put n = Σni. This is the one-way multivariate analysis of variance illustrated in Example 1.1. For testing the null hypothesis of equality of means H0: μ1 = ··· = μk, we form, for each population, the sample mean x̄i and covariance matrix Si, normalized so that niSi ~ Wp(Σ, ni − 1). The basic quantities are the within-groups sum of squares W = Σ niSi ~ Wp(Σ, n − k) and the between-groups sum of squares B = Σ ni(x̄i − x̄)(x̄i − x̄)′ ~ Wp(Σ, k − 1) under H0, independently of W. The union-intersection test of H0 uses the largest root of W⁻¹B or, equivalently, that of (W + B)⁻¹B, and the latter has, under H0, the θ(p, n − k, k − 1) distribution.
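The one-way MANOVA computation can be sketched as follows (function and variable names are ours); note that Si here is the covariance with divisor ni, matching the normalization niSi ~ Wp(Σ, ni − 1), so the within-groups matrix is just the pooled centered cross-product:

```python
import numpy as np
from scipy.linalg import eigh

def manova_greatest_root(groups):
    """groups: list of (n_i x p) data arrays. Largest eigenvalue of (W+B)^{-1} B."""
    means = [g.mean(axis=0) for g in groups]
    grand = np.vstack(groups).mean(axis=0)
    # Within-groups SSP: W = sum_i n_i S_i = sum_i (X_i - xbar_i)'(X_i - xbar_i).
    W = sum((g - mi).T @ (g - mi) for g, mi in zip(groups, means))
    # Between-groups SSP: B = sum_i n_i (xbar_i - xbar)(xbar_i - xbar)'.
    B = sum(len(g) * np.outer(mi - grand, mi - grand) for g, mi in zip(groups, means))
    # Largest root of det[B - theta (W + B)] = 0, i.e. of (W+B)^{-1} B.
    return eigh(B, W + B, eigvals_only=True)[-1]

# Toy null data shaped like the rootstock study: k = 6 groups of 8 trees, p = 4.
rng = np.random.default_rng(1)
groups = [rng.standard_normal((8, 4)) for _ in range(6)]
theta = manova_greatest_root(groups)
```

Under H0 the statistic computed here follows θ(p, n − k, k − 1), here θ(4, 42, 5).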
Suppose that independent samples from two normal distributions Np(μ1, Σ1) and Np(μ2, Σ2) lead to covariance estimates Si which are independent and Wishart distributed on ni degrees of freedom: niSi ~ Wp(Σi, ni) for i = 1, 2. Then the largest root test of the null hypothesis H0: Σ1 = Σ2 is based on the largest eigenvalue θ of (n1S1 + n2S2)⁻¹n2S2, which under H0 has the θ(p, n1, n2) distribution [Muirhead (1982), page 332].
The multivariate linear model blends ideas well known from the univariate setting with new elements introduced by correlated multiple responses. In view of the breadth of models covered, and the variety of notation in the literature and in the software, we review the setting in a little more detail, beginning with the familiar model for a single response, y = Xβ + u.
Here y is an n × 1 column vector of observations on a response variable, X is an n × q model matrix, and u is an n × 1 column vector of errors, assumed here to be independent and identically distributed as N(0, σ²). The q × 1 vector β of unknown parameters has the least squares estimate—when X has full rank—given by β̂ = (X′X)⁻¹X′y.
The error sum of squares is SSE = (y − Xβ̂)′(y − Xβ̂) = y′Py, where P denotes the orthogonal projection onto the subspace orthogonal to the columns of X; P has rank n − q, and so SSE ~ σ²χ²_{n−q}.
Consider the linear hypothesis H0: C1β = 0, where C1 is a g × q matrix of rank g. In the simplest example, C1 = [Ig 0] extracts the first g elements of β; more generally, the rows of C1 are often contrasts among the components of β. To describe the standard F-test of H0, let C2 be any (q − g) × q matrix such that C = (C1′, C2′)′ is an invertible q × q matrix. We may then write Xβ = XC⁻¹Cβ = XC(1)C1β + XC(2)C2β, where we have partitioned C⁻¹ = [C(1) C(2)] into blocks with g and q − g columns respectively.
Let P1 denote the orthogonal projection onto the subspace orthogonal to the columns of XC(2). We have the sum of squares decomposition y′P1y = y′Py + y′P2y, and the hypothesis sum of squares for testing H0: C1β = 0 is given by SSH = y′P2y, with P2 = P1 − P. The projection P2 has rank g, and so under H0, SSH ~ σ²χ²_g. The projections P and P2 are orthogonal, and so the sums of squares have independent chi-squared distributions; under H0 the traditional F-statistic F = (SSH/g)/(SSE/(n − q)) follows the F distribution on g and n − q degrees of freedom.
Explicit expressions for the sums of squares are given by
In the multivariate linear model, Y = XB + U, the single response y is replaced by p response vectors, organized as the columns of the n × p matrix Y. The model (or design) matrix X remains the same for each response; however, there are separate vectors of unknown coefficients and errors for each response, organized into a q × p matrix B of regression coefficients and an n × p matrix U of errors. The multivariate aspect of the model is the assumption that the rows of U are independent, with multivariate normal distribution having mean 0 and common covariance matrix Σ. Thus, U is a normal data matrix of n samples from Np(0, Σ). Assuming for now that the model matrix X has full rank, the least squares estimator is B̂ = (X′X)⁻¹X′Y.
The linear hypothesis becomes H0: C1B = 0.
The sums of squares of the univariate case are replaced by hypothesis and error sums of squares and products matrices, H = Y′P2Y and E = Y′PY, in which the univariate vector y is simply replaced by its multivariate analog Y. It follows that E ~ Wp(Σ, n − q) and that under H0, H ~ Wp(Σ, g); furthermore, E and H are independent. Generalizations of the F-test are obtained from the eigenvalues (λi) of the matrix E⁻¹H or, equivalently, the eigenvalues θi of (H + E)⁻¹H.
Thus, under the null hypothesis C1B = 0, Roy’s maximum root statistic θ1 has null distribution θ(p, n − q, g).
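For the full-rank case with C1 = [Ig 0], the matrices H and E can be computed by fitting the full and reduced models, since then SSH is the extra residual sum of squares and products from dropping the first g columns of X. A sketch (names ours):

```python
import numpy as np
from scipy.linalg import eigh, lstsq

def roys_root(Y, X, g):
    """Roy's maximum root for H0: first g rows of B are zero, X of full rank."""
    n, q = X.shape
    Bhat, *_ = lstsq(X, Y)                 # least squares coefficients (q x p)
    resid = Y - X @ Bhat
    E = resid.T @ resid                    # error SSP matrix, ~ W_p(Sigma, n - q)
    # Reduced model: keep only the last q - g columns of X; the extra residual
    # SSP is the hypothesis matrix H = Y' P2 Y (~ W_p(Sigma, g) under H0).
    X2 = X[:, g:]
    B2, *_ = lstsq(X2, Y)
    resid2 = Y - X2 @ B2
    H = resid2.T @ resid2 - E
    # Largest eigenvalue of (H + E)^{-1} H, via a generalized eigenproblem.
    return eigh(H, H + E, eigvals_only=True)[-1]

# Toy data under H0: n = 40, q = 5 predictors, p = 3 responses, test g = 2 rows.
rng = np.random.default_rng(2)
X = rng.standard_normal((40, 5))
Y = rng.standard_normal((40, 3))
theta1 = roys_root(Y, X, g=2)              # null distribution theta(3, 35, 2)
```

For a general testable C1, one would instead project as in the P1, P2 construction above; the reduced-model shortcut shown here is specific to C1 = [Ig 0].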
The model matrix X may fail to be of full rank; this situation routinely occurs when redundant parameterizations are used, for example, when dealing with factors in analysis of variance models. One approach (e.g., MKB, Section 6.4) is to rearrange the columns of X and partition X = [X1 X2] so that X1 has full rank. We must also assume that the matrix C1 is testable in the sense that, as a function of B, XB = 0 implies C1B = 0. In such cases, if we partition C1 = [C11 C12] conformally with X, then the hypothesis matrix is determined from C11.
A straightforward extension is possible in order to test null hypotheses of the form H0: C1BM1 = 0, where M1 is p × r of rank r. The columns of M1 capture particular linear combinations of the dependent variables—for an example, see, e.g., Morrison (2005), Chapter 3.6.
We simply consider the modified linear model YM1 = X(BM1) + UM1.
An important point is that the rows of UM1 are still independent, now distributed as Nr(0, M1′ΣM1). So we may simply apply the above analysis, replacing Y, U and B by YM1, UM1 and BM1 respectively. In particular, the greatest root statistic now has null distribution θ(r, n − q, g).
Analyses involving the four multivariate tests are provided in a number of SAS routines, such as GLM and CANCORR. The parameterization used here can be translated into that used in SAS by means of the documentation given in the SAS/STAT User’s Guide—we refer to the section on Multivariate Tests in version 9.1, page 48 ff. The linear hypotheses correspond to MKB notation via
while the parameters of the greatest root distribution are given by
(For example, the error degrees of freedom n − q correspond to the SAS parameter v.)
(Note: we use sans serif font for the SAS parameters!) Finally, the SAS printouts use the following parameters:
We have described the Tracy–Widom approximation to the null distribution for the largest root test for a variety of classical multivariate procedures. These procedures exhibit varying degrees of sensitivity to the assumption of normality, independence etc. Documenting the sensitivity/robustness of the T–W approximation is clearly an important issue for further work. Two brief remarks can be made. In the corresponding single Wishart setting [e.g., Johnstone (2001)], the largest eigenvalue can be shown, under the null distribution, to still have the T–W limit if the original data have “light tails” (i.e., sub-Gaussian) [see Soshnikov (2002); Péché (2009)]. In the double Wishart settings, simulations for canonical correlation analysis with n = 100 samples on q = 20 and p = 10 variables, each following i.i.d. t(5) or i.i.d. random sign distributions, showed that the T–W distribution for the leading correlation still holds in the central 99% of the distribution.
William Chen graciously provided an electronic copy of his tables for the distribution of the largest root.
If Σ ∈ Hs, there are at most s nonzero canonical correlations, and we may suppose without loss of generality that the q x-variables have been transformed so that only the last s of them have any correlation with Y. We employ the singular value decomposition (SVD) description of CCA; cf. Golub and Van Loan (1996), Section 12.4.3. Using QR decompositions, write X = QXRX and Y = QYRY.
Let C = QX′QY and form the SVD C = URVᵀ. Then the diagonal elements r1 ≥ r2 ≥ ··· ≥ rmin(p,q) of R contain the sample canonical correlations.
Now consider the reduced n × (q − s) matrix X− obtained by dropping the last s columns from X. Form the QR decomposition X− = QX−RX−. From the nature of the decomposition, we have QX = [QX− Q+], that is, QX− comprises the first q − s columns of QX. Consequently, C− = QX−′QY forms the first q − s rows of C. Our lemma now follows from the interlacing property of singular values [e.g., Golub and Van Loan (1996), Corollary 8.6.3].
Indeed, our earlier discussion implies that X− and Y are independent, and so σ1(C−) has the null distribution L(r1|p, q − s, n; I).
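The QR-plus-SVD computation of the sample canonical correlations described above can be sketched as follows (function name ours; we column-center the data first, as is standard for sample CCA):

```python
import numpy as np

def sample_canonical_correlations(X, Y):
    """Sample canonical correlations as the singular values of C = Qx' Qy,
    where Qx, Qy come from QR decompositions of the centered data matrices."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    C = Qx.T @ Qy
    # Singular values are returned in decreasing order: r1 >= ... >= r_min(p,q).
    return np.linalg.svd(C, compute_uv=False)

# Toy data with the shape of Waugh's study: n = 138, q = 5 x-variables, p = 4.
rng = np.random.default_rng(3)
X = rng.standard_normal((138, 5))
Y = rng.standard_normal((138, 4))
r = sample_canonical_correlations(X, Y)
```

Because Qx and Qy have orthonormal columns, all singular values of C lie in [0, 1], as correlations must.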
1Supported in part by NSF DMS-05-05303, 09-06812 and NIH RO1 EB 001988.
2This (actually approximate) value is obtained by interpolation from Koev’s function pmaxeigjacobi which only handles integer values of n.