Article sections

- Abstract
- 1 Introduction and Assessment
- 2 Some Initial Results
- 3 High Dimensional Extensions
- 4 Increasing Power
- 5 Discussion
- References


Ann Appl Stat. Author manuscript; available in PMC 2010 June 22.

Published in final edited form as:

Ann Appl Stat. 2009 January 1; 3(4): 1266–1269.

doi: 10.1214/09-AOAS312 | PMCID: PMC2889501

NIHMSID: NIHMS195766


We discuss briefly the very interesting concept of Brownian distance covariance developed by Székely and Rizzo (2009) and describe two possible extensions. The first extension is for high dimensional data that can be coerced into a Hilbert space, including certain high throughput screening and functional data settings. The second extension involves very simple modifications that may yield increased power in some settings. We commend Székely and Rizzo for their very interesting work and recognize that this general idea has the potential to have a large impact on the way in which statisticians evaluate dependency in data.

The Brownian distance covariance and correlation proposed by Székely and Rizzo (2009) (abbreviated SR hereafter) is a very useful and elegant alternative to the standard measures of correlation and is based on several deep and non-trivial theoretical calculations developed earlier in Székely, Rizzo and Bakirov (2007) (abbreviated SRB hereafter). We congratulate the group on this very original and elegant work. The main result is that a single, simple statistic $V_n(X, Y)$ can be used to assess whether two random vectors *X* and *Y*, of possibly different respective dimensions *p* and *q*, are dependent based on an i.i.d. sample.

The proposed statistic $V_n(X, Y)$ estimates an interesting population parameter $V_0(X, Y)$ that the authors demonstrate can also be expressed as the covariance between independent Brownian motions *W* and *W*′, with *p* and *q* dimensional indices, evaluated at *X* and *Y*, respectively. Specifically, let $W:\mathbb{R}^p \mapsto \mathbb{R}$ be a real-valued, tight, mean-zero Gaussian process with covariance function $\mathrm{Cov}(W(s), W(t)) = |s|_p + |t|_p - |s - t|_p$, the covariance of Brownian motion indexed by $\mathbb{R}^p$.

By replacing Brownian motion with other stochastic processes, a very wide array of alternative forms of correlation between vectors *X* and *Y* can be generated. In the special case where *p* = *q* = 1 and the stochastic processes *W* and *W*′ are the non-random identity functions centered respectively at *E*(*X*) and *E*(*Y*), $V_0(X, Y) = E[W(X)W(X')W'(Y)W'(Y')] = \mathrm{Cov}^2(X, Y)$, which is the square of the standard Pearson product-moment covariance. Thus the results obtained by SR not only have a profound connection to Brownian motion, but also include traditional measures of dependence as special cases, while at the same time having the potential to generate many useful new measures of dependence through the use of stochastic processes other than Brownian motion. This raises the very real possibility that a broadly applicable and unified theoretical and methodological framework for testing dependence could be developed.
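This special case is easy to check numerically. The following sketch (our illustration, not code from SR) draws correlated pairs, takes *W* and *W*′ to be the centered identity functions, and compares a Monte Carlo estimate of $E[W(X)W(X')W'(Y)W'(Y')]$ with $\mathrm{Cov}^2(X, Y)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
X = rng.normal(size=n)
Y = 0.6 * X + 0.8 * rng.normal(size=n)  # Cov(X, Y) = 0.6, so Cov^2 = 0.36

# Centered identity "processes": W(x) = x - E(X), W'(y) = y - E(Y)
W = X - X.mean()
Wp = Y - Y.mean()

# Pair up independent copies (X, Y) and (X', Y') by splitting the sample
h = n // 2
est = np.mean(W[:h] * W[h:] * Wp[:h] * Wp[h:])
print(est)  # should be close to 0.36
```

The two halves of the sample play the roles of the independent copies $(X, Y)$ and $(X', Y')$ in the expectation.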

The SR paper is therefore not only important for the specific results contained therein but also for the possibly far-reaching consequences for future statistical research in both theory and applications. For the remainder of the paper, we describe two possible extensions of these results. The first extension is for high dimensional data that can be coerced into a Hilbert space, including certain high throughput screening and functional data settings. The second extension involves very simple modifications that may yield increased power in some settings. We first present some initial results and consequences of SR and SRB that will prove useful in later developments. We then present the Hilbert space extension with a few example applications. Some modifications leading to potential variations in power will then be described. The paper will then conclude with a brief discussion.

We now present a few initial results which will be useful in later sections. For a paired sample of size *n*, $(X_1, Y_1), \ldots, (X_n, Y_n)$, of realizations of $(X, Y)$ taking values in spaces with norms $\|\cdot\|_X$ and $\|\cdot\|_Y$, define

$$\begin{aligned}
T_1 &= \frac{1}{n^2}\sum_{k,l=1}^{n}\|X_k - X_l\|_X\,\|Y_k - Y_l\|_Y,\\
T_2 &= \frac{1}{n^2}\sum_{k,l=1}^{n}\|X_k - X_l\|_X \times \frac{1}{n^2}\sum_{k,l=1}^{n}\|Y_k - Y_l\|_Y,\\
T_3 &= \frac{1}{n^3}\sum_{k=1}^{n}\sum_{l,m=1}^{n}\|X_k - X_l\|_X\,\|Y_k - Y_m\|_Y,
\end{aligned}$$

and $V_n(X, Y) = T_1 + T_2 - 2T_3$. For the population versions, let $(X_1, Y_1), (X_2, Y_2), (X_3, Y_3)$ be independent copies of $(X, Y)$ and define

$$\begin{aligned}
T_{10} &= E\bigl[\|X_1 - X_2\|_X\,\|Y_1 - Y_2\|_Y\bigr],\\
T_{20} &= E\bigl[\|X_1 - X_2\|_X\bigr] \times E\bigl[\|Y_1 - Y_2\|_Y\bigr],\\
T_{30} &= E\bigl[\|X_1 - X_2\|_X\,\|Y_1 - Y_3\|_Y\bigr],
\end{aligned}$$

and $V_0(X, Y) = T_{10} + T_{20} - 2T_{30}$. Also let $V_n(X) = V_n(X, X)$, $V_n(Y) = V_n(Y, Y)$, $V_0(X) = V_0(X, X)$ and $V_0(Y) = V_0(Y, Y)$, and define the correlations $R_n(X, Y) = V_n(X, Y)/\sqrt{V_n(X)V_n(Y)}$ and $R_0(X, Y) = V_0(X, Y)/\sqrt{V_0(X)V_0(Y)}$, each taken to be zero when the corresponding denominator is zero.
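For concreteness, the statistics $T_1$, $T_2$, $T_3$ and the resulting $V_n$ and $R_n$ can be computed directly from the two pairwise distance matrices. The sketch below is our own illustration (the function names are ours), using Euclidean norms:

```python
import numpy as np

def dist_matrix(Z):
    """n x n matrix of Euclidean distances ||Z_k - Z_l||."""
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def V_n(X, Y):
    """Sample statistic V_n(X, Y) = T1 + T2 - 2*T3."""
    n = len(X)
    a, b = dist_matrix(X), dist_matrix(Y)
    T1 = (a * b).sum() / n**2
    T2 = (a.sum() / n**2) * (b.sum() / n**2)
    # T3: sum over k of (row sum of a at k) * (row sum of b at k) / n^3
    T3 = (a.sum(axis=1) * b.sum(axis=1)).sum() / n**3
    return T1 + T2 - 2 * T3

def R_n(X, Y):
    """R_n(X, Y) = V_n(X, Y) / sqrt(V_n(X) V_n(Y)), with V_n(X) = V_n(X, X)."""
    vx, vy = V_n(X, X), V_n(Y, Y)
    return V_n(X, Y) / np.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0
```

On simulated data these reproduce the range properties recorded in Remark 1, e.g. $V_n(X, Y) \ge 0$ and $0 \le R_n(X, Y) \le 1$.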

Because this has a standard U-statistic structure, we have the following general result, the proof of which follows from standard theory for U-statistics (see, e.g., Chapter 12 of van der Vaart, 1998):

Provided $E\|X\|_X^4 < \infty$ and $E\|Y\|_Y^4 < \infty$, then $V_n(X, Y) \stackrel{P}{\to} V_0(X, Y)$, $V_n(X) \stackrel{P}{\to} V_0(X)$ and $V_n(Y) \stackrel{P}{\to} V_0(Y)$.

In the special case where $X$ and $Y$ are from finite-dimensional Euclidean spaces, we know from Theorems 1–4 of SR that $V_n(X, Y)$, $V_n(X)$, $V_n(Y)$, $V_0(X, Y)$, $V_0(X)$ and $V_0(Y)$ are all non-negative; that $V_n(X, Y) \le \sqrt{V_n(X)V_n(Y)}$ and $V_0(X, Y) \le \sqrt{V_0(X)V_0(Y)}$; that $V_0(X) = 0$ or $V_0(Y) = 0$ only when $X$ or $Y$ is trivial; that $V_n(X) = 0$ or $V_n(Y) = 0$ only when the $X$'s or $Y$'s in the sample are all identical; that $0 \le R_n(X, Y), R_0(X, Y) \le 1$; and that $V_0(X, Y) = 0$ only when $X$ and $Y$ are independent.

We now wish to generalize the above results in the finite-dimensional context to a broader class of norms than the Euclidean norms. These results will be useful for later sections. Let *A* and *B* be respectively *p* × *p* and *q* × *q* symmetric, positive definite matrices. Let a “tilde” placed over $T_1$, $T_2$, $T_3$, $V_n$, $V_0$, $R_n$ and $R_0$ denote the corresponding quantities computed with the modified norms $\|x\|_X = \|A^{1/2}x\|_p$ and $\|y\|_Y = \|B^{1/2}y\|_q$ in place of the usual Euclidean norms $\|\cdot\|_p$ and $\|\cdot\|_q$.

Let $A$ and $B$ be symmetric and positive definite. Then $\tilde{V}_n(X, Y)$, $\tilde{V}_n(X)$, $\tilde{V}_n(Y)$, $\tilde{V}_0(X, Y)$, $\tilde{V}_0(X)$ and $\tilde{V}_0(Y)$ are all non-negative; and all of the other results in Remark 1 remain true with a “tilde” placed over the given quantities. Moreover, $\tilde{V}_0(X, Y) = 0$ if and only if $V_0(X, Y) = 0$.

For a symmetric, positive definite matrix *C*, let $C^{1/2}$ denote the symmetric square root of *C*, i.e., $C^{1/2}C^{1/2} = C$. Note that such a square root always exists and, moreover, is always positive definite. Now define $U = A^{1/2}X$ and $V = B^{1/2}Y$, and note that $\|U\|_p = \|A^{1/2}X\|_p$ and $\|V\|_q = \|B^{1/2}Y\|_q$, so the tilde quantities computed from $(X, Y)$ coincide with the corresponding untilded quantities computed from $(U, V)$. Since $A^{1/2}$ and $B^{1/2}$ are invertible, $U$ and $V$ are independent if and only if $X$ and $Y$ are independent, and the conclusions follow by applying Remark 1 to $(U, V)$.
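The change of variables in this proof can be checked numerically: pairwise distances computed under the modified norm $\|x\| = \sqrt{x^T A x}$ coincide with Euclidean distances between the transformed points $U = A^{1/2}X$. A sketch (ours, with an arbitrarily chosen positive definite $A$):

```python
import numpy as np

def sym_sqrt(C):
    """Symmetric square root of a symmetric positive definite matrix."""
    w, Q = np.linalg.eigh(C)
    return (Q * np.sqrt(w)) @ Q.T

rng = np.random.default_rng(2)
p = 4
G = rng.normal(size=(p, p))
A = G @ G.T + p * np.eye(p)       # symmetric positive definite
X = rng.normal(size=(30, p))      # 30 sample points in R^p

A_half = sym_sqrt(A)
U = X @ A_half                    # rows are A^{1/2} X_k (A_half is symmetric)

# Distances under the A-norm: sqrt((x - x')^T A (x - x'))
d = X[:, None, :] - X[None, :, :]
dist_A = np.sqrt(np.einsum('klp,pq,klq->kl', d, A, d))

# Euclidean distances between the transformed points
dU = U[:, None, :] - U[None, :, :]
dist_U = np.sqrt((dU ** 2).sum(axis=2))
```

Since every distance matrix entry agrees, every tilde statistic built from those distances agrees as well.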

The third initial result involves some non-trivial properties of independent components in the finite-dimensional setting. Suppose for $X \in \mathbb{R}^p$ and $Y \in \mathbb{R}^q$, with $p = p_1 + p_2$ and $q = q_1 + q_2$, that

$$X = \begin{pmatrix} X^{(1)} + X^{(2)} \\ X^{(3)} \end{pmatrix}, \quad\text{and}\quad Y = \begin{pmatrix} Y^{(1)} + Y^{(2)} \\ Y^{(3)} \end{pmatrix},$$

where $X^{(1)}, X^{(2)} \in \mathbb{R}^{p_1}$, $X^{(3)} \in \mathbb{R}^{p_2}$, $Y^{(1)}, Y^{(2)} \in \mathbb{R}^{q_1}$, $Y^{(3)} \in \mathbb{R}^{q_2}$; and suppose also that the two vectors $\tilde{X} = ([X^{(2)}]^T, [X^{(3)}]^T)^T$ and $\tilde{Y} = ([Y^{(2)}]^T, [Y^{(3)}]^T)^T$ are independent of each other and of $(X^{(1)}, Y^{(1)})$. Then

$V_0(X, Y) = V_0(X^{(1)}, Y^{(1)})$.

For any $t \in \mathbb{R}^p$ and $s \in \mathbb{R}^q$, write $t = (t_1^T, t_2^T)^T$ and $s = (s_1^T, s_2^T)^T$, where $t_1 \in \mathbb{R}^{p_1}$ and $s_1 \in \mathbb{R}^{q_1}$, and let $f$ with an appropriate subscript denote the characteristic function of the subscripted random quantity. The assumed independence yields

$$\begin{aligned}
&\bigl|E\exp(i[t^T X + s^T Y]) - E\exp(it^T X)\,E\exp(is^T Y)\bigr|\\
&\quad= \Bigl|f_{\tilde{X}}(t)f_{\tilde{Y}}(s)\Bigl\{E\exp(i[t_1^T X^{(1)} + s_1^T Y^{(1)}]) - E\exp(it_1^T X^{(1)})\,E\exp(is_1^T Y^{(1)})\Bigr\}\Bigr|\\
&\quad= \bigl|E\exp(i[t_1^T X^{(1)} + s_1^T Y^{(1)}]) - E\exp(it_1^T X^{(1)})\,E\exp(is_1^T Y^{(1)})\bigr|\\
&\quad= \bigl|f_{X^{(1)},Y^{(1)}}(t_1, s_1) - f_{X^{(1)}}(t_1)f_{Y^{(1)}}(s_1)\bigr|.
\end{aligned}$$

Combining this with Theorems 1 and 2 of SR, we obtain that

$$V_0(X, Y) = \frac{1}{c_p c_q}\int_{\mathbb{R}^{p+q}} \frac{\bigl|f_{X^{(1)},Y^{(1)}}(t_1, s_1) - f_{X^{(1)}}(t_1)f_{Y^{(1)}}(s_1)\bigr|^2}{|t|_p^{p+1}\,|s|_q^{q+1}}\,dt\,ds.$$

Note that the right-hand side is invariant with respect to the distributions of $\tilde{X}$ and $\tilde{Y}$, and thus we can replace $\tilde{X}$ and $\tilde{Y}$ with degenerate random variables fixed at zero. Doing the same on the left-hand side yields the desired result.

The basic idea we propose is to extend the results to Hilbert spaces which can be approximated by sequences of finite-dimensional Euclidean spaces. We will give a few examples shortly. First, we give the conditions for our results. Assume $X$ is a random variable in a Hilbert space $H_X$ with inner product $\langle\cdot,\cdot\rangle_X$ and norm $\|\cdot\|_X$, and similarly that $Y$ is a random variable in a Hilbert space $H_Y$ with inner product $\langle\cdot,\cdot\rangle_Y$ and norm $\|\cdot\|_Y$. We say that $X$ is *finitely approximable* if there exists a sequence $X_m$ of random variables, each taking values in a finite-dimensional linear subspace of $H_X$, such that $E\|X - X_m\|_X^2 \to 0$ as $m \to \infty$.

Let $X$ be functional data with realizations that are functions in the Hilbert space $H_X = L_2[0, 1]$, the space of square-integrable functions on $[0, 1]$, and suppose $X$ can be expressed as

$$X(t) = \sum_{i=1}^{\infty}\lambda_i Z_i \phi_i(t), \qquad (1)$$

where $Z_1, Z_2, \ldots$ are independent random variables with mean zero and variance 1; $\phi_1, \phi_2, \ldots$ form an orthonormal basis in $L_2[0, 1]$; and $\lambda_1, \lambda_2, \ldots$ are fixed constants satisfying $\sum_{i=1}^{\infty}\lambda_i^2 < \infty$. This formulation can yield a large variety of tight stochastic processes and can be a realistic model for some kinds of functional data.

For each $m \ge 1$, let $M_m: \mathbb{R}^m \mapsto H_X$ be the linear map defined by $M_m(a) = \sum_{i=1}^m a_i\phi_i$, whose adjoint $M_m^{\ast}: H_X \mapsto \mathbb{R}^m$ is given by

$$M_m^{\ast}(f) = \begin{pmatrix} \int_0^1 \phi_1(s)f(s)\,ds \\ \vdots \\ \int_0^1 \phi_m(s)f(s)\,ds \end{pmatrix},$$

and thus $M_m^{\ast}M_m$ is the $m \times m$ identity by the orthonormality of the basis and is therefore positive definite. Set $X_m = M_m M_m^{\ast}X$, the projection of $X$ onto the span of $\phi_1, \ldots, \phi_m$. Since $\sum_{i=1}^{\infty}\lambda_i^2 < \infty$,

$$E\|X - X_m\|_X^2 = E\Bigl\|\sum_{i=m+1}^{\infty}\lambda_i Z_i\phi_i\Bigr\|_X^2 = \sum_{i=m+1}^{\infty}\lambda_i^2 \to 0,$$

as *m* → ∞. Thus *X* is finitely approximable.
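Because the $\phi_i$ are orthonormal, $\|X - X_m\|_X^2 = \sum_{i>m}\lambda_i^2 Z_i^2$, so the approximation error can be simulated entirely in coefficient space. A small check (our own; the infinite sum is emulated by truncating at $N$ terms):

```python
import numpy as np

rng = np.random.default_rng(3)
N, m = 50, 5                             # emulate the infinite sum with N terms
lam = 1.0 / np.arange(1, N + 1) ** 2     # square-summable coefficients lambda_i
Z = rng.normal(size=(20_000, N))         # i.i.d. mean-0, variance-1 coefficients

# By orthonormality of phi_i: ||X - X_m||^2 = sum_{i>m} lambda_i^2 Z_i^2
err2 = ((lam[m:] * Z[:, m:]) ** 2).sum(axis=1)
tail = (lam[m:] ** 2).sum()
print(err2.mean(), tail)  # the two should nearly agree
```

Increasing $m$ shrinks the tail sum, which is exactly the finite-approximability requirement.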

This is basically the same as Example 1, except that we will not require the basis functions to be orthogonal. Specifically, let $X(t)$ be as given in (1), with the basis functions satisfying $\int_0^1 \phi_i^2(s)\,ds = 1$ for all $i \ge 1$ but not necessarily being mutually orthogonal. Let $a_{i,j} = \int_0^1 \phi_i(s)\phi_j(s)\,ds$, for $i, j \ge 1$, and define $A_m$ to be the $m \times m$ matrix with $(i, j)$ entry $a_{i,j}$.

Let $X = (X^{(1)}, X^{(2)}, \ldots)^T$ be an infinitely long Euclidean vector with components satisfying

$$\sum_{i=m+1}^{\infty}E\bigl[X^{(i)}\bigr]^2 \to 0,$$

as $m \to \infty$. It is fairly easy to see that if we let $X_m$ be the vector consisting of the first $m$ components of $X$ followed by zeros, then $E\|X - X_m\|^2 = \sum_{i=m+1}^{\infty}E[X^{(i)}]^2 \to 0$, and thus $X$ is finitely approximable.

The following lemma tells us that the range-related properties of Brownian distance covariance are preserved for finitely approximable random variables:

Assume that $X$ and $Y$ are both finitely approximable random variables in Hilbert spaces. Then $V_n(X, Y)$, $V_n(X)$, $V_n(Y)$, $V_0(X, Y)$, $V_0(X)$ and $V_0(Y)$ are all non-negative, and the other range-related properties given in Remark 1 continue to hold.

Let $X_m$ and $Y_m$ be the finite-dimensional approximating sequences guaranteed by finite approximability. The stated properties hold for each pair $(X_m, Y_m)$ by Remark 1, since finite-dimensional subspaces of Hilbert spaces are isometric to Euclidean spaces, and they are preserved in the limit as $m \to \infty$.

Our ultimate goal in this section, however, is to show that *R*_{0}(*X, Y*) has the same implications for assessing dependence for finitely approximable Hilbert spaces as it does for finite dimensional settings. This is actually quite challenging, and we are only able to achieve part of the goal in this paper. The following is our first result in this direction:

Suppose X and Y are random variables in finitely approximable Hilbert spaces. Then R_{0}(X, Y) > 0 implies that X and Y are dependent.

Assume that $R_0(X, Y) > 0$ but that $X$ and $Y$ are independent. By finite approximability, there exists a sequence of paired random variables $(X_m, Y_m)$, with each $X_m$ depending only on $X$ and each $Y_m$ only on $Y$ (e.g., projections, as in the examples above), such that $E\|X - X_m\|_X^2 \to 0$ and $E\|Y - Y_m\|_Y^2 \to 0$. Independence of $X$ and $Y$ then forces $X_m$ and $Y_m$ to be independent, so that $V_0(X_m, Y_m) = 0$ for every $m$. Letting $m \to \infty$ yields $V_0(X, Y) = 0$ and hence $R_0(X, Y) = 0$, a contradiction.

If we could also show that *R*_{0}(*X, Y*) = 0 implies independence, we would have essentially full homology with the finite dimensional case. It is unclear how to show this in general, and it may not even be true in general. However, it is certainly true for an interesting special case which we now present.

Let $X$ and $Y$ be random variables in finitely approximable Hilbert spaces. Suppose there exist linear maps $M: H_X \mapsto H_X$ and $N: H_Y \mapsto H_Y$, with $M^{\ast}M$ and $N^{\ast}N$ equal to the respective identity maps, such that all of the dependence between $MX$ and $NY$ is carried by finitely many coordinates, the remaining coordinates being independent. In this case we say that $(X, Y)$ is at most finitely dependent.

Suppose that we are interested in determining whether *X* and *Y* are independent, where *X* is either a functional observation or some other very high dimensional observation and *Y* is a continuous outcome of interest such as a time to an event. Suppose also that *X* is finitely approximable and that any potential dependence of *Y* on *X* is solely due to a latent set of finitely many principal components of *X*. Such a pair (*X, Y*) would be at most finitely dependent.

The following lemma on finitely dependent data is the final result of this section:

Suppose that X and Y are finitely approximable random variables in Hilbert spaces and that (X, Y) is at most finitely dependent. Then R_{0}(X, Y) ≥ 0 and the inequality is strict if and only if X and Y are dependent.

Note first that $\|MX\|_X^2 = \langle MX, MX\rangle_X = \langle M^{\ast}MX, X\rangle_X = \langle X, X\rangle_X = \|X\|_X^2$ and, similarly, $\|NY\|_Y = \|Y\|_Y$.

Now let * _{m}* =

If we let $\widehat{U}_m = \bigl([U_1 + U_{2m}^{(1)}]^T, [U_{2m}^{(2)}]^T\bigr)^T$ and $\widehat{Z}_m = \bigl([Z_1 + Z_{2m}^{(1)}]^T, [Z_{2m}^{(2)}]^T\bigr)^T$, the above formulation yields that $\|\widehat{U}_m\|$

We now briefly discuss the issue of power of tests based on $R_n(X, Y)$.
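In practice, the null distribution of $V_n(X, Y)$ (equivalently $R_n(X, Y)$) is commonly calibrated by permutation, and the relative power of competing variants can be assessed the same way. Below is a minimal permutation-test sketch (our own illustration; it computes $V_n$ through the algebraically equivalent double-centered distance matrix form):

```python
import numpy as np

def dcov(a, b):
    """V_n via double-centered distance matrices (equals T1 + T2 - 2*T3)."""
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    return (A * B).mean()

def dist_matrix(Z):
    """n x n matrix of Euclidean distances between rows of Z."""
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def perm_pvalue(X, Y, n_perm=499, seed=0):
    """Permutation p-value for independence based on V_n(X, Y)."""
    rng = np.random.default_rng(seed)
    a, b = dist_matrix(X), dist_matrix(Y)
    obs = dcov(a, b)
    hits = 0
    for _ in range(n_perm):
        p = rng.permutation(len(X))
        hits += dcov(a, b[np.ix_(p, p)]) >= obs  # permute Y relative to X
    return (1 + hits) / (1 + n_perm)
```

Running this on dependent data yields small p-values, while independent data yield roughly uniform ones; any of the tilde-modified statistics above could be dropped into `dcov` to compare power.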

We have briefly proposed two generalizations of the Brownian distance covariance, one based on norms other than the Euclidean norm and the other based on infinite dimensional data. The first generalization raises the possibility of fine-tuning the statistics proposed in SR to increase power, and the second opens the door to applying the results in SR to a broader array of data types, including infinite dimensional data and data whose dimension increases with sample size. For both generalizations, however, many open questions remain that could lead to important further improvements. In any case, the results of SR are very important both practically and theoretically and should result in many important future developments in both the application and theory of statistics.

This research was supported in part by U.S. National Institutes of Health grant CA075142.

- Székely GJ, Rizzo ML. Brownian distance covariance. Annals of Applied Statistics. 2009. In press.
- Székely GJ, Rizzo ML, Bakirov NK. Measuring and testing dependence by correlation of distances. Annals of Statistics. 2007;35:2769–2794.
- van der Vaart AW. Asymptotic Statistics. Cambridge University Press; New York: 1998.
