Article sections

- Abstract
- 1 Introduction
- 2 The distribution of correlated normal variates
- 3 Estimation of the correlation parameters
- 4 Applications
- 5 The non-null distribution of z-values
- 6 Remarks
- 7 Summary
- References

J Am Stat Assoc. Author manuscript; available in PMC 2010 December 1.

Published in final edited form as:

J Am Stat Assoc. 2010 September 1; 105(491): 1042–1055.

doi: 10.1198/jasa.2010.tm09129

PMCID: PMC2967047

NIHMSID: NIHMS164092

Bradley Efron^{*}


We consider large-scale studies in which there are hundreds or thousands of correlated cases to investigate, each represented by its own normal variate, typically a *z*-value. A familiar example is provided by a microarray experiment comparing healthy with sick subjects' expression levels for thousands of genes. This paper concerns the accuracy of summary statistics for the collection of normal variates, such as their empirical cdf or a false discovery rate statistic. It might seem that we must estimate an *N* by *N* correlation matrix, *N* the number of cases, but our main result shows that this is not necessary: good accuracy approximations can be based on the root mean square correlation over all *N* · (*N* − 1)/2 pairs, a quantity often easily estimated. A second result shows that *z*-values closely follow normal distributions even under non-null conditions, supporting application of the main theorem. Practical application of the theory is illustrated for a large leukemia microarray study.

Modern scientific studies routinely produce data on thousands of related situations. A familiar example is a microarray experiment in which thousands of genes are being investigated for possible disease involvement. Each gene might produce a *z*-value, say *z _{i}*, satisfying

$${H}_{0}:{z}_{i}\sim \mathcal{N}(0,1)$$

(1.1)

under the null hypothesis *H*_{0} of no disease involvement. A great deal of the current literature was developed under the assumption of independence among the *z _{i}*'s. This can be grossly unrealistic in practice, as discussed in Owen (2005) and Efron (2007a), among others. This paper concerns the accuracy of summary statistics of the full collection of *z _{i}*'s in the presence of correlation.

Figure 1 concerns a leukemia microarray study by Golub et al. (1999) that we will use for motivation and illustration. Two forms of leukemia are being examined for possible genetic differences: acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). In the version of the data discussed here there are *n*_{1} = 47 ALL patients and *n*_{2} = 25 AML patients, with expression levels on the same *N* = 7128 genes measured on each patient.

Histogram of *z*-values for *N* = 7128 genes, leukemia study, Golub et al. (1999). *Dashed curve* $\widehat{f}\left(x\right)$, smooth fit to histogram; *solid curve* “empirical null”, normal density fit from central 50% of histogram, is much wider than the theoretical $\mathcal{N}(0,1)$ null.

A two-sample *t*-statistic *t _{i}* comparing AML with ALL expression levels was computed for each gene and converted to a *z*-value,

$${z}_{i}={\Phi}^{-1}\left({F}_{70}\left({t}_{i}\right)\right)\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}i=1,2,\dots ,N,$$

(1.2)

where Φ and *F*_{70} are the cumulative distribution functions for a standard normal and a Student-*t* distribution with 70 degrees of freedom. Figure 1 shows a histogram of the *z _{i}*'s, which turns out to be much wider than (1.1) suggests: its central spread is estimated to be $\widehat{\sigma}_0 = 1.68$, rather than 1.

Here is an example of the results to be derived in Sections 2 through 4. Let $\widehat{F}\left(x\right)$ be the right-sided cdf (“survival curve”) of the *z*-values,

$$\widehat{F}\left(x\right)=\#\{{z}_{i}>x\}\u2215N.$$

(1.3)

Then a good approximation for the variance of $\widehat{F}\left(x\right)$ is

$$\mathrm{Var}\left\{\widehat{F}\left(x\right)\right\}\doteq \left\{\frac{\widehat{F}\left(x\right)\left(1-\widehat{F}\left(x\right)\right)}{N}\right\}+{\left\{\frac{{\widehat{\sigma}}_{0}^{2}\widehat{\alpha}{\widehat{f}}^{\left(1\right)}\left(x\right)}{\sqrt{2}}\right\}}^{2}.$$

(1.4)

The first term in (1.4) is the usual binomial variance, while the second term is a *correlation penalty* accounting for dependence between the *z _{i}*'s. The quantities occurring in the correlation penalty are

- $\widehat{\sigma}_0$, the estimate of central spread (1.68 above);
- $\widehat{\alpha}$, an estimate of the root-mean-square of the correlations between the *N*(*N* − 1)/2 pairs of *z _{i}*'s (equaling about .11 for the leukemia data, as calculated from the simple formula in Section 3);
- $\widehat{f}^{\left(1\right)}\left(x\right)$, the first derivative of a smooth fit to the *z*-value histogram (estimated by a Poisson spline regression in Figure 1).

The row marked $\widehat{\mathrm{sd}}$ in Table 1 is the square root of formula (1.4) applied to the leukemia data. $\widehat{F}\left(4\right) = .025$ is seen to have $\widehat{\mathrm{sd}}=.0040$, more than double ${\widehat{\mathrm{sd}}}_{0}=.0018$, the binomial standard deviation obtained by ignoring the second term in (1.4). The permutation standard deviation, obtained from repeated permutations of the 72 patients, is only .0001 at *x* = 4. Permutation methods, which preserve within-microarray correlations, have been advocated for large-scale hypothesis testing (see Westfall and Young, 1993; Dudoit, Shaffer and Boldrick, 2003, Sect. 2.6), but they are inappropriate for the accuracy considerations of this paper.
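Formula (1.4) is simple to evaluate once its ingredients are in hand. The sketch below codes the formula directly; *N*, $\widehat{\sigma}_0$ and $\widehat{\alpha}$ are the values quoted above for the leukemia study, while the plug-in values for $\widehat{F}(x)$ and $\widehat{f}^{(1)}(x)$ are hypothetical stand-ins, not the actual fits.

```python
import math

def var_F_hat(F_x, N, sigma0, alpha, f1_x):
    """Approximate Var{F^(x)} by formula (1.4): the usual binomial
    variance plus the correlation penalty term."""
    binomial = F_x * (1.0 - F_x) / N
    penalty = (sigma0**2 * alpha * f1_x / math.sqrt(2.0))**2
    return binomial + penalty

# N, sigma0, alpha as quoted for the leukemia study; F_x and f1_x are
# hypothetical stand-ins for F^(4) and the smooth-fit derivative at x = 4.
N, sigma0, alpha = 7128, 1.68, 0.11
F_x, f1_x = 0.025, -0.016
sd = math.sqrt(var_F_hat(F_x, N, sigma0, alpha, f1_x))
sd0 = math.sqrt(F_x * (1.0 - F_x) / N)   # binomial-only standard deviation
```

With stand-in values of this magnitude the penalty term dominates, reproducing the qualitative message of Table 1: ignoring correlation substantially understates the variability of $\widehat{F}(x)$.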

Estimates of standard deviation for the right-sided cdf $\widehat{F}\left(x\right)$ (1.3): $\widehat{\mathrm{sd}}$, square root of formula (1.4); ${\widehat{\mathrm{sd}}}_{0}$, square root of first term in (1.4); ${\widehat{\mathrm{sd}}}_{\mathrm{perm}}$, permutation standard deviation. Accuracy of the False Discovery Rate estimate $\widehat{\mathrm{Fdr}}\left(x\right)$ is discussed in Section 4.

Formula (1.4), and more ambitious versions that include covariances across different values of *x*, are derived in Section 2 and Section 3: an exact expression is derived first, followed by a series of simplifying approximations and techniques for their estimation. The basic results are extended to provide more general accuracy estimates in Section 4: comparing, for example, the variability of local versus tail-area false discovery rates.

All of our results depend on the assumption that the *z _{i}*'s are normal with possibly different means and variances,

$${z}_{i}\sim \mathcal{N}({\mu}_{i},{\sigma}_{i}^{2})\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}i=1,2,\dots ,N.$$

(1.5)

There is no requirement that they be “*z*-values” in the hypothesis-testing sense of (1.1), and in fact this paper is more concerned with estimation than testing. However, *z*-values are ubiquitous in large-scale applications, and not only in the two-sample setting of the leukemia study. Section 5 concerns the *non-null distribution of z-values*. A theorem is derived justifying (1.5) as a good approximation, allowing results like (1.4) to be applied to the leukemia study *z*-values. Section 6 and Section 7 close with remarks and a brief summary.

The statistics microarray literature has shown considerable interest in the effects of large-scale correlation, some good references being Dudoit, van der Laan and Pollard (2004), Owen (2005), Qiu, Klebanov and Yakovlev (2005b), Qiu, Brooks, Klebanov and Yakovlev (2005a) and Desai, Deller and McCormick (2009). Efron (2007a) used a *z*-value setting to examine the effects of correlation on false discovery rate analysis; that paper's Section 2 theorem is a null hypothesis version of the general result developed here. A useful extension along different lines appears in Schwartzman and Lin (2009).

Clarke and Hall's (2009) asymptotic calculations support the use of the independence standard deviation ${\widehat{\mathrm{sd}}}_{0}$ in Table 1, even in the face of correlation. The situations they consider are low-correlation by the standard here, with the root-mean-square correlation *α* of (1.4) approaching zero (from their assumption (3.2)). Since *α* is often easy to estimate, formulas such as (1.4) provide a quantitative check on the use of ${\widehat{\mathrm{sd}}}_{0}$.

Given *N* correlated normal variates *z*_{1}, *z*_{2}, … , *z _{N}*, with possibly different means and standard deviations, let $\widehat{F}\left(x\right)$ denote their right-sided empirical cdf,

$$\widehat{F}\left(x\right)=\#\{{z}_{i}\ge x\}\u2215N,\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\text{for}\phantom{\rule{thinmathspace}{0ex}}-\infty <x<\infty .$$

(2.1)

This section presents tractable formulas for the mean and covariance of the process {(*x*), −∞ < *x* < ∞}, and a simpler approximation that we will see is nicely suited for applications.

Rather than work directly with cdfs, it will be easier, and in a sense more basic, to first derive results for a discretized version of the empirical *density* of the *z _{i}* values. We partition the range $\mathcal{Z}$ of possible *z*-values into *K* bins $\mathcal{Z}_k$,

$$\mathcal{Z}=\bigcup _{k=1}^{K}{\mathcal{Z}}_{k},$$

(2.2)

each bin being of width Δ. Let *x _{k}* indicate the midpoint of $\mathcal{Z}_k$, and let *y _{k}* be the number of *z _{i}*'s falling in it,

$${y}_{k}=\#\{{z}_{i}\in {\mathcal{Z}}_{k}\}\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}k=1,2,\dots ,K.$$

(2.3)

We will derive expressions for the mean and covariance of the vector ** y** = (*y*_{1}, *y*_{2}, … , *y _{K}*)′.

Suppose that the *z _{i}*'s are divided into a finite number of classes, with members of the *c*th class $\mathcal{C}_c$ distributed as

$${z}_{i}\sim \mathcal{N}({\mu}_{c},{\sigma}_{c}^{2})\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\text{for}\phantom{\rule{thinmathspace}{0ex}}{z}_{i}\in {\mathcal{C}}_{c}.$$

(2.4)

Let *N _{c}* be the number of members of $\mathcal{C}_c$, with *p _{c}* the corresponding proportion,

$${N}_{c}=\#\left\{{\mathcal{C}}_{c}\right\}\phantom{\rule{1em}{0ex}}\text{and}\phantom{\rule{1em}{0ex}}{p}_{c}={N}_{c}\u2215N$$

(2.5)

so Σ* _{c} N_{c}* = *N* and Σ* _{c} p_{c}* = 1.

If ** x** is the *K*-vector of bin midpoints (*x*_{1}, *x*_{2}, … , *x _{K}*)′, define its standardized version for class *c*,

$${\mathit{x}}_{c}=(\mathit{x}-{\mu}_{c})\u2215{\sigma}_{c}={(\dots ,{x}_{\mathit{kc}},\dots )}^{\prime}.$$

(2.6)

Likewise, for any real-valued function *h*(*x*) we define * h_{c}* to be the *K*-vector

$${\mathit{h}}_{c}={(\dots ,h\left({x}_{\mathit{kc}}\right),\dots )}^{\prime},$$

(2.7)

also denoted by *h*(*x _{c}*) in what follows.

It is easy to calculate the expectation of the count vector ** y** under the multi-class model (2.4)-(2.5). Let

$${\pi}_{\mathit{kc}}={\mathrm{Prob}}_{c}\{{z}_{i}\in {\mathcal{Z}}_{k}\}\doteq \Delta \phi \left({x}_{\mathit{kc}}\right)\u2215{\sigma}_{c}.$$

(2.8)

Here φ(*x*) = exp(−*x*^{2}/2)/√(2π), the standard normal density. The approximation *π _{kc}* ≐ Δφ(*x _{kc}*)/*σ _{c}* in (2.8) becomes arbitrarily accurate as Δ → 0. Writing ** π**_{c} and ** φ**_{c} for the *K*-vectors (… , *π _{kc}*, …)′ and φ(** x**_{c}), (2.8) gives


$$E\left\{\mathit{y}\right\}=N\sum _{c}{p}_{c}{\pi}_{c}=N\Delta \sum _{c}{p}_{c}\phi \left({\mathit{x}}_{c}\right)/{\sigma}_{c}=N\Delta \sum _{c}{p}_{c}{\phi}_{c}/{\sigma}_{c}.$$

(2.9)

The *K* × *K* covariance matrix of the count vector ** y** depends on the correlations among the *z _{i}*'s. Let *g*(*ρ*) denote the distribution of the *N*(*N* − 1)/2 pairwise correlations, taken to be discrete on *M* support points, and let φ_{ρ}(*u*, *v*) be the standard bivariate normal density with correlation *ρ*. Define

$${\lambda}_{\rho}(u,v)=\frac{{\phi}_{\rho}(u,v)}{\phi \left(u\right)\phi \left(v\right)}-1={(1-{\rho}^{2})}^{-\frac{1}{2}}\phantom{\rule{thinmathspace}{0ex}}\mathrm{exp}\left\{\frac{2\rho \mathit{uv}-{\rho}^{2}({u}^{2}+{v}^{2})}{2(1-{\rho}^{2})}\right\}-1$$

(2.10)

and

$$\lambda (u,v)={\int}_{-1}^{1}{\lambda}_{\rho}(u,v)g\left(\rho \right)d\rho $$

(2.11)

(the integral notation being shorthand for summing over *M* discrete points).

*Under the multi-class model* (2.4)-(2.5), *the covariance of the count vector *** y** (2.3) *is*

$$\mathbf{cov}\left(\mathit{y}\right)={\mathbf{cov}}_{0}+{\mathbf{cov}}_{1}$$

(2.12)

*where*

$${\mathbf{cov}}_{0}=N\sum _{c}{p}_{c}\phantom{\rule{thinmathspace}{0ex}}\{\mathrm{diag}\left({\pi}_{c}\right)-{\pi}_{c}{\pi}_{c}^{\prime}\}$$

(2.13)

*and*

$$\begin{array}{cc}\hfill {\mathbf{cov}}_{1}={N}^{2}& \sum _{c}\sum _{d}{p}_{c}{p}_{d}\phantom{\rule{thinmathspace}{0ex}}\mathrm{diag}\left({\pi}_{c}\right){\lambda}_{\mathit{cd}}\phantom{\rule{thinmathspace}{0ex}}\mathrm{diag}\left({\pi}_{d}\right)\hfill \\ \hfill & -N\sum _{c}{p}_{c}\phantom{\rule{thinmathspace}{0ex}}\mathrm{diag}\left({\pi}_{c}\right){\lambda}_{\mathit{cc}}\phantom{\rule{thinmathspace}{0ex}}\mathrm{diag}\left({\pi}_{c}\right).\hfill \end{array}$$

(2.14)

Here diag(* π_{c}*) is the *K* × *K* diagonal matrix with entries *π _{kc}*, and ** λ**_{cd} is the *K* × *K* matrix with entries λ(*x _{kc}*, *x _{ld}*).

*Note.* Equation (2.14) assumes that the correlation distribution *g*(*ρ*) is the same across all pairs of classes $\mathcal{C}_c$ and $\mathcal{C}_d$. The proof of Lemma 1, which is similar to that for the simpler situation of Efron (2007a), appears in Remark C of Section 6.

The **cov**_{0} term in (2.12)-(2.13) is the sum of the multinomial covariance matrices that would apply if the *z _{i}*'s were mutually independent, with fixed numbers *N _{c}* drawn from each class; **cov**_{1} is the penalty for correlation among the *z _{i}*'s.

Expression (2.14) for the correlation penalty can be considerably simplified. *Mehler's identity* for λ_{ρ}(*u*, *v*) (2.10) is

$${\lambda}_{\rho}(u,v)=\sum _{j\ge 1}\frac{{\rho}^{j}}{j!}{h}_{j}\left(u\right){h}_{j}\left(v\right)$$

(2.15)

where *h _{j}* is the *j*th Hermite polynomial. Defining the correlation moments

$${\alpha}_{j}={\int}_{-1}^{1}{\rho}^{j}g\left(\rho \right)d\rho ,$$

(2.16)

(2.11) becomes

$$\lambda (u,v)=\sum _{j\ge 1}\frac{{\alpha}_{j}}{j!}{h}_{j}\left(u\right){h}_{j}\left(v\right)$$

(2.17)

so **λ**_{cd} in (2.14) can be written in outer product notation as

$${\lambda}_{\mathit{cd}}=\sum _{j\ge 1}\frac{{\alpha}_{j}}{j!}{h}_{j}\left({\mathit{x}}_{c}\right){h}_{j}{\left({\mathit{x}}_{d}\right)}^{\prime}.$$

(2.18)
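Mehler's identity (2.15) is easy to verify numerically. The sketch below is a self-contained check, not part of the paper's methodology: it compares the exact λ_{ρ}(*u*, *v*) of (2.10) with a truncation of the Hermite expansion, generating the probabilists' Hermite polynomials by the standard recurrence *h*_{j+1}(*u*) = *u h _{j}*(*u*) − *j h*_{j−1}(*u*).

```python
import math

def lam_exact(u, v, rho):
    """lambda_rho(u, v) as given in closed form by (2.10)."""
    return ((1.0 - rho**2) ** -0.5
            * math.exp((2.0*rho*u*v - rho**2*(u**2 + v**2))
                       / (2.0*(1.0 - rho**2)))
            - 1.0)

def hermite(j, u):
    """Probabilists' Hermite polynomial h_j(u), via the recurrence
    h_{j+1}(u) = u*h_j(u) - j*h_{j-1}(u)."""
    if j == 0:
        return 1.0
    h_prev, h = 1.0, u
    for k in range(1, j):
        h_prev, h = h, u*h - k*h_prev
    return h

def lam_mehler(u, v, rho, J=20):
    """Mehler expansion (2.15), truncated at j = J."""
    return sum(rho**j / math.factorial(j) * hermite(j, u) * hermite(j, v)
               for j in range(1, J + 1))
```

For moderate *ρ* the series converges quickly, which is why dropping the terms with *j* ≥ 3, as in the rms approximation below, loses little.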

Making use of (2.8), taken as exact,

$$\begin{array}{cc}\hfill \mathrm{diag}\left({\pi}_{c}\right){h}_{j}\left({\mathit{x}}_{c}\right)& =\Delta \phantom{\rule{thinmathspace}{0ex}}\mathrm{diag}\left(\phi \left({\mathit{x}}_{c}\right)\right){h}_{j}\left({\mathit{x}}_{c}\right)/{\sigma}_{c}\hfill \\ \hfill & ={(-1)}^{j}\Delta \cdot {\phi}_{c}^{\left(j\right)}/{\sigma}_{c}\hfill \end{array}$$

(2.19)

where ${\phi}_{c}^{\left(j\right)}$ indicates the *j*th derivative of φ(*u*) evaluated at each component of * x_{c}* (using the Hermite identity φ^{(j)}(*u*) = (−1)^{j} *h _{j}*(*u*)φ(*u*)).

Rearranging (2.14) then gives a simplified formula.

*Defining*

$${\stackrel{\u2012}{\varphi}}^{\left(j\right)}\equiv \sum _{c}{p}_{c}{\phi}_{c}^{\left(j\right)}\u2215{\sigma}_{c},$$

(2.20)

(2.14) *for the correlation penalty becomes*

$${\mathbf{cov}}_{1}={N}^{2}{\Delta}^{2}\left\{\sum _{j\ge 1}\frac{{\alpha}_{j}}{j!}{\stackrel{\u2012}{\varphi}}^{\left(j\right)}{\stackrel{\u2012}{\varphi}}^{{\left(j\right)}^{\prime}}-\frac{1}{N}\sum _{j\ge 1}\frac{{\alpha}_{j}}{j!}\left(\sum _{c}{p}_{c}{\phi}_{c}^{\left(j\right)}{\phi}_{c}^{{\left(j\right)}^{\prime}}\u2215{\sigma}_{c}^{2}\right)\right\}.$$

(2.21)

A convenient approximation to **cov**_{1} is based on three reductions of (2.21):

- The second term in (2.21) is negligible for large *N*.
- Common standardization methods for large-scale data sets often make *α*_{1}, the expectation of *g*(*ρ*), exactly or nearly zero, as illustrated in Section 3 for the leukemia data; see Section 3 of Efron (2009).
- The terms with *j* ≥ 3 are dropped, the higher moments *α _{j}* usually being small compared to *α*_{2}.

The root mean square (rms) correlation

$$\alpha ={\alpha}_{2}^{1\u22152}={\left[{\int}_{-1}^{1}{\rho}^{2}g\left(\rho \right)d\rho \right]}^{\frac{1}{2}}$$

(2.22)

featured in Efron (2007a) (where it is called the *total correlation*), is a single-number summary of the *z _{i}*'s entire correlation structure. Carrying out the three reductions above produces a greatly simplified form of (2.21),

$$\mathit{rms}\phantom{\rule{thinmathspace}{0ex}}\mathit{approximation}:\phantom{\rule{1em}{0ex}}{\mathbf{cov}}_{1}\doteq {\left(N\Delta \alpha \right)}^{2}{\stackrel{\u2012}{\varphi}}^{\left(2\right)}{\stackrel{\u2012}{\varphi}}^{{\left(2\right)}^{\prime}}\u22152$$

(2.23)

with ${\stackrel{\u2012}{\varphi}}^{\left(2\right)}$ in (2.20) depending on the second derivative of the normal density, φ^{(2)}(*u*) = φ(*u*) · (*u*^{2} − 1).

Figure 2 compares the exact formulas (2.12)-(2.14) for **cov**(** y**) with the simplified formula based on the rms approximation (2.23), for a numerical example having two classes,

$$({\mu}_{0},{\sigma}_{0})=(0,1),\phantom{\rule{thinmathspace}{0ex}}{p}_{0}=.95\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\text{and}\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}({\mu}_{1},{\sigma}_{1})=(2.5,1),\phantom{\rule{thinmathspace}{0ex}}{p}_{1}=.05$$

(2.24)

but then recentered as in the leukemia example; see Remark D of Section 6 for more detail. The plotted curves show the standard deviations $\mathrm{sd}\left\{{y}_{k}\right\}={\mathrm{cov}}_{\mathit{kk}}{\left(\mathit{y}\right)}^{\frac{1}{2}}$ from (2.12), the corresponding rms approximation (2.23), and also ${\mathrm{sd}}_{0}\left\{{y}_{k}\right\}={\left({\mathrm{cov}}_{0\mathit{kk}}\right)}^{\frac{1}{2}}$ from (2.13). We can see there is a substantial correlation penalty over most of the range of *z*, and also that the rms approximation is quite satisfactory here.

Returning to right-sided cdfs (2.1), let ** B** be the *K* × *K* upper-triangular matrix of ones,

$${\mathit{B}}_{{\mathit{kk}}^{\prime}}=\{\begin{array}{cc}1\hfill & \text{if}\phantom{\rule{thinmathspace}{0ex}}k\le {k}^{\prime}\hfill \\ 0\hfill & \text{if}\phantom{\rule{thinmathspace}{0ex}}k>{k}^{\prime}\hfill \end{array}\phantom{\}}$$

(2.25)

so

$$\widehat{\mathit{F}}=\frac{1}{N}\mathit{By}$$

(2.26)

is a *K*-vector with *k*th component the proportion of *z _{i}*'s in bins indexed ≥ *k*,

$${\widehat{F}}_{k}=\#\{{z}_{i}\ge {x}_{k}-\Delta \u22152\}\u2215N\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}(k=1,2,\dots ,K).$$

(2.27)

(** B** would be transposed if we were dealing with left-sided cdfs.) The expectation of ${\widehat{F}}_{k}$ is

$$\begin{array}{cc}\hfill E\left\{{\widehat{F}}_{k}\right\}& =\sum _{c}{p}_{c}\left[\sum _{{k}^{\prime}\ge k}\Delta \phi \left(\frac{{x}_{{k}^{\prime}}-{\mu}_{c}}{{\sigma}_{c}}\right)/{\sigma}_{c}\right]\doteq \sum _{c}{p}_{c}{\int}_{{x}_{\mathit{kc}}}^{\infty}\phi \left(u\right)\mathit{du}\hfill \\ \hfill & =\sum _{c}{p}_{c}{\Phi}^{+}\left({x}_{\mathit{kc}}\right)\hfill \end{array}$$

(2.28)

where Φ^{+}(*u*) = 1 − Φ(*u*). Now that we are working with tail areas rather than densities we can let Δ → 0, making (2.28) exact.

From (2.26), $\widehat{\mathit{F}}$ has covariance matrix ** B** **cov**(** y**) ** B**′/*N*^{2}. Applying Lemma 1 and letting Δ → 0 gives our main result.

*Under the multiclass model* (2.4)-(2.5),

$$\mathbf{Cov}\left(\widehat{\mathit{F}}\right)={\mathbf{Cov}}_{0}+{\mathbf{Cov}}_{1}$$

(2.29)

*where* **Cov**_{0} *has klth entry*

$$\frac{1}{N}\sum _{c}{p}_{c}\left\{{\Phi}^{+}\left(\mathrm{max}({x}_{\mathit{kc}},{x}_{\mathit{lc}})\right)-{\Phi}^{+}\left({x}_{\mathit{kc}}\right){\Phi}^{+}\left({x}_{\mathit{lc}}\right)\right\}$$

(2.30)

*and*

$${\mathbf{Cov}}_{1}=\sum _{j}\frac{{\alpha}_{j}}{j!}{\stackrel{\u2012}{\phi}}^{(j-1)}{\stackrel{\u2012}{\phi}}^{{(j-1)}^{\prime}}-\frac{1}{N}\sum _{j}\frac{{\alpha}_{j}}{j!}\left\{\sum _{c}{p}_{c}{\phi}_{c}^{(j-1)}{\phi}_{c}^{{(j-1)}^{\prime}}\right\}.$$

(2.31)

Here *p _{c}* is from (2.5), ${\phi}_{c}^{(j-1)}$ denotes ${\phi}^{(j-1)}\left({\mathit{x}}_{c}\right)$ as in (2.7), and

$${\stackrel{\u2012}{\phi}}^{(j-1)}=\sum _{c}{p}_{c}{\phi}_{c}^{(j-1)}=\sum _{c}{p}_{c}{\phi}^{(j-1)}\left({\mathit{x}}_{c}\right).$$

(2.32)

(Notice the distinction between ${\stackrel{\u2012}{\phi}}^{(j-1)}$ in (2.32) and ${\stackrel{\u2012}{\varphi}}^{\left(j\right)}$ in (2.20), and between **Cov** and **cov**, etc., of Lemma 1.)

The three-step reduction leading to (2.23) can also be applied to **Cov**_{1} (2.31): for *α* as in (2.22),

$$\mathit{Rms}\phantom{\rule{thinmathspace}{0ex}}\mathrm{approximation}:\phantom{\rule{1em}{0ex}}{\mathbf{Cov}}_{1}\doteq {\alpha}^{2}{\stackrel{\u2012}{\phi}}^{\left(1\right)}{\stackrel{\u2012}{\phi}}^{{\left(1\right)}^{\prime}}\u22152$$

(2.33)

with ${\stackrel{\u2012}{\phi}}^{\left(1\right)}$ depending on the first derivative of the normal density, φ^{(1)}(*u*) = −φ(*u*)*u*. Section 3 shows that (2.33) is especially convenient for applications.

Figure 3 is the version of Figure 2 applying to $\widehat{\mathit{F}}$: the heavy curve tracks sd(${\widehat{F}}_{k}$) from (2.29), the dotted curve is from the Rms approximation (2.33), and the dashed curve shows the standard deviations from **Cov**_{0} alone (2.30).

The **Cov**_{0} formula (2.30) is essentially the covariance function for a Brownian bridge. Results related to Theorem 1 can be found in the empirical process literature; see equation (2.2) of Csörgő and Mielniczuk (1996) for example, which applies to the “one-class” case when all the *z _{i}*'s are $\mathcal{N}(0, 1)$. Desai et al. (2009) extend the covariance calculations in Efron (2007a) to include skewness corrections.

Application of Section 2's theory requires us to estimate several parameters: the rms correlation *α* (2.22), and the class components (*p _{c}*, *μ _{c}*, *σ _{c}*) of model (2.4)-(2.5). This section discusses their estimation, with the leukemia data serving as an example.

Let ** X** be the *N* × *n* matrix of expression levels, *n* = 72 for the leukemia study; the *i*th row of ** X** gives gene *i*'s measurements across the *n* patients.

We wish to estimate the rms correlation *α* (2.22). Let *X*_{0} indicate an *N* × *n*_{0} subset of ** X** pertaining to a single population of subjects, for example the 47 ALL patients. There are *N*(*N* − 1)/2 sample correlations ${\widehat{\rho}}_{{\mathit{ii}}^{\prime}}$ between pairs of rows of *X*_{0}; let *m* and *v* denote the mean and variance of their empirical distribution,

$$\widehat{\rho}\sim (m,v),$$

(3.1)

(*m*, *v*) = (.002,.190^{2}) for the ALL patients. As discussed in Section 3 of Efron (2009), standardizing the columns of *X*_{0} to have mean 0 forces *m* ≐ 0, and we will assume *m* = 0 in what follows. (This is equivalent to taking *α*_{1} = 0 as we did following (2.21).)

The obvious choice $\stackrel{-}{\alpha}={v}^{\frac{1}{2}}$ tends to greatly overestimate *α*: each ${\widehat{\rho}}_{{\mathit{ii}}^{\prime}}$ is nearly unbiased for its true correlation *ρ*_{ii′}, a normal-theory approximation for mean and variance being

$${\widehat{\rho}}_{{\mathit{ii}}^{\prime}}\sim \left({\rho}_{{\mathit{ii}}^{\prime}},{(1-{\rho}_{{\mathit{ii}}^{\prime}}^{2})}^{2}\u2215(n-3)\right)$$

(3.2)

(Johnson and Kotz, 1970), but the considerable variances in (3.2) can greatly broaden the empirical distribution of the ${\widehat{\rho}}_{{\mathit{ii}}^{\prime}}$'s. Two corrected estimates of *α* are developed in Efron (2009). The simpler correction formula is

$${\widehat{\alpha}}^{2}=\frac{{n}_{0}}{{n}_{0}-1}\left(v-\frac{1}{{n}_{0}-1}\right)$$

(3.3)

based on an identity between the row and column correlations of *X*_{0}. The second approach uses an empirical Bayes analysis of the variance term in (3.3) to justify a more elaborate formula,

$${\stackrel{~}{\alpha}}^{2}=\stackrel{~}{v}-\frac{3}{n-5}{\stackrel{~}{v}}^{2}\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\left[\stackrel{~}{v}=\frac{(n-3)v-1}{n-5}\right].$$

(3.4)

The first three columns of Table 2 compare $\widehat{\alpha}$ with $\stackrel{~}{\alpha}$ for *X*_{0} based on the ALL patients, the AML patients, and both. The final column reports mean ± standard deviation for $\widehat{\alpha}$ and $\stackrel{~}{\alpha}$ in 100 simulations of model (2.24): *N* = 6000, *n*_{1} = *n*_{2} = 40 patients in each class, true *α* = .10; see Remark D. The two estimates are effectively linear functions of each other for typical values of *v*; $\widehat{\alpha}$, the simpler choice, is preferred by the author.
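Formula (3.3) is simple enough to sketch in code. The fragment below is a minimal illustration, not the estimator of Efron (2009) in full: it computes *v* from all *N*(*N* − 1)/2 row correlations of a simulated *X*_{0} with independent rows (true *α* = 0), after column standardization so that *m* ≐ 0. Even with no true correlation the raw $v^{1/2}$ sits near $1/\sqrt{n_0-1}$, while $\widehat{\alpha}^2$ from (3.3) is near zero.

```python
import numpy as np

def alpha_hat_sq(X0):
    """Return (v, alpha_hat^2) for an N x n0 data matrix, via (3.3).

    v is the variance of the N(N-1)/2 pairwise row correlations,
    computed after standardizing columns to have mean 0."""
    N, n0 = X0.shape
    Xs = X0 - X0.mean(axis=0)                    # column standardization
    rho = np.corrcoef(Xs)[np.triu_indices(N, k=1)]
    v = rho.var()
    return v, n0 / (n0 - 1.0) * (v - 1.0 / (n0 - 1.0))

rng = np.random.default_rng(0)
X0 = rng.standard_normal((300, 60))              # independent rows: alpha = 0
v, a2 = alpha_hat_sq(X0)                         # v near 1/59, a2 near 0
```

By chance $\widehat{\alpha}^2$ can come out slightly negative, in which case $\widehat{\alpha}$ would be set to 0.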

It seems that we need to estimate the class components (*p _{c}*, *μ _{c}*, *σ _{c}*) in order to apply the formulas of Section 2, but for the most part this turns out to be unnecessary.

The marginal density *f*(*z*) under model (2.4)-(2.5) is

$$f\left(z\right)=\sum _{c}{p}_{c}\phi \left(\frac{z-{\mu}_{c}}{{\sigma}_{c}}\right)\frac{1}{{\sigma}_{c}};$$

(3.5)

so, letting ** f** = (… , *f*(*x _{k}*), …)′, the multinomial term (2.13) can be written as

$${\mathbf{cov}}_{0}=N\left\{\mathrm{diag}\left(\Delta \mathit{f}\right)-\sum _{c}{p}_{c}{\pi}_{c}{\pi}_{c}^{\prime}\right\}.$$

(3.6)

Here we are assuming, as in (2.5), that the class sample sizes *N _{c}* are fixed. A more realistic assumption might be that the numbers *N _{c}* vary multinomially, in which case (3.6) reduces to the standard multinomial covariance

$${\mathbf{cov}}_{0}=N\left\{\mathrm{diag}\left(\Delta \mathit{f}\right)-{\Delta}^{2}{\mathit{ff}}^{\prime}\right\}.$$

(3.7)

A smooth curve $\widehat{\mathit{f}}$ fit to the histogram heights, as in Figure 1, then yields ${\widehat{\mathbf{cov}}}_{0}$ by substitution into (3.7), *without requiring knowledge of the class structure* (2.4)-(2.5). In the same way, we can estimate **Cov**_{0} for $\widehat{\mathit{F}}$ in (2.30) by the standard multinomial formula

$${\left({\widehat{\mathbf{Cov}}}_{0}\right)}_{\mathit{kl}}=\frac{1}{N}\left\{{\widehat{F}}_{\mathrm{max}(k,l)}-{\widehat{F}}_{k}{\widehat{F}}_{l}\right\}.$$

(3.8)
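Formula (3.8) involves nothing more than the empirical cdf itself. A small sketch, using a toy right-sided cdf vector (hypothetical values, decreasing in *k* as (2.27) requires); note that the diagonal entries reduce to the binomial variance $\widehat{F}_k(1-\widehat{F}_k)/N$:

```python
import numpy as np

def cov0_hat(F_hat, N):
    """Multinomial covariance estimate (3.8) for a right-sided empirical cdf:
    (Cov0)_kl = {F_max(k,l) - F_k * F_l} / N."""
    K = len(F_hat)
    idx = np.maximum.outer(np.arange(K), np.arange(K))   # index max(k, l)
    return (F_hat[idx] - np.outer(F_hat, F_hat)) / N

# Toy right-sided cdf over 4 bins (invented for illustration)
F_hat = np.array([1.0, 0.6, 0.25, 0.05])
C0 = cov0_hat(F_hat, N=1000)
```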

Under some circumstances, a similar tactic can be applied to estimate the correlation penalties **cov**_{1} and **Cov**_{1}, (2.23) and (2.33). The first and second derivatives *f*^{(1)}(*z*) and *f*^{(2)}(*z*) of (3.5) are

$${f}^{\left(1\right)}\left(z\right)=\sum _{c}{p}_{c}{\phi}^{\left(1\right)}\left(\frac{z-{\mu}_{c}}{{\sigma}_{c}}\right)\frac{1}{{\sigma}_{c}^{2}}\phantom{\rule{1em}{0ex}}\text{and}\phantom{\rule{1em}{0ex}}{f}^{\left(2\right)}\left(z\right)=\sum _{c}{p}_{c}{\phi}^{\left(2\right)}\left(\frac{z-{\mu}_{c}}{{\sigma}_{c}}\right)\frac{1}{{\sigma}_{c}^{3}}.$$

(3.9)

Suppose we make the *homogeneity assumption* that all *σ _{c}* values are the same, say *σ*_{0}. Comparing (3.9) with (2.20) and (2.32) then gives

$${\stackrel{\u2012}{\phi}}^{\left(1\right)}={\sigma}_{0}^{2}{\mathit{f}}^{\left(1\right)}\phantom{\rule{1em}{0ex}}\text{and}\phantom{\rule{1em}{0ex}}{\stackrel{\u2012}{\varphi}}^{\left(2\right)}={\sigma}_{0}^{2}{\mathit{f}}^{\left(2\right)}$$

(3.10)

with *f*^{(j)} = (*f*^{(j)}(*x _{k}*))′. This leads to the convenient covariance penalty formulas,

$${\mathbf{Cov}}_{1}\doteq \frac{{\left({\sigma}_{0}^{2}\alpha \right)}^{2}}{2}{\mathit{f}}^{\left(1\right)}{\mathit{f}}^{{\left(1\right)}^{\prime}}\phantom{\rule{1em}{0ex}}\text{and}\phantom{\rule{1em}{0ex}}{\mathbf{cov}}_{1}\doteq \frac{{\left(N\Delta {\sigma}_{0}^{2}\alpha \right)}^{2}}{2}{\mathit{f}}^{\left(2\right)}{\mathit{f}}^{{\left(2\right)}^{\prime}}$$

(3.11)

A smooth estimate $\widehat{f}\left(z\right)$ of *f*(*z*) can be differentiated to give estimated values of **Cov**_{1} and **cov**_{1}, for example,

$${\mathrm{sd}}_{1}\left\{{\widehat{F}}_{k}\right\}={\left({\widehat{\mathbf{Cov}}}_{1}\right)}_{\mathit{kk}}^{\frac{1}{2}}=\frac{{\widehat{\sigma}}_{0}^{2}\widehat{\alpha}}{\sqrt{2}}\mid {\widehat{f}}^{\left(1\right)}\left({x}_{k}\right)\mid $$

(3.12)

for the correlation penalty standard deviation of $\widehat{F}\left({x}_{k}\right)$ (2.27). (This provides the second term in formula (1.4).) The heavy curve in Figure 4 shows (3.12) for the leukemia data, using $\widehat{\sigma}_0 = 1.68$ and $\widehat{\alpha} = .11$.
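The homogeneity-assumption estimate (3.12) needs only the smooth fit and its numerical derivative. A minimal sketch, using the $\mathcal{N}(0, \sigma_0^2)$ density as a stand-in for the actual Poisson-spline fit $\widehat{f}$:

```python
import numpy as np

def sd1_curve(x, f_hat, sigma0, alpha):
    """Correlation-penalty sd (3.12): sigma0^2 * alpha * |f_hat'(x)| / sqrt(2),
    with f_hat'(x) obtained by numerical differentiation of the smooth fit."""
    f1 = np.gradient(f_hat, x)
    return sigma0**2 * alpha * np.abs(f1) / np.sqrt(2.0)

sigma0, alpha = 1.68, 0.11                 # values quoted for the leukemia data
x = np.linspace(-8.0, 8.0, 801)
f_hat = np.exp(-0.5 * (x / sigma0)**2) / (sigma0 * np.sqrt(2.0 * np.pi))
sd1 = sd1_curve(x, f_hat, sigma0, alpha)
```

For this stand-in density *f*^{(1)}(*x*) = −*x f*(*x*)/σ_{0}^{2}, so (3.12) reduces to α|*x*|*f*(*x*)/√2, which the numerical version matches closely.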

Suppose we are unwilling to make the homogeneity assumption. A straightforward approach to estimating **Cov**_{1} or **cov**_{1} requires assessments of the parameters (*p _{c}*, *μ _{c}*, *σ _{c}*) entering (2.32), as in Table 3 for the leukemia data.

The dashed curve in Figure 4 shows sd_{1}(${\widehat{F}}_{k}$) estimated directly from (2.32)-(2.33) using the values in Table 3. It is similar to the homogeneity estimate (3.12) except in the extreme tails.

Formula (1.4) for the standard deviation of $\widehat{F}\left(x\right)$ was tested in a simulation experiment. The specifications were the same as in Figure 3, with *N* = 6000, *α* = .10, and two classes of *z*-values (2.24). One hundred ** X** matrices were generated as in the simulation for Table 2, each yielding a vector of 6000 correlated *z*-values to which formula (1.4) was applied.

The solid curve in Figure 5 shows the average of the $\widehat{\mathrm{sd}}$ values for *x* between −4 and 4.5, with solid bars indicating standard deviations of the 100 $\widehat{\mathrm{sd}}\u2019\mathrm{s}$. There is a good match of the average with the exact sd curve from Figure 3. The error bars indicate moderate variability across the replications. The average for ${\widehat{\mathrm{sd}}}_{0}$, dashed curve, agrees with the corresponding curve in Figure 3 and shows that correlation cannot be ignored in this situation.

Correlation usually degrades statistical accuracy, an important question for the data analyst being the severity of its effects on the estimates and tests at hand. The purpose of Section 2 and Section 3 was to develop practical methods for honestly assessing the accuracy of inferences made in the presence of large-scale correlation. This section presents a few examples of the methodology in action.

We have already seen one example: in Table 1 the accuracy of $\widehat{F}\left(x\right)$, the right-sided empirical cdf for the leukemia data, computed from the usual binomial formula that assumes independence among the *z*-values,

$${\widehat{\mathrm{sd}}}_{0}={\left\{\widehat{F}\left(x\right)\left(1-\widehat{F}\left(x\right)\right)/N\right\}}^{\frac{1}{2}},$$

(4.1)

was compared with $\widehat{\mathrm{sd}}$ from formula (1.4) in which the correlation penalty term was included: $\widehat{\mathrm{sd}}$ more than doubled ${\widehat{\mathrm{sd}}}_{0}$ over most of the range.

Suppose we assume, as in Efron (2008), that each of the *N* cases (the *N* genes in the leukemia study) is either *null* or *non-null* with prior probability *p*_{0} or *p*_{1} = 1 − *p*_{0}, and with the corresponding *z*-values having density either *f*_{0}(*z*) or *f*_{1}(*z*),

$$\begin{array}{cc}{p}_{0}=\mathrm{Pr}\left\{\text{null}\right\}\hfill & {f}_{0}\left(z\right)\phantom{\rule{thinmathspace}{0ex}}\text{density if null},\hfill \\ {p}_{1}=\mathrm{Pr}\left\{\text{non-null}\right\}\hfill & {f}_{1}\left(z\right)\phantom{\rule{thinmathspace}{0ex}}\text{density if non-null}.\hfill \end{array}$$

(4.2)

Let *F*_{0} and *F*_{1} be the right-sided cdfs of *f*_{0} and *f*_{1}, and *F* the mixture cdf

$$F\left(x\right)={p}_{0}{F}_{0}\left(x\right)+{p}_{1}{F}_{1}\left(x\right).$$

(4.3)

The probability of a case being null given that *z* exceeds *x* is

$$\mathrm{Fdr}\left(x\right)\equiv \mathrm{Pr}\{\text{null}\mid z\ge x\}=\frac{{p}_{0}{F}_{0}\left(x\right)}{F\left(x\right)}$$

(4.4)

according to Bayes theorem, “Fdr” standing for false discovery rate.

If *p*_{0} and *F*_{0} are known then Fdr has the obvious estimate

$$\widehat{\mathrm{Fdr}}\left(x\right)={p}_{0}{F}_{0}\left(x\right)\u2215\widehat{F}\left(x\right)$$

(4.5)

(2.1). Benjamini and Hochberg's celebrated 1995 algorithm uses $\widehat{\mathrm{Fdr}}\left(x\right)$ for simultaneous hypothesis testing, but it can also be thought of as an empirical Bayes estimator of the Bayesian probability Fdr(*x*). The bottom row of Table 1 shows $\widehat{\mathrm{Fdr}}\left(x\right)$ for the leukemia data, taking *p*_{0} = .93 and *F*_{0} ~ *N*(.09, 1.68^{2}) as in Figure 1. (Later we will do a more ambitious calculation taking into account the estimation of *p*_{0} and *F*_{0}.)

The coefficient of variation for $\widehat{\mathrm{Fdr}}\left(x\right)$ approximately equals that for $\widehat{F}\left(x\right)$ (when *p*_{0}*F*_{0}(*x*) is known in (4.5)). At *x* = 5 we have $\widehat{\mathrm{Fdr}}\left(5\right)=.15$, with coefficient of variation about .19. An $\widehat{\mathrm{Fdr}}$ of .15 might be considered small enough to trigger significance in the Benjamini–Hochberg algorithm, but in any case it seems clear that the probability of being null is quite low for the 71 genes having *z _{i}* above 5. Even taking account of correlation effects, we have a rough upper confidence limit of .21 (i.e., .15 · (1 + 2 · .19)) for Fdr(5).
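The estimate (4.5) itself is a one-liner once the null parameters are fixed. The sketch below uses the empirical null quoted above (*p*_{0} = .93, *F*_{0} ~ *N*(.09, 1.68^{2})); the ten *z*-values are invented for illustration, not taken from the leukemia data.

```python
import math

def phi_plus(u):
    """Right-sided standard normal cdf, Phi+(u) = 1 - Phi(u)."""
    return 0.5 * math.erfc(u / math.sqrt(2.0))

def Fdr_hat(x, z_values, p0, mu0, sigma0):
    """Empirical Bayes estimate (4.5), with null cdf F0 ~ N(mu0, sigma0^2)."""
    F0 = phi_plus((x - mu0) / sigma0)                       # numerator's F0(x)
    F_emp = sum(z >= x for z in z_values) / len(z_values)   # F^(x) of (2.1)
    return p0 * F0 / F_emp

# Invented z-values; null parameters as quoted for Figure 1
z = [0.2, -1.1, 3.4, 5.2, 0.7, 4.8, -0.4, 2.9, 6.1, 1.3]
fdr_3 = Fdr_hat(3.0, z, p0=0.93, mu0=0.09, sigma0=1.68)
```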

Next we consider accuracy estimates for a general class of statistics *Q*(** y**), where *Q* is a smoothly differentiable function of the count vector ** y** (2.3), with derivative matrix

$$\mathit{dQ}=\widehat{\mathit{D}}\mathit{dy}\phantom{\rule{1em}{0ex}}\left[{\widehat{D}}_{\mathit{jk}}=\partial {Q}_{j}\u2215\partial {y}_{k}\right].$$

(4.6)

If $\widehat{\mathbf{cov}}\left(\mathit{y}\right)$ is a covariance estimate for ** y**, obtained perhaps as in (2.12), (3.8), or (3.11), then the usual delta-method estimate for **cov**(*Q*) is

$$\widehat{\mathbf{cov}}\left(Q\right)=\widehat{\mathit{D}}\widehat{\mathbf{cov}}\left(\mathit{y}\right){\widehat{\mathit{D}}}^{\prime}.$$

(4.7)

In a theoretical context, where **cov**(** y**) is known, we might instead use

$$\mathrm{cov}\left(Q\right)\doteq \mathit{D}\mathbf{cov}\left(\mathit{y}\right){\mathit{D}}^{\prime}$$

(4.8)

now with the derivative matrix ** D** evaluated at the expectation of ** y**.
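The delta-method recipe (4.7)-(4.8) can be checked on a case where the derivative matrix is exact: for *Q*(** y**) = ** By**/*N* as in (2.26), $\widehat{\mathit{D}}$ = ** B**/*N*, and with a multinomial **cov**(** y**) as in (3.7) the resulting diagonal reproduces the binomial variances of $\widehat{F}_k$. The bin probabilities below are hypothetical.

```python
import numpy as np

def delta_cov(D, cov_y):
    """Delta-method covariance (4.7)/(4.8): D cov(y) D'."""
    return D @ cov_y @ D.T

K, N = 5, 200
B = np.triu(np.ones((K, K)))                 # B of (2.25): 1 for k <= k'
D = B / N                                    # exact derivative of Q(y) = By/N
p = np.array([0.1, 0.3, 0.3, 0.2, 0.1])      # hypothetical bin probabilities
cov_y = N * (np.diag(p) - np.outer(p, p))    # multinomial covariance, as in (3.7)
cov_F = delta_cov(D, cov_y)
```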

Model (4.2) yields the *local false discovery rate*

$$\mathrm{fdr}\left(x\right)\equiv \mathrm{Pr}\{\text{null}\mid z=x\}={p}_{0}{f}_{0}\left(x\right)\u2215f\left(x\right),$$

(4.9)

*f*(*x*) being the mixture density

$$f\left(x\right)={p}_{0}{f}_{0}\left(x\right)+{p}_{1}{f}_{1}\left(x\right);$$

(4.10)

fdr(*x*) is inferentially more appropriate than Fdr(*x*) from a Bayesian point of view, but it is not as immediately available since it involves estimating the density *f*(*x*). However, because *z*-value densities are mixtures of near-normals as shown in Section 5, it is usually straightforward to carry out the estimation.

`Locfdr`, the algorithm discussed in Efron (2007a, 2008), estimates *f*(*x*) by means of Poisson regression of the counts *y _{k}* as a spline function of the bin midpoints *x _{k}*. With ** M** denoting the *K* × *m* structure matrix of the spline basis and $\widehat{\ell}$ the vector of fitted log counts, standard generalized linear model calculations give the differential relationship

$$d\widehat{\ell}=\widehat{\mathit{D}}d\mathit{y}\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\text{with}\phantom{\rule{thinmathspace}{0ex}}\widehat{\mathit{D}}=\mathit{M}{\left({\mathit{M}}^{\prime}\phantom{\rule{thinmathspace}{0ex}}\text{diag}\left(N\Delta \widehat{\mathit{f}}\right)\mathit{M}\right)}^{-1}{\mathit{M}}^{\prime}$$

(4.11)
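The derivative matrix in (4.11) can be reproduced in a few lines. The sketch below fits the Poisson regression by Newton's method on a cubic polynomial basis, a stand-in for `locfdr`'s natural spline basis ** M** rather than the package's actual implementation, and then forms $\widehat{\mathit{D}} = \mathit{M}{\left({\mathit{M}}^{\prime}\,\mathrm{diag}(N\Delta \widehat{\mathit{f}})\mathit{M}\right)}^{-1}{\mathit{M}}^{\prime}$, the fitted means playing the role of $N\Delta\widehat{f}$.

```python
import numpy as np

def poisson_fit(M, y, iters=10):
    """Poisson regression log(mu) = M beta, fitted by Newton's method."""
    beta, *_ = np.linalg.lstsq(M, np.log(y + 0.5), rcond=None)  # rough start
    for _ in range(iters):
        mu = np.exp(M @ beta)
        # Newton step: (M' diag(mu) M)^{-1} M' (y - mu)
        beta = beta + np.linalg.solve(M.T @ (mu[:, None] * M), M.T @ (y - mu))
    return beta

# Cubic polynomial basis in the bin midpoints (stand-in for a spline basis)
x = np.linspace(-4.0, 4.0, 41)
M = np.vander(x / 4.0, 4, increasing=True)
rng = np.random.default_rng(1)
y = rng.poisson(100.0 * np.exp(-x**2 / 2.0))     # simulated bin counts
beta = poisson_fit(M, y)
mu = np.exp(M @ beta)                            # fitted values: N*Delta*f_hat
D = M @ np.linalg.solve(M.T @ (mu[:, None] * M), M.T)   # D-hat of (4.11)
```

A quick internal check is the weighted hat-matrix identity $\widehat{\mathit{D}}\,\mathrm{diag}(\widehat{\mu})\mathit{M} = \mathit{M}$, which holds for any positive fitted values.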

For any function *v*(*x*) define the vector

$$\mathit{v}={({v}_{1},{v}_{2},\dots ,{v}_{k},\dots ,{v}_{K})}^{\prime}={(\dots ,v\left({x}_{k}\right),\dots )}^{\prime}$$

(4.12)

as with $\widehat{\mathit{F}}$ and $\widehat{\ell}$ above. If we define

$$\widehat{\mathrm{lfdr}}\left(x\right)\equiv \mathrm{log}\left(\widehat{\mathrm{fdr}}\left(x\right)\right)=\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}\left({p}_{0}{f}_{0}\left(x\right)\right)-\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}\left(f\left(x\right)\right)$$

(4.13)

then

$$\widehat{\mathbf{lfdr}}\left(x\right)=\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}\left({p}_{0}\right)+\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}\left({\mathit{f}}_{0}\right)-\widehat{\ell}$$

(4.14)

implying, if *p*_{0} and *f*_{0} are known, that

$$\mathbf{cov}\left(\widehat{\mathbf{lfdr}}\left(x\right)\right)=\mathrm{cov}\left(\widehat{\ell}\right)\doteq \mathit{D}\mathbf{cov}\left(\mathit{y}\right){\mathit{D}}^{\prime}$$

(4.15)

with ** D** the derivative matrix $\widehat{\mathit{D}}$ of (4.11).

The solid curves in Figure 6 plot standard deviations for $\mathrm{log}\left(\widehat{\mathrm{fdr}}\left(x\right)\right)$, obtained as the square root of the diagonal elements of $\mathbf{cov}\left(\widehat{\mathbf{lfdr}}\right)$ (4.15), for model (2.24) with *N* = 6000 and rms correlation *α* equal to 0, .1, or .2; see Remark D. The horizontal axes are plotted in terms of the upper percentiles of *F*(*x*), the right end of each plot corresponding to the far right tail of the *z*-value distribution. For *α* = 0, $\mathrm{sd}\left(\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}\widehat{\mathrm{fdr}}\left(x\right)\right)$ increases from .03 to .08 as we move from the fifth to the first percentile of *F*. The coefficient of variation (CV) of $\widehat{\mathrm{fdr}}\left(x\right)$ nearly equals $\mathrm{sd}\left(\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}\widehat{\mathrm{fdr}}\left(x\right)\right)$, so $\widehat{\mathrm{fdr}}\left(x\right)$ is quite accurately estimated for *α* = 0, but substantially less so for *α* = .2. Reducing *N* to 1500 doubles the standard deviation estimates for *α* = 0, but has less effect in the correlated situations: for *α* = .1, for example, the increase is only 20% at percentile .025. Simulations confirmed the correctness of these results.

Intuitively it seems that fdr should be harder to estimate than Fdr, but that is not what Figure 6 shows. Let ${\widehat{\mathrm{lFdr}}}_{k}=\mathrm{log}\left(\widehat{\mathrm{Fdr}}\left({x}_{k}\right)\right)$, the log of the nonparametric estimate (4.5) at *x _{k}*; straightforward differentiation gives

$${\widehat{D}}_{\mathit{jk}}={B}_{\mathit{jk}}\u2215\left(N\cdot {\widehat{F}}_{j}\right)$$

(4.16)

with ** B** as in (2.25), giving an estimate of $\mathbf{cov}\left(\widehat{\mathbf{lFdr}}\right)$ via (4.7).

Here we are comparing the nonparametric estimator $\widehat{\mathrm{Fdr}}\left(x\right)$ (4.5) with the parametric estimator $\widehat{\mathrm{fdr}}\left(x\right)$. The Poisson spline estimate $\widehat{f}\left(x\right)$ that gave $\widehat{\mathrm{fdr}}\left(x\right)$ can be summed to give parametric estimates of *F*(*x*) and Fdr(*x*), say $\stackrel{~}{\mathrm{Fdr}}\left(x\right)$. Straightforward calculations show that the derivative matrix $\widehat{\mathit{D}}$ for $\stackrel{~}{\mathrm{Fdr}}\left(x\right)$ is

$$\widehat{\mathit{D}}=\widehat{\mathit{C}}{\widehat{\mathit{D}}}_{f}\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\text{where}\phantom{\rule{thinmathspace}{0ex}}{C}_{\mathit{jk}}={B}_{\mathit{jk}}{\widehat{f}}_{k}\u2215{\widehat{F}}_{j}$$

(4.17)

with ** B** from (2.25) and ${\widehat{\mathit{D}}}_{f}$ the derivative matrix (4.11).

All of these calculations assumed that *p*_{0} and *f*_{0}(*z*) (or *F*_{0}(*z*)) in (4.2) were known. This is unrealistic in situations like the leukemia study, where there is clear evidence that a textbook *N*(0, 1) theoretical null distribution is too narrow. Estimating an “empirical null” distribution, such as *N*(.09, 1.68^{2}) in Figure 1, is both necessary and feasible (see Efron, 2008) but can greatly increase variability, as discussed next.

Formula (4.14) becomes

$$\widehat{\mathbf{lfdr}}=\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}\left({\widehat{p}}_{0}\right)+\mathrm{log}\left({\widehat{\mathit{f}}}_{0}\right)-\widehat{\ell}$$

(4.18)

when *p*_{0} and *f*_{0} are themselves estimated. The corresponding derivative matrix $\widehat{\mathit{D}}=d\widehat{\mathbf{lfdr}}\u2215d\mathit{y}$ in (4.6) appears as equation (5.8) in Efron (2007b), this formula applying to the *central matching* method for estimating *p*_{0}*f*_{0}(*z*). The second row of Table 4 shows $\mathrm{sd}\left\{\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}\widehat{\mathrm{fdr}}\left(x\right)\right\}$ obtained from $\widehat{\mathit{D}}\widehat{\mathbf{cov}}\left(\mathit{y}\right){\widehat{\mathit{D}}}^{\prime}$ as in (4.7).

Comparison of $\mathrm{sd}\left\{\mathrm{log}\left(\widehat{\mathrm{fdr}}\left(x\right)\right)\right\}$ using the empirical null versus the theoretical null for the situation in the middle panel of Figure 6. The empirical null standard deviations are much larger, as seen also in Efron (2007b).

Here are some points to note:

- Accuracy is worse for $\mathrm{log}\left(\widehat{\mathrm{Fdr}}\right)$ than for $\mathrm{log}\left(\widehat{\mathrm{fdr}}\right)$ in the top line of Table 4.
- Accuracy is somewhat better when *p*_{0}*f*_{0}(*z*) is estimated by the MLE option in `locfdr` (Lemma 2 of Efron, 2007b).
- The big empirical null standard deviations in Table 4 are at least partially misleading: some of the variability in $\widehat{\mathrm{fdr}}\left(x\right)$ is “signal” rather than “noise”, tracking conditional changes in the appropriate value of fdr(*x*). See Figure 2 of Efron (2007a) and the discussion in that paper.

Remark H of Section 6 describes a parametric bootstrap resampling scheme that avoids the Taylor series computations of (4.7), but which has not yet been carefully investigated.

The results of the previous sections depend on the variates *z _{i}* having normal distributions (1.5). By definition, a *z*-value is a statistic having a standard normal distribution under a null hypothesis of interest, but that by itself says nothing about its distribution under alternatives. This section shows that *z*-values typically remain nearly normal, though not *N*(0, 1), in non-null situations.

Figure 7 illustrates the phenomenon for the case of non-central *t* distributions,

$$z={\Phi}^{-1}\left({F}_{\nu}\left(t\right)\right)\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}t\sim {t}_{\nu}\left(\delta \right),$$

(5.1)

the notation indicating a non-central *t* variable with *ν* degrees of freedom and non-centrality parameter *δ* (*not δ*^{2}), as described in Chapter 31 of Johnson and Kotz (1970). Here, as in (1.2), *F _{ν}* is the cdf of a central *t* variable with *ν* degrees of freedom, so that *z* is exactly *N*(0, 1) when *δ* = 0.
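The transformation (5.1) is easy to examine numerically. A sketch using scipy's non-central *t*, with illustrative choices *ν* = 20 and *δ* = 2 (not taken from the paper's figures):

```python
import numpy as np
from scipy.stats import nct, norm, t as t_dist

rng = np.random.default_rng(4)
nu, delta = 20, 2.0                       # illustrative values
tvals = nct.rvs(nu, delta, size=100000, random_state=rng)
z = norm.ppf(t_dist.cdf(tvals, nu))       # z = Phi^{-1}(F_nu(t)), as in (5.1)

zc = z - z.mean()
skew_z = (zc**3).mean() / (zc**2).mean() ** 1.5   # near zero: z is close to normal
```

Although *t* itself is noticeably right-skewed, the transformed variate *z* is shifted away from zero yet nearly symmetric, illustrating the non-null near-normality of Figure 7.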

To begin the theoretical development, suppose that *y*_{1}, *y*_{2}, … , *y*_{n} are independent and identically distributed (iid) observations sampled from *F _{θ}*, a member of a one-parameter family of distributions,

$$\mathcal{F}=\{{F}_{\theta},\theta \in \u03f4\}$$

(5.2)

having its moment parameters {mean, standard deviation, skewness, kurtosis}, denoted

$$\{{\mu}_{\theta},{\sigma}_{\theta},{\gamma}_{\theta},{\delta}_{\theta}\},$$

(5.3)

defined differentiably in *θ*. The results that follow are heuristic in the sense that they only demonstrate second-order Cornish–Fisher expansion properties, with no attempt to provide strict error bounds.

Under the null hypothesis *H*_{0} : *θ* = 0, which we can write as

$${H}_{0}:y\sim \{{\mu}_{0},{\sigma}_{0},{\gamma}_{0},{\delta}_{0}\},$$

(5.4)

the standardized variate

$${Y}_{0}=\sqrt{n}\left(\frac{\stackrel{-}{y}-{\mu}_{0}}{{\sigma}_{0}}\right)\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\left[\stackrel{-}{y}=\sum _{i=1}^{n}{y}_{i}\u2215n\right]$$

(5.5)

satisfies

$${H}_{0}:{Y}_{0}\sim \left\{0,1,\frac{{\gamma}_{0}}{\sqrt{n}},\frac{{\delta}_{0}}{n}\right\}.$$

(5.6)

Normality can be improved to second order by means of a Cornish–Fisher transformation,

$${Z}_{0}={Y}_{0}-\frac{{\gamma}_{0}}{6\sqrt{n}}\phantom{\rule{thinmathspace}{0ex}}({Y}_{0}^{2}-1)$$

(5.7)

which reduces the skewness in (5.6) from $O\left({n}^{-\frac{1}{2}}\right)$ to *O*(*n*^{−1}),

$${H}_{0}:{Z}_{0}\sim \{0,1,0,0\}+O\phantom{\rule{thinmathspace}{0ex}}\left({n}^{-1}\right).$$

(5.8)

See Chapter 1 of Johnson and Kotz (1970) or, for much greater detail, Section 2.2 of Hall (1992). We can interpret (5.8) as saying that *Z*_{0} is a *second-order z*-*value*,

$${H}_{0}:{Z}_{0}\sim \mathcal{N}(0,1)+{O}_{p}\phantom{\rule{thinmathspace}{0ex}}\left({n}^{-1}\right),$$

(5.9)

i.e., a test statistic giving standard normal *p*-values accurate to *O*(*n*^{−1}).
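The skewness reduction behind (5.7)-(5.9) can be checked by simulation. A sketch with standard exponential observations, for which *μ*_{0} = *σ*_{0} = 1 and *γ*_{0} = 2 (an illustrative choice, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)

def skewness(v):
    v = v - v.mean()
    return (v**3).mean() / (v**2).mean() ** 1.5

# iid standard exponential observations: mu0 = sigma0 = 1, gamma0 = 2
n, B = 40, 200000
ybar = rng.exponential(size=(B, n)).mean(axis=1)
Y0 = np.sqrt(n) * (ybar - 1.0)                        # standardized mean (5.5)
gamma0 = 2.0
Z0 = Y0 - gamma0 / (6 * np.sqrt(n)) * (Y0**2 - 1)     # Cornish-Fisher (5.7)
skew_Y0, skew_Z0 = skewness(Y0), skewness(Z0)
```

The skewness of *Y*_{0} is close to its theoretical value *γ*_{0}/√*n* ≈ .316, while that of *Z*_{0} is an order of magnitude smaller, as (5.8) predicts.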

Suppose now that *H*_{0} is false, and instead *H*_{1} is true, with *y*_{1}, *y*_{2}, … , *y*_{n} iid according to

$${H}_{1}:y\sim \{{\mu}_{1},{\sigma}_{1},{\gamma}_{1},{\delta}_{1}\}$$

(5.10)

rather than (5.4). Setting

$${Y}_{1}=\sqrt{n}\left(\frac{\stackrel{-}{y}-{\mu}_{1}}{{\sigma}_{1}}\right)\phantom{\rule{1em}{0ex}}\text{and}\phantom{\rule{1em}{0ex}}{Z}_{1}={Y}_{1}-\frac{{\gamma}_{1}}{6\sqrt{n}}({Y}_{1}^{2}-1)$$

(5.11)

makes *Z*_{1} second-order normal under *H*_{1},

$${H}_{1}:{Z}_{1}\sim \mathcal{N}(0,1)+{O}_{p}\left({n}^{-1}\right).$$

(5.12)

We wish to calculate the distribution of *Z*_{0} (5.7) under *H*_{1}. Define

$$c={\sigma}_{1}\u2215{\sigma}_{0},\phantom{\rule{1em}{0ex}}d=\sqrt{n}({\mu}_{1}-{\mu}_{0})\u2215{\sigma}_{0},\phantom{\rule{1em}{0ex}}\text{and}\phantom{\rule{1em}{0ex}}{g}_{0}={\gamma}_{0}\u2215\left(6\sqrt{n}\right).$$

(5.13)

Some simple algebra yields the following relationship between *Z*_{0} and *Z*_{1}.

*Under definitions* (5.7), (5.11) *and* (5.13),

$${Z}_{0}=M+{\mathit{SZ}}_{1}+{g}_{0}\left\{\left(\frac{{\gamma}_{1}}{{\gamma}_{0}}S-{c}^{2}\right)({Y}_{1}^{2}-1)+(1-{c}^{2})\right\}$$

(5.14)

*where*

$$M=d\cdot (1-{\mathit{dg}}_{0})\phantom{\rule{1em}{0ex}}\mathit{and}\phantom{\rule{1em}{0ex}}S=c\cdot (1-2{\mathit{dg}}_{0}).$$

(5.15)
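Lemma 3 is an exact algebraic identity, not an approximation, so it can be verified numerically for arbitrary values; a sketch with illustrative moment choices:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 25
mu0, sigma0, gamma0 = 0.0, 1.0, 0.8      # null moments (illustrative)
mu1, sigma1, gamma1 = 0.3, 1.2, 0.5      # alternative moments (illustrative)

# definitions (5.13) and (5.15)
c = sigma1 / sigma0
d = np.sqrt(n) * (mu1 - mu0) / sigma0
g0 = gamma0 / (6 * np.sqrt(n))
g1 = gamma1 / (6 * np.sqrt(n))
M = d * (1 - d * g0)
S = c * (1 - 2 * d * g0)

# any values of ybar work, since (5.14) is an identity
ybar = mu1 + sigma1 * rng.standard_normal(10) / np.sqrt(n)
Y1 = np.sqrt(n) * (ybar - mu1) / sigma1
Y0 = np.sqrt(n) * (ybar - mu0) / sigma0
Z0 = Y0 - g0 * (Y0**2 - 1)               # (5.7)
Z1 = Y1 - g1 * (Y1**2 - 1)               # (5.11)
rhs = M + S * Z1 + g0 * ((gamma1 / gamma0 * S - c**2) * (Y1**2 - 1) + (1 - c**2))
```

The two sides of (5.14) agree to machine precision.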

The asymptotic relationships claimed at the start of this section are easily derived from Lemma 3. We consider a sequence of alternatives *θ _{n}* approaching the null hypothesis value *θ*_{0} at rate

$${\theta}_{n}-{\theta}_{0}=O\left({n}^{-\frac{1}{2}}\right).$$

(5.16)

The parameter $d=\sqrt{n}\left({\mu}_{{\theta}_{n}}-{\mu}_{0}\right)\u2215{\sigma}_{0}$ in (5.13) is then of order *O*(1), so (5.15) gives

$$M=d(1-{\mathit{dg}}_{0})=d(1-d{\gamma}_{0}\u2215\left(6\sqrt{n}\right)),$$

(5.17)

while standard Taylor series calculations give

$$c=1+\frac{{\stackrel{.}{\sigma}}_{0}}{{\stackrel{.}{\mu}}_{0}}\frac{d}{\sqrt{n}}+O\left({n}^{-1}\right)\phantom{\rule{1em}{0ex}}\text{and}\phantom{\rule{1em}{0ex}}S=1+\left(\frac{{\stackrel{.}{\sigma}}_{0}}{{\stackrel{.}{\mu}}_{0}}-\frac{{\gamma}_{0}}{3}\right)\frac{d}{\sqrt{n}}+O\left({n}^{-1}\right),$$

(5.18)

the dot indicating differentiation with respect to *θ*.

*Under model* (5.2), (5.16), *and the assumptions of Lemma 3,*

$${Z}_{0}\sim \mathcal{N}(M,{S}^{2})+{O}_{p}\left({n}^{-1}\right)$$

(5.19)

*with M and S as given in* (5.17)-(5.18). *Moreover,*

$$\frac{\mathit{dS}}{\mathit{dM}}{\mid}_{{\theta}_{0}}=\frac{1}{\sqrt{n}}\left(\frac{d\sigma}{d\mu}{\mid}_{{\theta}_{0}}-\frac{{\gamma}_{0}}{3}\right)+O\left({n}^{-1}\right).$$

(5.20)

*Proof*. The proof of Theorem 2 uses Lemma 3, with *θ _{n}* playing the role of *θ*_{1} in (5.10)-(5.15); since *S*, *c*, and *γ*_{1}/*γ*_{0} all equal $1+O\left({n}^{-\frac{1}{2}}\right)$ while ${g}_{0}=O\left({n}^{-\frac{1}{2}}\right)$, the bracketed term in (5.14) is ${O}_{p}\left({n}^{-1}\right)$ and can be absorbed into the error term of (5.19).

Theorem 2 supports our claim that, under non-null alternatives, the null hypothesis normality of *Z*_{0} degrades more slowly than its unit standard deviation, the comparison being ${O}_{p}\left({n}^{-1}\right)$ for normality in (5.19) versus $O\left({n}^{-\frac{1}{2}}\right)$ for the change in *S* at (5.18).

One-parameter exponential families are an important special case of (5.2). With *θ* the natural parameter and *y* the sufficient statistic, i.e., with densities proportional to $\mathrm{exp}\left\{\theta y\right\}$ against a fixed carrier, we have $d\sigma \u2215d\mu {\mid}_{{\theta}_{0}}={\gamma}_{0}\u22152$, and (5.20) becomes

$$\frac{\mathit{dS}}{\mathit{dM}}{\mid}_{{\theta}_{0}}=\frac{{\gamma}_{0}}{6\sqrt{n}}+O\left({n}^{-1}\right).$$

(5.21)

The parameter ${\gamma}_{0}\u2215\left(6\sqrt{n}\right)$ is called the *acceleration* in Efron (1987), interpreted as “the rate of change of standard deviation with respect to expectation on the normalized scale,” which agrees with its role in (5.21).
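The step from (5.20) to (5.21) follows from the standard cumulant relations for exponential families; a brief sketch, writing *ψ*(*θ*) for the cumulant generating function:

```latex
% mu = psi'(theta),  sigma^2 = psi''(theta),  third cumulant psi''' = gamma*sigma^3
\frac{d\sigma}{d\mu}\Big|_{\theta_0}
   = \frac{d\sigma/d\theta}{d\mu/d\theta}
   = \frac{\psi'''/(2\sigma)}{\sigma^{2}}
   = \frac{\gamma_0}{2},
\qquad\text{hence}\qquad
\frac{dS}{dM}\Big|_{\theta_0}
   = \frac{1}{\sqrt{n}}\left(\frac{\gamma_0}{2}-\frac{\gamma_0}{3}\right)+O\left(n^{-1}\right)
   = \frac{\gamma_0}{6\sqrt{n}}+O\left(n^{-1}\right).
```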

As an example, suppose ${y}_{1},{y}_{2},\dots ,{y}_{n}\stackrel{\mathrm{iid}}{\sim}\theta {\Gamma}_{1}$, ${\Gamma}_{n}$ indicating a standard gamma distribution with shape parameter *n*, so that $n\stackrel{-}{y}\u2215\theta \sim {\Gamma}_{n}$. The exact *z*-value for testing ${H}_{0}:\theta ={\theta}_{0}$ is

$${Z}_{0}={\Phi}^{-1}\left({G}_{n}(n\stackrel{-}{y}\u2215{\theta}_{0})\right)$$

(5.22)

where *G _{n}* is the cdf of ${\Gamma}_{n}$.
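In this gamma example *Z*_{0} in (5.22) is exactly *N*(0, 1) under *H*_{0} (the probability integral transform), and nearly normal under alternatives. A simulation sketch with illustrative *n* = 25 and alternative *θ* = 1.4:

```python
import numpy as np
from scipy.stats import gamma, norm

rng = np.random.default_rng(5)
n, theta0 = 25, 1.0
Gn = gamma(a=n)                                # standard gamma Gamma_n

# under H0 (theta = theta0): Z0 is exactly N(0,1); under theta = 1.4 it is
# close to normal but shifted, as Theorem 2 predicts
ybar_null = theta0 * rng.gamma(n, size=100000) / n
ybar_alt = 1.4 * rng.gamma(n, size=100000) / n
Z0_null = norm.isf(Gn.sf(n * ybar_null / theta0))   # stable form of (5.22)
Z0_alt = norm.isf(Gn.sf(n * ybar_alt / theta0))

ac = Z0_alt - Z0_alt.mean()
skew_alt = (ac**3).mean() / (ac**2).mean() ** 1.5
```

The survival-function form `norm.isf(Gn.sf(...))` is mathematically identical to ${\Phi}^{-1}\left({G}_{n}\left(\cdot \right)\right)$ but numerically stable in the far right tail.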

So far we have considered *z*-values obtained from an average of iid observations, but the results of Theorem 2 hold in greater generality. Section 5 of Efron (1987) considers one-parameter families where $\widehat{\theta}$, an estimator of *θ*, has MLE-like asymptotic properties in terms of its bias, standard deviation, skewness and kurtosis,

$$\widehat{\theta}\sim \{\theta +{\beta}_{\theta}\u2215n,{\sigma}_{\theta}\u2215\sqrt{n},{\gamma}_{\theta}\u2215\sqrt{n},{\delta}_{\theta}\u2215n\}.$$

(5.23)

Letting $\widehat{\theta}$ play the role of $\stackrel{-}{y}$ and ${\mu}_{\theta}=\theta +{\beta}_{\theta}\u2215n$, the conclusions of Theorem 2 continue to apply.

Moving beyond one-parameter families, suppose $\mathcal{F}$ is a *p*-parameter exponential family, having densities proportional to $\mathrm{exp}\{{\eta}_{1}{x}_{1}+{\eta}_{2}^{\prime}{x}_{2}\}{g}_{0}({x}_{1},{x}_{2})$, where *η*_{1} and *x*_{1} are real-valued while *η*_{2} and *x*_{2} are (*p* − 1)-dimensional vectors, but where we are only interested in *η*_{1}, not the nuisance vector *η*_{2}. The *conditional* distribution of *x*_{1} given *x*_{2} is then a one-parameter exponential family with natural parameter *η*_{1}, which puts us back in the context of Theorem 2. Remark H of Section 6 suggests a further extension where the parameter of interest “ *θ*” can be a general real-valued function of *η*, not just a coordinate such as *η*_{1}.

The non-central *t* family does not meet the conditions of Lemma 3 or Theorem 2: (5.1) is symmetric in *δ* around zero, causing *γ*_{0} in (5.14) to equal zero and likewise the derivative in (5.20). Nevertheless, as Figure 7 shows, it does exhibit impressive non-null normality. Table 6 displays the moment parameters of *z* = Φ^{−1}(*F _{ν}*(*t*)), *t* ~ *t _{ν}*(*δ*), confirming the near-normality over a range of *ν* and *δ* values.

Microarray studies can be more elaborate than two-sample comparisons. Suppose that in addition to the *N* × *n* expression matrix ** X** we have measured a primary response variable for each of the *n* subjects, and let *t _{i}* be the *t*-statistic for gene *i* in a linear model involving *p* predictors. Then

$${z}_{i}={\Phi}^{-1}\left({F}_{n-p-1}\left({t}_{i}\right)\right)$$

(5.24)

is a *z*-value (1.1) under the usual Gaussian assumption, showing behavior like that in Table 6 for non-null genes.
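A sketch of (5.24) in the null case, regressing simulated “genes” on a common design with an intercept and *p* illustrative predictors; the resulting *z*-values are then exactly *N*(0, 1):

```python
import numpy as np
from scipy.stats import norm, t as t_dist

rng = np.random.default_rng(7)
n, p = 80, 3                                   # subjects, illustrative predictors
X = np.column_stack([np.ones(n), rng.standard_normal((n, p))])
df = n - p - 1                                 # degrees of freedom in (5.24)

N = 20000                                      # null "genes"
Y = rng.standard_normal((N, n))
G = np.linalg.inv(X.T @ X)
Beta = Y @ (X @ G)                             # OLS coefficients, one row per gene
resid = Y - Beta @ X.T
s2 = (resid**2).sum(axis=1) / df
tstat = Beta[:, 1] / np.sqrt(s2 * G[1, 1])     # t-statistic for predictor 1
z = norm.ppf(t_dist.cdf(tstat, df))            # (5.24)
```

Under the Gaussian null, each `tstat` is exactly *t _{n−p−1}*, so the transformed `z` values form an exact *N*(0, 1) sample; for non-null genes they would instead show the near-normal behavior of Table 6.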

Some remarks, proofs, and details relating to the previous sections are presented here.

The curve $\widehat{f}\left(z\right)$ in Figure 1 is a Poisson regression fit to the counts *y _{k}*, as a natural spline function of the bin centers *x _{k}*.

Section 3 of Efron (2008) defines the *non-null counts* ${y}_{k}^{\left(1\right)}=(1-\widehat{\mathrm{fdr}}\left({x}_{k}\right))\cdot {y}_{k}$. Since, under model (4.2), $1-\widehat{\mathrm{fdr}}\left({x}_{k}\right)$ estimates the proportion of non-null *z*-values in bin *k*, ${y}_{k}^{\left(1\right)}$ estimates the number of non-nulls. The ${y}_{k}^{\left(1\right)}$ values are plotted below the x axis in Figure 1, determining the “left” and “right” distribution parameters in Table 3. “Center” was determined by the empirical null fit from `locfdr`, using the MLE method described in Section 4 of Efron (2007b). This method tends to underestimate the non-null counts near *z* = 0, and also the *σ _{c}* values for the left and right classes, but increasing them to 1.68 had little effect on the dashed curve in Figure 4.

Let ${I}_{k}\left(i\right)$ be the indicator of case *i*'s *z*-value falling into bin *k*, so that the count of class-*c* cases in bin *k* is

$${y}_{\mathit{kc}}=\sum _{\mathit{c}}{I}_{k}\left(i\right),$$

(6.1)

the boldface subscript indicating summation over the members of class *c*. We first compute

$$E\left\{{y}_{\mathit{kc}}{y}_{\mathit{ld}}\right\}=E\left\{\sum _{\mathit{c}}\sum _{\mathit{d}}{I}_{k}\left(i\right){I}_{l}\left(j\right)\right\}={\Delta}^{2}\sum _{\mathit{c}}\sum _{\mathit{d}}{\phi}_{{\rho}_{\mathit{ij}}}({x}_{kc},{x}_{ld})(1-{\chi}_{\mathit{ij}})\u2215{\sigma}_{c}{\sigma}_{d}$$

(6.2)

following notation (2.5)-(2.10), with *χ _{ij}* the indicator of the event *i* = *j* (removing the diagonal terms). Averaging the *ρ _{ij}* over the correlation distribution gives

$$E\left\{{y}_{\mathit{kc}}{y}_{\mathit{ld}}\right\}={N}^{2}{\Delta}^{2}{p}_{c}({p}_{d}-{\chi}_{\mathit{cd}}\u2215N){\int}_{-1}^{1}{\phi}_{\rho}({x}_{\mathit{kc}},{x}_{\mathit{ld}})g\left(\rho \right)d\rho \u2215{\sigma}_{c}{\sigma}_{d}$$

(6.3)

under the assumption that the same correlation distribution *g*(*ρ*) applies across all class combinations. Since ${y}_{k}={\Sigma}_{c}{y}_{\mathit{kc}}$,

$$E\left\{{y}_{k}{y}_{l}\right\}={N}^{2}{\Delta}^{2}\sum _{c}\sum _{d}{p}_{c}({p}_{d}-{\chi}_{\mathit{cd}}\u2215N){\int}_{-1}^{1}{\phi}_{\rho}({x}_{\mathit{kc}},{x}_{\mathit{ld}})g\left(\rho \right)d\rho \u2215{\sigma}_{c}{\sigma}_{d},$$

(6.4)

the non-bold subscripts indicating summation over classes.

Subtracting

$$E\left\{{y}_{k}\right\}E\left\{{y}_{l}\right\}={N}^{2}{\Delta}^{2}\underset{c}{\Sigma}\underset{d}{\Sigma}\phi \left({x}_{\mathit{kc}}\right)\phi \left({x}_{\mathit{ld}}\right)\u2215{\sigma}_{c}{\sigma}_{d}$$

(6.5)

from (6.4) results, after some rearrangement, in

$$\begin{array}{cc}\hfill \mathrm{cov}({y}_{k},{y}_{l})=& {N}^{2}{\Delta}^{2}\underset{c}{\Sigma}\underset{d}{\Sigma}\frac{\phi \left({x}_{\mathit{kc}}\right)\phi \left({x}_{\mathit{ld}}\right)}{{\sigma}_{c}{\sigma}_{d}}\left\{{p}_{c}\left({p}_{d}-\frac{{\chi}_{\mathit{cd}}}{N}\right){\int}_{-1}^{1}\left(\frac{{\phi}_{\rho}({x}_{\mathit{kc}},{x}_{\mathit{ld}})}{\phi \left({x}_{\mathit{kc}}\right)\phi \left({x}_{\mathit{ld}}\right)}-1\right)g\left(\rho \right)d\rho \right\}\hfill \\ \hfill & \phantom{\rule{1em}{0ex}}-N{\Delta}^{2}\underset{c}{\Sigma}{p}_{c}\frac{\phi \left({x}_{\mathit{kc}}\right)\phi \left({x}_{\mathit{ld}}\right)}{{\sigma}_{c}{\sigma}_{d}}.\hfill \end{array}$$

(6.6)

Using ${\pi}_{\mathit{kc}}=\Delta \cdot \phi \left({x}_{\mathit{kc}}\right)\u2215{\sigma}_{c}$, expression (6.6) can be rewritten in the form of the covariance formulas of Section 2.

The case *k* = *l* proceeds in the same way, the only difference being the contribution of the diagonal terms *i* = *j*.

The assumption that *g*(*ρ*) is the same across all classes can be weakened for the rms approximations (2.23) and (2.33), where we only need the second moments *α*_{2} to be the same. In fact, the class structure can disappear entirely for rms formulas, as seen in (3.12).

Specifications (2.24) were recentered to give overall expectation 0 in (2.4), (2.5):

$$({p}_{0},{\mu}_{0},{\sigma}_{0})=(.95,-.125,1)\phantom{\rule{1em}{0ex}}\text{and}\phantom{\rule{1em}{0ex}}({p}_{1},{\mu}_{1},{\sigma}_{1})=(.05,2.38,1),$$

(6.7)

these being the parameter values used in Figures 2, 3, 5 and 6. Recentering overall expectations to zero is common in practice, a consequence of the data matrix ** X** having its column-wise means subtracted off.

The 6000 × 80 data matrices ** X** used in the simulations for Table 2 and Figure 5 had entries with means *δ _{ij}*,

$${\delta}_{\mathit{ij}}=.224\phantom{\rule{thinmathspace}{0ex}}{\mu}_{1}\phantom{\rule{1em}{0ex}}\text{for}\phantom{\rule{thinmathspace}{0ex}}i=1,2,\dots ,300,\phantom{\rule{1em}{0ex}}\text{and}\phantom{\rule{1em}{0ex}}{\delta}_{\mathit{ij}}=.224\phantom{\rule{thinmathspace}{0ex}}{\mu}_{0}\phantom{\rule{1em}{0ex}}\text{for}\phantom{\rule{thinmathspace}{0ex}}i>300;$$

(6.8)

*z*-values based on the difference of means between the last and first 40 “patients” then satisfy (2.4), (6.7). The correlation distribution *g*(*ρ*) was supported on two points, 20% *ρ* = .20 and 80% *ρ* = −.05, giving *α* = .10.
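The constant .224 in (6.8) is, up to rounding, the standard deviation of a difference of means between two groups of 40 unit-variance observations (an interpretation consistent with the two-sample setup above, stated here as a quick arithmetic check):

```python
import numpy as np

# sd of (mean of 40) - (mean of 40) for unit-variance entries:
# sqrt(1/40 + 1/40) = sqrt(.05) = .2236..., the scaling constant in (6.8)
scale = np.sqrt(1 / 40 + 1 / 40)
```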

The original entries *x _{ij}* of the leukemia data matrix were genetic expression levels obtained using Affymetrix oligonucleotide microarrays. For the analyses here, each column of ** X** was standardized to have mean 0 and variance 1.

Section 3 of Efron (2007a) discusses a hierarchical Poisson simulation scheme that can be adapted to the more general context of this paper. Following the notation in (3.11), we first simulate a vector ** u**,

$$\mathit{u}=N\Delta \left(\widehat{\mathit{f}}+\frac{{\sigma}_{0}^{2}}{\sqrt{2}}A{\widehat{\mathit{f}}}^{\left(2\right)}\right)\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\text{with}\phantom{\rule{thinmathspace}{0ex}}A\sim \mathcal{N}(0,{\widehat{\alpha}}^{2}),$$

(6.9)

and then take ${y}_{k}\stackrel{\mathrm{ind}}{\sim}\mathrm{Poisson}\left({u}_{k}\right)$ for *k* = 1, 2, … , *K*.
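A sketch of this resampling scheme, with illustrative choices of $\widehat{f}$, $\widehat{\alpha}$, and a right-tail count as the statistic of interest (plus a truncation guard, since *u* can go slightly negative in the far tails for extreme draws of *A*):

```python
import numpy as np

rng = np.random.default_rng(6)

# bins on [-4, 4]; f_hat an illustrative fitted density (standard normal shape)
x = np.linspace(-4.0, 4.0, 49)
delta = x[1] - x[0]
N, sigma0, alpha_hat = 6000, 1.0, 0.10
f_hat = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
f2_hat = (x**2 - 1) * f_hat                    # second derivative of f_hat

B = 2000
stats = np.empty(B)
for b in range(B):
    A = rng.normal(0.0, alpha_hat)             # A ~ N(0, alpha_hat^2), as in (6.9)
    u = N * delta * (f_hat + sigma0**2 / np.sqrt(2) * A * f2_hat)
    y = rng.poisson(np.maximum(u, 0.0))        # y ~ Poisson(u), guarded
    stats[b] = y[x > 2.0].sum()                # resampled right-tail count
```

The resampled counts are strongly overdispersed relative to a pure Poisson model, reflecting the correlation penalty carried by *A*.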

Suppose test statistic *t* has possible densities $\left\{{f}_{\theta}\left(t\right)\right\}$, and let $z={\Phi}^{-1}\left({F}_{{\theta}_{0}}\left(t\right)\right)$, with ${F}_{{\theta}_{0}}$ the null cdf. Then the density of *z* under ${f}_{\theta}$ is

$${g}_{\theta}\left(z\right)=\phi \left(z\right){f}_{\theta}\left(t\right)\u2215{f}_{{\theta}_{0}}\left(t\right).$$

(6.10)

The density curves in Figure 7 were obtained from (6.10), with ${f}_{\theta}\left(t\right)$ the non-central *t* density ${t}_{\nu}\left(\delta \right)$ of (5.1).
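Equation (6.10) is a change of variables, so ${g}_{\theta}\left(z\right)$ should integrate to 1; a numerical check for the non-central *t* case, with illustrative *ν* = 20 and *δ* = 2:

```python
import numpy as np
from scipy.stats import nct, norm, t as t_dist

nu, delta = 20, 2.0                            # illustrative values
z = np.linspace(-7.0, 7.0, 4001)
t = t_dist.ppf(norm.cdf(z), nu)                # invert z = Phi^{-1}(F_nu(t))
g = norm.pdf(z) * nct.pdf(t, nu, delta) / t_dist.pdf(t, nu)    # (6.10)
total = np.sum((g[1:] + g[:-1]) / 2 * np.diff(z))              # trapezoid rule
```

The mass outside [−7, 7] is negligible here, so the numerical integral is 1 to within discretization error.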

In some circumstances, Theorem 2 can be extended to multi-parameter families $\mathcal{F}=\left\{{F}_{\eta}\right\}$ where we wish to test a hypothesis about a real-valued function of *η*, as suggested at the end of Section 5.

The paper considers studies where a large number *N* of cases are under investigation, *N* perhaps in the hundreds or thousands, each represented by its own *z*-value *z _{i}*, and where there is the possibility of substantial correlation among the *z _{i}*'s. The main points are these:

- Exact formulas for the accuracy of correlated *z*-value cdfs are derived under normal distribution assumptions (Section 2).
- Practical estimates for the approximation formulas are derived and demonstrated through simulations and application to a microarray study (Section 3 and Section 4).
- Delta-method arguments are used to extend the cdf results to more general summary statistics (Section 3 and Section 4).
- Under reasonable assumptions, it is shown that *z*-scores tend to have nearly normal distributions, even in non-null situations (Section 5), justifying application of the theory to studies in which the individual variates are *z*-values.

Our main conclusion is that by dealing with normal variates, a practical assessment of large-scale correlation effects on statistical estimates is possible.

This work was supported in part by NIH grant 8R01 EB002784 and NSF grant DMS0505673.

- Bolstad B, Irizarry R, Astrand M, Speed T. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003;19:185–193.
- Clarke S, Hall P. Robustness of multiple testing procedures against dependence. Ann. Statist. 2009;37:332–358.
- Csörgő S, Mielniczuk J. The empirical process of a short-range dependent stationary sequence under Gaussian subordination. Probab. Theory Related Fields. 1996;104:15–25.
- Desai K, Deller J, McCormick J. The distribution of number of false discoveries for highly correlated null hypotheses. Ann. Appl. Statist. 2009. Under review.
- Dudoit S, van der Laan MJ, Pollard KS. Multiple testing. I. Single-step procedures for control of general type I error rates. Stat. Appl. Genet. Mol. Biol. 2004;3: Art. 13.
- Dudoit S, Shaffer JP, Boldrick JC. Multiple hypothesis testing in microarray experiments. Statist. Sci. 2003;18:71–103.
- Efron B. Bootstrap confidence intervals for a class of parametric problems. Biometrika. 1985;72:45–58.
- Efron B. Better bootstrap confidence intervals. J. Amer. Statist. Assoc. 1987;82:171–200. With comments and a rejoinder by the author.
- Efron B. Correlation and large-scale simultaneous significance testing. J. Amer. Statist. Assoc. 2007a;102:93–103.
- Efron B. Size, power and false discovery rates. Ann. Statist. 2007b;35:1351–1377.
- Efron B. Microarrays, empirical Bayes and the two-groups model. Statist. Sci. 2008;23:1–22.
- Efron B. Are a set of microarrays independent of each other? Ann. Appl. Statist. 2009. To appear.
- Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science. 1999;286:531–537.
- Hall P. The Bootstrap and Edgeworth Expansion. Springer-Verlag; New York: 1992. (Springer Series in Statistics).
- Johnson NL, Kotz S. Distributions in Statistics: Continuous Univariate Distributions. Houghton Mifflin Co; Boston, Mass: 1970.
- Lancaster HO. The structure of bivariate distributions. Ann. Math. Statist. 1958;29:719–736.
- Owen AB. Variance of the number of false discoveries. J. R. Stat. Soc. Ser. B Stat. Methodol. 2005;67:411–426.
- Qiu X, Brooks A, Klebanov L, Yakovlev A. The effects of normalization on the correlation structure of microarray data. BMC Bioinformatics. 2005a;6:120.
- Qiu X, Klebanov L, Yakovlev A. Correlation between gene expression levels and limitations of the empirical Bayes methodology for finding differentially expressed genes. Stat. Appl. Genet. Mol. Biol. 2005b;4: Art. 34.
- Schwartzman A, Lin X. The effect of correlation in false discovery rate estimation. Harvard University; 2009. (Biostatistics Working Paper Series number 106).
- Westfall P, Young S. Resampling-based Multiple Testing: Examples and Methods for p-Value Adjustment. Wiley-Interscience; New York, NY: 1993.
