


Ann Stat. Author manuscript; available in PMC 2010 November 12.

Published in final edited form as:

Ann Stat. 2010 November 1; 38(5): 2723–2750.

doi: 10.1214/10-AOS802 | PMCID: PMC2980338

NIHMSID: NIHMS248819

Jianqing Fan, Frederick L. Moore Professor of Finance; Yang Feng, Graduate Student; and Yue S. Niu, Assistant Professor

Jianqing Fan, Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544;

Jianqing Fan: jqfan@princeton.edu; Yang Feng: yangfeng@princeton.edu; Yue S. Niu: yueniu@math.arizona.edu


Estimation of genewise variance arises in two important applications of microarray data analysis: selecting significantly differentially expressed genes and validation tests for normalization of microarray data. We approach the problem by introducing a two-way nonparametric model, which is an extension of the famous Neyman-Scott model and is applicable beyond microarray data. The problem itself poses interesting challenges, because the number of nuisance parameters is proportional to the sample size and it is not obvious how the variance function can be estimated when measurements are correlated. For this high-dimensional nonparametric problem, we propose two novel nonparametric estimators of the genewise variance function and semiparametric estimators of the measurement correlation, obtained by solving a system of nonlinear equations. Their asymptotic normality is established. Their finite-sample performance is demonstrated by simulation studies. The estimators also improve the power of the tests for detecting statistically differentially expressed genes. The methodology is illustrated with data from the MicroArray Quality Control (MAQC) project.

Microarray experiments are among the most widely used technologies for monitoring thousands of gene expression levels simultaneously. One of the important scientific endeavors of microarray data analysis is to detect statistically differentially expressed genes for downstream analysis (Cui, Hwang, and Qiu, 2005; Fan, Tam, Vande Woude and Ren, 2004; Fan and Ren, 2006; Storey and Tibshirani, 2003; Tusher, Tibshirani, and Chu, 2001). The standard *t*-test and *F*-test are frequently employed. However, due to the cost of the experiments, it is common to see a large number of genes with a small number of replications. Even in customized arrays where only several hundred gene expression levels are measured, the number of replications is usually limited. As a result, we face a high-dimensional statistical problem with a large number of parameters and a small sample size.

Genewise variance estimation lies at the heart of microarray data analysis. To select differentially expressed genes among thousands of genes, the *t*-test is frequently employed with a stringent control of type I errors. The degrees of freedom are usually small due to limited replications. The power of the test can be significantly improved if the genewise variance can be estimated accurately; in that case, the *t*-test becomes essentially a *z*-test. A simple genewise variance estimator is the sample variance of the replicated data, which is unreliable due to the relatively small number of replications, and such unreliable estimates have a direct impact on the sensitivity and specificity of the *t*-test (Cui et al., 2005). Therefore, novel methods for estimating the genewise variances are needed to improve the power of the standard *t*-test.

Another important application of genewise variance estimation arises from testing whether systematic biases have been properly removed after applying some normalization method, or from selecting the most appropriate normalization technique for a given array. Fan and Niu (2007) developed such validation tests (see Section 4), which require the estimation of the genewise variance. Existing variance estimators, such as the pooled variance estimator and the REML estimator (Smyth, Michaud, and Scott, 2005), are not accurate enough due to the small number of replications.

Due to the importance of genewise variance in microarray data analysis, conscientious efforts have been made to accurately estimate it. Various methods have been proposed under different models and assumptions. It has been widely observed that genewise variance is to a great extent related to the intensity level. Kamb and Ramaswami (2001) proposed a crude regression estimation of variance from microarray control data. Tong and Wang (2007) discussed a family of shrinkage estimators to improve the accuracy.

Let *R _{gi}* and *G _{gi}* denote the intensities of the red and green channels for the *i*-th replication of gene *g*, and define the log-ratios and log-intensities

$${Y}_{gi}={log}_{2}({G}_{gi}/{R}_{gi}),\phantom{\rule{0.38889em}{0ex}}\text{and}\phantom{\rule{0.38889em}{0ex}}{X}_{gi}=\frac{1}{2}{log}_{2}({G}_{gi}{R}_{gi}),\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}i=1,\cdots ,I,g=1,\cdots ,N,$$
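As a quick illustration, the channel-to-(Y, X) transform can be written in a few lines of numpy; the (genes × replications) array shapes and the helper name `ma_transform` are illustrative, not from the paper:

```python
import numpy as np

def ma_transform(G, R):
    """Compute log-ratios Y and log-intensities X from the two channel
    intensity arrays G and R, each of shape (N, I): N genes, I replications."""
    Y = np.log2(G / R)           # Y_gi = log2(G_gi / R_gi)
    X = 0.5 * np.log2(G * R)     # X_gi = (1/2) log2(G_gi * R_gi)
    return Y, X
```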

where *I* is the number of replications for each gene and *N* is the number of genes with replications. For the purpose of estimating genewise variance, we assume that there is no systematic biases or the systematic biases have been removed by a certain normalization method. This assumption is always made for selecting significantly differentially expressed genes or validation test under the null hypothesis. Thus we have

$${Y}_{gi}={\alpha}_{g}+{\sigma}_{gi}{\epsilon}_{gi},$$

with *α _{g}* denoting the log-ratio of gene expressions in the treatment and control samples. Here, {*ε _{gi}*} are random errors with mean 0 and variance 1, so that *σ _{gi}* is the standard deviation of the *i*-th log-ratio measurement of gene *g*.

In the papers by Wang, Ma, and Carroll (2008) and Carroll and Wang (2008), nonparametric measurement-error models have been introduced to aggregate the information of estimating the genewise variance:

$${Y}_{gi}={\alpha}_{g}+\sigma ({\alpha}_{g}){\epsilon}_{gi},\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\text{corr}({\epsilon}_{gi},{\epsilon}_{{gi}^{\prime}})=0,g=1,\cdots ,N,i=1,\cdots ,I.$$

(1)

The model is intended for the analysis of Affymetrix array (one-color array) data, in which *α _{g}* represents the expected intensity level, and the genewise variance is modeled as a function of this unobservable mean level. For two-color microarray data, however, the genewise variance is more naturally related to the observed intensity level, and within-array replications can be correlated.

To overcome these drawbacks in the applications to microarray data and to utilize the observed intensity information, we assume that *σ _{gi}* = *σ*(*X _{gi}*) for a smooth function *σ*(·) of the observed intensity, resulting in the two-way nonparametric model

$${Y}_{gi}={\alpha}_{g}+\sigma ({X}_{gi}){\epsilon}_{gi},\phantom{\rule{0.38889em}{0ex}}g=1,\cdots ,N,i=1,\cdots ,I,$$

(2)

for estimating genewise variance. This model is clearly an extension of the Neyman-Scott problem (Neyman and Scott, 1948), in which the genewise variance is a constant. The Neyman-Scott problem has many applications in astronomy. Note that the number of nuisance parameters {*α _{g}*} is proportional to the sample size. This imposes an important challenge to the nonparametric problem. It is not even clear whether the function *σ*(·) can be consistently estimated in the presence of these nuisance parameters.

To estimate the genewise variance in their microarray data analysis, Fan et al. (2004) assumed a model similar to (2). But in the absence of other available techniques, they had to impose that the treatment effect {*α _{g}*} is also a smooth function of the intensity level so that they could apply nonparametric methods to estimate the genewise variance (Ruppert et al., 1997). However, this assumption is not valid in most microarray applications, and the estimator of genewise variance incurs big biases unless {*α _{g}*} is indeed a smooth function of the intensity level.

We propose a novel nonparametric approach to estimate the genewise variance. We first study a benchmark case with no correlation between replications, i.e., *ρ* = 0. This corresponds to the case of independent replications across arrays (Fan et al., 2005; Huang et al., 2005), and it also applies to the settings of the Neyman-Scott problem. By noticing that $E\{{({Y}_{gi}-{\overline{Y}}_{g})}^{2}\mid \mathbf{X}\}$ is a linear combination of the genewise variances {*σ*²(*X _{gi}*)}, we invert the linear system to obtain synthetic data whose conditional means are exactly *σ*²(*X _{gi}*), and then apply local linear smoothing. For correlated replications, the same idea leads to a system of nonlinear equations, which we solve semiparametrically.

Model (2) can be applied to microarrays in which within-array replications are not available. In that case, we can aggregate all the microarrays and view them as a super-array with replications (Fan et al., 2005; Huang et al., 2005). In other words, *i* in (2) indexes arrays, and *ρ* can be taken as 0; that is, (2) becomes the across-array replication model with *ρ* = 0.

The structure of this paper is as follows. In Section 2 we discuss the estimation schemes for the genewise variance and establish the asymptotic properties of the estimators. Simulation studies are given in Section 3 to verify the finite-sample performance of our methods. Applications to the data from the MicroArray Quality Control (MAQC) project are shown in Section 4 to illustrate the proposed methodology. In Section 5, we give a short summary. Technical proofs are relegated to the Appendix.

We first consider the specific case where there is no correlation among the replications *Y _{g1}*, ···, *Y _{gI}*, i.e., *ρ* = 0. Direct calculation yields

$$E[{({Y}_{gi}-{\overline{Y}}_{g})}^{2}\mid \mathbf{X}]={(I-1)}^{2}{\sigma}^{2}({X}_{gi})/{I}^{2}+\sum _{j\ne i}{\sigma}^{2}({X}_{gj})/{I}^{2},\phantom{\rule{0.38889em}{0ex}}i=1,\cdots ,I.$$

We will discuss in §2.4 the case *I* = 2. For *I* > 2, we have *I* different equations with *I* unknowns *σ*²(*X _{g1}*), ···, *σ*²(*X _{gI}*). To write the system in matrix form, let

$${\mathbf{r}}_{g}={({({Y}_{g1}-{\overline{Y}}_{g})}^{2},\cdots ,{({Y}_{gI}-{\overline{Y}}_{g})}^{2})}^{T},\phantom{\rule{0.16667em}{0ex}}\text{and}\phantom{\rule{0.16667em}{0ex}}{\mathit{\sigma}}_{g}^{2}={({\sigma}^{2}({X}_{g1}),\cdots ,{\sigma}^{2}({X}_{gI}))}^{T}.$$

Then, it can easily be shown that
${\mathit{\sigma}}_{g}^{2}=\mathbf{B}\text{E}[{\mathbf{r}}_{g}\mid \mathbf{X}]$, where **B** is the coefficient matrix:

$$\mathbf{B}=\frac{({I}^{2}-I)\mathbf{I}-\mathbf{E}}{(I-1)(I-2)},$$

with **I** being the *I × I* identity matrix and **E** the *I × I* matrix with all elements 1. Define

$${\mathbf{Z}}_{g}={({Z}_{g1},\cdots ,{Z}_{gI})}^{T}\triangleq \mathbf{B}{\mathbf{r}}_{\mathbf{g}}.$$

Then, we have

$${\sigma}^{2}({X}_{gi})=\text{E}[{Z}_{gi}\mid \mathit{X}].$$

(3)

Note that the left hand side of (3) depends only on *X _{gi}*, not on the other variables. By the double expectation formula, it follows that the variance function satisfies

$${\sigma}^{2}(x)=\text{E}[{Z}_{gi}\mid {X}_{gi}=x],\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}i=1,\cdots ,I.$$

(4)
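The matrix inversion behind the synthetic data can be sanity-checked numerically: **B** above should invert the coefficient matrix of the linear system E[**r**_g | **X**], which has diagonal entries (*I* − 1)²/*I*² and off-diagonal entries 1/*I*² (read off from the display above). A small numpy sketch, with helper names of our own choosing:

```python
import numpy as np

def coef_matrix_A(I):
    """E[r_g | X] = A @ sigma_g^2: diagonal (I-1)^2/I^2, off-diagonal 1/I^2."""
    A = np.full((I, I), 1.0 / I**2)
    np.fill_diagonal(A, (I - 1)**2 / I**2)
    return A

def coef_matrix_B(I):
    """B = ((I^2 - I) I - E) / ((I - 1)(I - 2)); requires I > 2."""
    E = np.ones((I, I))
    return ((I**2 - I) * np.eye(I) - E) / ((I - 1) * (I - 2))
```

For any *I* > 2, `coef_matrix_B(I) @ coef_matrix_A(I)` recovers the identity, so **Z**_g = **Br**_g indeed has conditional mean *σ*_g².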

Using the synthetic data {(*X _{gi}*, *Z _{gi}*), *g* = 1, ···, *N*} for each fixed *i*, we estimate *σ*²(*x*) by the local linear smoother

$${\widehat{\eta}}_{i}^{2}(x)=\sum _{g=1}^{N}{W}_{N,i}\left(\frac{{X}_{gi}-x}{h}\right){Z}_{gi},\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}i=1,\cdots ,I,$$

(5)

with

$${W}_{N,i}(u)={h}^{-1}K(u)\frac{{S}_{N,2}-{uS}_{N,1}}{{S}_{N,2}{S}_{N,0}-{S}_{N,1}^{2}},$$

where *K _{h}*(*t*) = *h*^{−1}*K*(*t/h*) for a kernel function *K*, and ${S}_{N,l}={\sum}_{g=1}^{N}{K}_{h}({X}_{gi}-x){[({X}_{gi}-x)/h]}^{l}$.
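The weights in (5) translate directly into code. Below is a minimal numpy sketch (function names are ours, not the authors'); the tricube kernel used later in the simulation section is plugged in as the default:

```python
import numpy as np

def tricube(u):
    """Kernel (70/81)(1 - |u|^3)^3 on [-1, 1]."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1, (70.0 / 81.0) * (1 - np.abs(u)**3)**3, 0.0)

def local_linear(x0, X, Z, h, kernel=tricube):
    """Local linear estimate of E[Z | X = x0] using the weights W_{N,i} in (5)."""
    u = (X - x0) / h
    Kh = kernel(u) / h                                   # K_h(X_gi - x0)
    S0, S1, S2 = (np.sum(Kh * u**l) for l in range(3))   # S_{N,l}, l = 0, 1, 2
    w = Kh * (S2 - u * S1) / (S2 * S0 - S1**2)           # weights sum to one
    return np.sum(w * Z)
```

By construction the weights satisfy Σ*w* = 1 and Σ*w*(*X* − *x*₀) = 0, so the smoother reproduces linear functions exactly, which makes a convenient unit test.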

Denote

$${c}_{K}={\int}_{-\infty}^{\infty}{u}^{2}K(u)du,\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}{d}_{K}={\int}_{-\infty}^{\infty}{K}^{2}(u)du,\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}{\sigma}_{1}=\text{E}[\sigma ({X}_{gi})],\phantom{\rule{0.38889em}{0ex}}\text{and}\phantom{\rule{0.38889em}{0ex}}{\sigma}_{2}=\text{E}[{\sigma}^{2}({X}_{gi})].$$

Assume that the *X _{gi}* are i.i.d. with marginal density *f _{X}*(·). The following theorem establishes the asymptotic normality of the estimators.

Theorem 1. *Under the regularity conditions in the Appendix, for a fixed point x, we have*

$${\mathbf{\Sigma}}^{-1/2}\left({\widehat{\mathit{\eta}}}^{2}(x)-({\sigma}^{2}(x)+b(x)+{o}_{P}({h}^{2}))\mathbf{e}\right)\stackrel{D}{\to}N(\mathbf{0},\mathbf{I}),$$

provided that h → 0 and N h → ∞, where ${\widehat{\mathit{\eta}}}^{2}(x)={({\widehat{\eta}}_{1}^{2}(x),\cdots ,{\widehat{\eta}}_{I}^{2}(x))}^{T}$, **e** = (1, 1, ···, 1)^{T} and

$$\mathbf{\Sigma}={V}_{1}\mathbf{I}+{V}_{2}(\mathbf{E}-\mathbf{I}),$$

with $b(x)=\frac{{h}^{2}}{2}{c}_{K}{({\sigma}^{2}(x))}^{\prime \prime}$,

$$\begin{array}{l}{V}_{1}=\frac{{d}_{K}}{{\mathit{Nhf}}_{X}(x)}\left\{2{\sigma}^{4}(x)+\frac{4+4(I-1)(I-3)}{(I-1){(I-2)}^{2}}{\sigma}_{2}{\sigma}^{2}(x)+\frac{2}{(I-1)(I-2)}{\sigma}_{2}^{2}\right\},\\ {V}_{2}=\frac{1}{N}\left\{\frac{4}{{(I-1)}^{2}}{\sigma}^{4}(x)-\frac{8}{{(I-1)}^{2}}{\sigma}_{2}{\sigma}^{2}(x)+\frac{2(I-3)}{{(I-1)}^{2}(I-2)}{\sigma}_{2}^{2}\right\}.\end{array}$$

Note that *V*_{2} is one order of magnitude smaller than *V*_{1}. Hence, the estimators
${\widehat{\eta}}_{1}^{2}(x),\cdots ,{\widehat{\eta}}_{I}^{2}(x)$ are asymptotically independently distributed as *N*(*σ*^{2}(*x*) + *b*(*x*), *V*_{1}). Their dependence is only in the second order. The best linear combination of *I* estimators is

$${\widehat{\eta}}^{2}(x)=[{\widehat{\eta}}_{1}^{2}(x)+{\widehat{\eta}}_{2}^{2}(x)+\cdots +{\widehat{\eta}}_{I}^{2}(x)]/I,$$

(6)

with the asymptotic distribution

$$N({\sigma}^{2}(x)+b(x),{V}_{1}/I+(1-1/I){V}_{2}).$$

(7)

See also the aggregated estimator (16) with *ρ* = 0, which has the same asymptotic property as the estimator (6). See Remark 1 below for additional discussion.

Theorem 1 gives the asymptotic normality of the proposed nonparametric estimators under the presence of a large number of nuisance parameters
${\{{\alpha}_{g}\}}_{g=1}^{N}$. With the newly proposed technique, we do not have to impose any assumptions on *α _{g}* such as sparsity or smoothness. This kind of local linear estimator can be applied to most two-color microarray data, for instance, customized arrays and Agilent arrays.

We now consider the case of correlated within-array replications. There is considerable evidence that correlation among within-array replicated genes exists (Smyth et al., 2005; Fan and Niu, 2007). Suppose that the within-array replications have a common correlation corr(*Y _{gi}*, *Y _{gj}* | **X**) = *ρ* for *i* ≠ *j*. Then

$$var[({Y}_{gi}-{\overline{Y}}_{g})\mid \mathit{X}]={(I-1)}^{2}{\sigma}^{2}({X}_{gi})/{I}^{2}+\sum _{j\ne i}{\sigma}^{2}({X}_{gj})/{I}^{2}+2\rho \sum _{\begin{array}{c}1\le j<k\le I,\\ j\ne i,k\ne i\end{array}}\sigma ({X}_{gj})\sigma ({X}_{gk})/{I}^{2}-2(I-1)\rho \sum _{j\ne i}\sigma ({X}_{gi})\sigma ({X}_{gj})/{I}^{2}.$$

(8)

This is a complex system of nonlinear equations for which no closed-form solution exists. Innovative ideas are needed.

Using the same notation as that in the previous section, it can be calculated that

$$\text{E}[{Z}_{gi}\mid \mathit{X}]={\sigma}^{2}({X}_{gi})-\frac{2}{I-1}\sum _{j\ne i}\rho \sigma ({X}_{gi})\sigma ({X}_{gj})+\frac{2}{(I-1)(I-2)}\sum _{\begin{array}{l}1\le j<k\le I,\\ j\ne i,k\ne i\end{array}}\rho \sigma ({X}_{gj})\sigma ({X}_{gk}).$$

Taking the expectation with respect to *X _{gj}* for all *j* ≠ *i*, we obtain

$$\text{E}[{Z}_{gi}\mid {X}_{gi}=x]={\sigma}^{2}(x)-2\rho {\sigma}_{1}\sigma (x)+\rho {\sigma}_{1}^{2}\triangleq {\eta}^{2}(x),$$

(9)

where *σ*_{1} = E[*σ*(*X*)].

Here, we can directly apply the local linear approach to all aggregated data
${\{({X}_{gi},{Z}_{gi})\}}_{i,g=1}^{I,N}$, due to the same regression function (9). Let
${\widehat{\eta}}_{A}^{2}(\xb7)$ be the local linear estimator of *η*^{2}(·), based on the aggregated data. Then,

$${\widehat{\eta}}_{A}^{2}(x)=\sum _{g=1}^{N}\sum _{i=1}^{I}{W}_{N}\left(\frac{{X}_{gi}-x}{h}\right){Z}_{gi},$$

(10)

with

$${W}_{N}(u)={h}^{-1}K(u)\frac{{S}_{NI,2}-{uS}_{NI,1}}{{S}_{NI,0}{S}_{NI,2}-{S}_{NI,1}^{2}},$$

where ${S}_{NI,l}={\sum}_{g=1}^{N}{\sum}_{i=1}^{I}{K}_{h}({X}_{gi}-x){[({X}_{gi}-x)/h]}^{l}$. Substituting ${\widehat{\eta}}_{A}^{2}(x)$ into (9) and solving the resulting quadratic equation in *σ*(*x*) yields two solutions:

$${\widehat{\sigma}}_{A}{(x,\rho )}^{(1),(2)}=\widehat{\rho}{\widehat{\sigma}}_{1}\pm \sqrt{{\widehat{\rho}}^{2}{\widehat{\sigma}}_{1}^{2}-\widehat{\rho}{\widehat{\sigma}}_{1}^{2}+{\widehat{\eta}}_{A}^{2}(x)},$$

(11)

Notice that the variance function must be nonnegative, and when *ρ* = 0 the solution reduces to *σ*(*x*) = *η _{A}*(*x*); hence the root with the positive sign is the relevant one. Given consistent estimates of *ρ* and *σ*_{1}, we therefore define

$${\widehat{\sigma}}_{A}(x)=\rho {\sigma}_{1}+\sqrt{{\rho}^{2}{\sigma}_{1}^{2}-\rho {\sigma}_{1}^{2}+{\widehat{\eta}}_{A}^{2}(x)}.$$

(12)

This is called the aggregated estimator. Note that in (12), *ρ*, *σ*_{1} and *σ*(·) are all unknown.

To estimate *ρ*, we assume that there are *J* independent arrays (*J* ≥ 2); in other words, we observe data from (2) independently *J* times. In this case, the residual maximum likelihood (REML) estimator introduced by Smyth et al. (2005) is

$${\widehat{\rho}}_{0}=\frac{{\sum}_{g=1}^{N}{s}_{B,g}^{2}-{\sum}_{g=1}^{N}{s}_{W,g}^{2}}{{\sum}_{g=1}^{N}{s}_{B,g}^{2}+(I-1){\sum}_{g=1}^{N}{s}_{W,g}^{2}},$$

(13)

where ${s}_{B,g}^{2}=I{(J-1)}^{-1}{\sum}_{j=1}^{J}{({\overline{Y}}_{gj}-{\overline{Y}}_{g})}^{2}$ with ${\overline{Y}}_{gj}={I}^{-1}{\sum}_{i=1}^{I}{Y}_{\mathit{gij}}$ and ${\overline{Y}}_{g}={J}^{-1}{\sum}_{j=1}^{J}{\overline{Y}}_{gj}$ is the between-arrays variance and ${s}_{W,g}^{2}$ is the within-array variance:

$${s}_{W,g}^{2}=\frac{1}{J(I-1)}\sum _{j=1}^{J}\sum _{i=1}^{I}{({Y}_{\mathit{gij}}-{\overline{Y}}_{gj})}^{2}.$$
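The estimator (13) is a few lines of numpy; the (N, J, I) array layout and the function name below are our conventions, not the paper's:

```python
import numpy as np

def reml_rho0(Y):
    """REML-type estimator (13). Y has shape (N, J, I): N genes,
    J independent arrays, I within-array replications."""
    N, J, I = Y.shape
    Ybar_gj = Y.mean(axis=2)                        # per-array gene means
    Ybar_g = Ybar_gj.mean(axis=1, keepdims=True)    # per-gene grand means
    sB = (I / (J - 1)) * ((Ybar_gj - Ybar_g)**2).sum(axis=1)              # s_{B,g}^2
    sW = ((Y - Ybar_gj[:, :, None])**2).sum(axis=(1, 2)) / (J * (I - 1))  # s_{W,g}^2
    return (sB.sum() - sW.sum()) / (sB.sum() + (I - 1) * sW.sum())
```

With homoscedastic errors of common within-array correlation *ρ*, E *s*²_{B,g} = *σ*²(1 + (*I* − 1)*ρ*) and E *s*²_{W,g} = *σ*²(1 − *ρ*), so the ratio above targets *ρ* in that case.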

As discussed in Smyth et al. (2005), the estimator ${\widehat{\rho}}_{0}$ of *ρ* is consistent when var(*Y _{gij}* | **X**) is a constant. In our heteroscedastic model (2), however, ${\widehat{\rho}}_{0}$ converges to $\rho {\sigma}_{1}^{2}/{\sigma}_{2}$ rather than *ρ*, so we apply the correction

$$\widehat{\rho}=\frac{{\sigma}_{2}}{{\sigma}_{1}^{2}}\xb7\frac{{\sum}_{g=1}^{N}{s}_{B,g}^{2}-{\sum}_{g=1}^{N}{s}_{W,g}^{2}}{{\sum}_{g=1}^{N}{s}_{B,g}^{2}+(I-1){\sum}_{g=1}^{N}{s}_{W,g}^{2}}.$$

(14)

The consistency of $\widehat{\rho}$ is given by the following theorem.

Theorem 2. *Under the regularity condition in the Appendix, the estimator $\widehat{\rho}$ of ρ is $\sqrt{N}$-consistent:*

$$\widehat{\rho}-\rho ={O}_{P}({N}^{-1/2}).$$

With a consistent estimator of *ρ*, the quantities *σ*_{1}, *σ*_{2} and *σ _{A}*(·) can be solved by the following iterative algorithm:

- Step 1. Set ${\widehat{\eta}}_{A}^{2}(\cdot)$ as an initial estimate of ${\sigma}_{A}^{2}(\cdot)$.
- Step 2. With ${\widehat{\sigma}}_{A}(\cdot)$, compute
$${\widehat{\sigma}}_{1}={N}^{-1}\sum _{g=1}^{N}{\widehat{\sigma}}_{A}({X}_{gi}),\phantom{\rule{0.38889em}{0ex}}{\widehat{\sigma}}_{2}={N}^{-1}\sum _{g=1}^{N}{\widehat{\sigma}}_{A}^{2}({X}_{gi}),\phantom{\rule{0.38889em}{0ex}}\widehat{\rho}={\widehat{\rho}}_{0}{\widehat{\sigma}}_{2}/{\widehat{\sigma}}_{1}^{2}.$$
(15)
- Step 3. With ${\widehat{\sigma}}_{1}$, ${\widehat{\sigma}}_{2}$ and $\widehat{\rho}$, update ${\widehat{\sigma}}_{A}(\cdot)$ via (16).
- Step 4. Repeat Steps 2 and 3 until convergence.
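The iterative scheme above can be sketched in a few lines. Here `eta2_hat` stands for the aggregated local linear fits ${\widehat{\eta}}_{A}^{2}$ evaluated at the design points, `rho0_hat` is the REML estimate (13), and the update uses the positive root (12). This is a sketch under our naming conventions, not the authors' code:

```python
import numpy as np

def solve_sigma(eta2, rho, s1):
    """Positive root (12): sigma(x) = rho*s1 + sqrt(rho^2*s1^2 - rho*s1^2 + eta2)."""
    return rho * s1 + np.sqrt(np.maximum(rho**2 * s1**2 - rho * s1**2 + eta2, 0.0))

def iterate_estimates(eta2_hat, rho0_hat, tol=1e-3, max_iter=100):
    """Steps 1-4: returns (fitted sigma values, sigma1, sigma2, rho)."""
    sigma = np.sqrt(np.maximum(eta2_hat, 0.0))   # Step 1: initialize from eta_A
    rho = 0.0
    for _ in range(max_iter):
        s1, s2 = sigma.mean(), (sigma**2).mean()       # Step 2
        rho_new = rho0_hat * s2 / s1**2                # correction (15)
        sigma = solve_sigma(eta2_hat, rho_new, s1)     # Step 3: update via (16)
        if abs(rho_new - rho) < tol:                   # Step 4: iterate to convergence
            rho = rho_new
            break
        rho = rho_new
    return sigma, s1, s2, rho
```

Note that the true (*σ*(·), *σ*₁, *σ*₂, *ρ*) is a fixed point of this map: plugging the true values into Step 2 reproduces *ρ* via (15), and the true *σ*(·) solves (16).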

This provides simultaneously the estimators ${\widehat{\sigma}}_{1}$, ${\widehat{\sigma}}_{2}$, $\widehat{\rho}$, and ${\widehat{\sigma}}_{A}(\cdot)$. From our numerical experience, the algorithm converges after a few iterations. Upon convergence, the estimator ${\widehat{\sigma}}_{A}(x)$ is given by

$${\widehat{\sigma}}_{A}(x)=\widehat{\rho}{\widehat{\sigma}}_{1}+\sqrt{{\widehat{\rho}}^{2}{\widehat{\sigma}}_{1}^{2}-\widehat{\rho}{\widehat{\sigma}}_{1}^{2}+{\widehat{\eta}}_{A}^{2}(x)}.$$

(16)

Note that the presence of multiple arrays is used only to estimate the correlation *ρ* of the replications; it is not needed for estimating the genewise variance function itself. When *J* arrays are present, we can average the *J* estimates obtained from the individual arrays.

Following an idea similar to the case without correlation, we can derive the asymptotic property of ${\widehat{\eta}}_{A}^{2}(x)$.

Theorem 3. *Under the regularity conditions in the Appendix, for a fixed point x, we have:*

$${\{{V}^{\ast}\}}^{-1/2}\left\{{\widehat{\eta}}_{A}^{2}(x)-[{\eta}^{2}(x)+\beta (x)]+{o}_{P}({h}^{2})\right\}\stackrel{D}{\to}N(0,1),$$

provided that h → 0 and Nh → ∞, with $\beta (x)=\frac{{h}^{2}}{2}{c}_{K}{({\eta}^{2}(x))}^{\prime \prime}$ and

$${V}^{\ast}=\frac{1}{I}{V}_{1}^{\prime}+\frac{I-1}{I}{V}_{2}^{\prime},$$

where

$$\begin{array}{l}{V}_{1}^{\prime}=\frac{{d}_{K}}{{\mathit{Nhf}}_{X}(x)}\left\{2{\sigma}^{4}(x)-8\rho {\sigma}_{1}{\sigma}^{3}(x)+{C}_{2}{\sigma}^{2}(x)+{C}_{3}\sigma (x)+{C}_{4}\right\},\\ {V}_{2}^{\prime}=\frac{1}{N}\left\{{D}_{0}{\sigma}^{4}(x)+{D}_{1}{\sigma}^{3}(x)+{D}_{2}{\sigma}^{2}(x)+{D}_{3}\sigma (x)+{D}_{4}\right\},\end{array}$$

with coefficients C_{2}, ···, C_{4}, D_{0}, ···, D_{4} defined in the Appendix.

The asymptotic normality of
${\widehat{\sigma}}_{A}^{2}(x)$ can be derived from that of
${\widehat{\eta}}_{A}^{2}(x)$. More specifically,
${\widehat{\sigma}}_{A}^{2}(x)=\varphi ({\widehat{\eta}}_{A}^{2}(x))$ with
$\varphi (z)={(\rho {\sigma}_{1}+\sqrt{{\rho}^{2}{\sigma}_{1}^{2}-\rho {\sigma}_{1}^{2}+z})}^{2}$. The derivative of *φ*(·) with respect to *z* is
$\psi (z)=\rho {\sigma}_{1}/\sqrt{{\rho}^{2}{\sigma}_{1}^{2}-\rho {\sigma}_{1}^{2}+z}+1$. Then, by the delta method, we have

$${\{{V}^{\ast}\}}^{-1/2}({\widehat{\sigma}}_{A}^{2}(x)-\varphi ({\eta}^{2}(x)+\beta (x)+{o}_{P}({h}^{2})))\stackrel{D}{\to}N(0,{\psi}^{2}({\eta}^{2}(x))).$$

An alternative approach when correlation exists is to apply the same correlation correction idea to ${\{{X}_{gi},{Z}_{gi}\}}_{g=1}^{N}$ for every replication i, resulting in the estimator ${\widehat{\sigma}}_{i}^{2}(x)$. In this case, it can be proved that the best linear combination of the estimator is

$${\widehat{\sigma}}^{2}(x)=[{\widehat{\sigma}}_{1}^{2}(x)+{\widehat{\sigma}}_{2}^{2}(x)+\cdots +{\widehat{\sigma}}_{I}^{2}(x)]/I,$$

(17)

This estimator has the same asymptotic performance as the aggregated estimator. However, we prefer the aggregated estimator for the following reasons: equation (16) needs to be solved only once, using the algorithm in §2.2.2; all data are treated symmetrically; and ${\widehat{\eta}}_{A}^{2}(\cdot)$ can be estimated more stably.

The aforementioned methods apply when there are more than two replications. For *I* = 2, the matrix **B** is not defined and the system of equations for var[(*Y _{gi}* − ${\overline{Y}}_{g}$) | **X**] is singular, so the genewise variances cannot be solved replication by replication. Taking the expectation with respect to the other intensity, however, yields

$$var[({Y}_{gi}-{\overline{Y}}_{g})\mid {X}_{gi}=x]=\frac{1}{4}{\sigma}^{2}(x)+\frac{1}{4}{\sigma}_{2}-\frac{1}{2}\rho {\sigma}_{1}\sigma (x),\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}i=1,2$$

(18)

where *σ*_{2} = E[*σ*²(*X _{gi}*)]. In this case, the left-hand side is always equal to var[(*Y _{g1}* − *Y _{g2}*)/2 | *X _{gi}* = *x*], so the two replications share the same synthetic response (*Y _{g1}* − *Y _{g2}*)²/4.

Let ${\widehat{\eta}}^{2}(x)$ be the local linear estimator of the function on the right-hand side, obtained by smoothing ${\{{({Y}_{g1}-{Y}_{g2})}^{2}/4\}}_{g=1}^{N}$ on both ${\{{X}_{g1}\}}_{g=1}^{N}$ and ${\{{X}_{g2}\}}_{g=1}^{N}$. Then, the genewise variance solves the equation

$$\widehat{\sigma}(x)=\widehat{\rho}{\widehat{\sigma}}_{1}+\sqrt{{\widehat{\rho}}^{2}{\widehat{\sigma}}_{1}^{2}-{\widehat{\sigma}}_{2}+4{\widehat{\eta}}^{2}(x)}.$$

(19)

The algorithm in §2.2.2 can then be applied directly.

In this section, we conduct simulations to evaluate the finite-sample performance of the variance estimators: the naive estimator ${\widetilde{\sigma}}^{2}(x)$, the estimator ${\widehat{\sigma}}^{2}(x)$, and the aggregated estimator ${\widehat{\sigma}}_{A}^{2}(x)$. First, the bias problem of the naive nonparametric estimator ${\widetilde{\sigma}}^{2}(x)$ is demonstrated; this bias is eliminated by our newly proposed methods. Then, we consider ${\widehat{\sigma}}^{2}(x)$ and ${\widehat{\sigma}}_{A}^{2}(x)$ under different configurations of the within-array replication correlation.

In all the simulation examples, we set the number of genes *N* = 2000, each gene having *I* = 3 within-array replications, and *J* = 4 independent arrays. For the purpose of investigating the genewise variance estimation, the data are generated from model (2). The details of the simulation scheme are summarized as follows:

- *α _{g}*: The expression levels of the first 250 genes are generated from the standard double exponential distribution; the rest are 0's. These expression levels are the same over the 4 arrays in each simulation, but may vary over simulations.
- *X*: The intensity is generated from a mixture distribution: with probability 0.7 from the distribution with density .0004(*x* − 6)^{3}I(6 < *x* < 16), and with probability 0.3 from the uniform distribution over [6, 16].
- *ε _{gi}*: generated from the standard normal distribution.

*σ*^{2}(·): The genewise variance function is taken as

$${\sigma}^{2}(x)=.15+.015{(12-x)}^{2}\text{I}\{x<12\}.$$

The parameters are taken from Fan et al. (2005). The kernel function is selected as
${\scriptstyle \frac{70}{81}}{(1-{\mid x\mid}^{3})}^{3}I(\mid x\mid \le 1)$. In addition, we fix the bandwidth *h* = 1 for all the numerical analysis.

For every setting, we repeat the whole simulation process *T* times and evaluate the estimates of *σ*²(·) over *K* = 101 grid points ${\{{x}_{k}\}}_{k=1}^{K}$ on the interval [6, 16]. For the *k*-th grid point, we define

$$\begin{array}{c}{B}_{k}={\overline{\sigma}}^{2}({x}_{k})-{\sigma}^{2}({x}_{k})\phantom{\rule{0.38889em}{0ex}}\text{with}\phantom{\rule{0.38889em}{0ex}}{\overline{\sigma}}^{2}({x}_{k})={T}^{-1}\sum _{t=1}^{T}{\widehat{\sigma}}_{t}^{2}({x}_{k}),\\ {S}_{k}={T}^{-1}\sum _{t=1}^{T}{[{\widehat{\sigma}}_{t}^{2}({x}_{k})-{\overline{\sigma}}^{2}({x}_{k})]}^{2},\end{array}$$

and
${\text{MSE}}_{k}={B}_{k}^{2}+{S}_{k}$. Let *f*(·) be the density function of intensity *X*. Let

$${\text{Bias}}^{2}=\sum _{k=1}^{K}{B}_{k}^{2}f({x}_{k})/\sum _{k=1}^{K}f({x}_{k}),\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\text{VAR}=\sum _{k=1}^{K}{S}_{k}f({x}_{k})/\sum _{k=1}^{K}f({x}_{k})$$

and

$$\text{MISE}=\sum _{k=1}^{K}{\text{MSE}}_{k}f({x}_{k})/\sum _{k=1}^{K}f({x}_{k})$$

be the integrated squared bias (Bias^{2}), the integrated variance (VAR) and the integrated mean squared error (MISE) of the estimate ${\widehat{\sigma}}^{2}(\cdot)$, respectively. For the *t*-th simulation experiment, we define

$${\text{ISE}}_{t}=\sum _{k=1}^{K}{({\widehat{\sigma}}_{t}^{2}({x}_{k})-{\sigma}^{2}({x}_{k}))}^{2}f({x}_{k})/\sum _{k=1}^{K}f({x}_{k}),$$

be the integrated squared error for the *t*-th simulation.
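The three summary criteria can be computed in one pass; `est` collects ${\widehat{\sigma}}_{t}^{2}({x}_{k})$ over simulations, and the array layout and function name are our conventions:

```python
import numpy as np

def summarize(est, truth, fx):
    """est: (T, K) estimates over T simulations at K grid points;
    truth: (K,) true sigma^2(x_k); fx: (K,) density weights f(x_k).
    Returns the f-weighted (Bias^2, VAR, MISE) defined above."""
    w = fx / fx.sum()                       # normalized weights f(x_k)/sum f(x_k)
    mean_est = est.mean(axis=0)             # average estimate at each grid point
    Bk2 = (mean_est - truth)**2             # B_k^2
    Sk = ((est - mean_est)**2).mean(axis=0) # S_k
    bias2, var = (Bk2 * w).sum(), (Sk * w).sum()
    return bias2, var, bias2 + var          # MISE = Bias^2 + VAR
```

Since MSE_k = B_k² + S_k, the weighted MISE is simply the sum of the weighted Bias² and VAR, which the last line exploits.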

A naive approach is to regard *α _{g}* in (2) as a smooth function of the intensity, i.e., *α _{g}* = *α*(*X _{gi}*), and to apply nonparametric techniques for variance estimation in nonparametric regression (Ruppert et al., 1997); we denote the resulting naive estimator by ${\widetilde{\sigma}}^{2}(x)$.

To provide a comprehensive view of the performance of the naive and the new estimators, we first compare ${\widetilde{\sigma}}^{2}(x)$ and ${\widehat{\sigma}}^{2}(x)$ under the smoothness assumption on the gene effect *α _{g}*. Data from the naive nonparametric regression model are also generated with

$$\alpha (x)=exp\left(-\frac{1}{1-{(x-13)}^{2}}\right)\text{I}\{12<x<14\}.$$

This allows us to quantify the loss of efficiency when *α _{g}* is indeed a continuous function of the intensity.

Mean integrated squared bias (Bias^{2}), mean integrated variance (VAR), mean integrated squared error (MISE) over 1000 simulations for the variance estimators ${\widehat{\sigma}}^{2}(x)$ and ${\widehat{\sigma}}_{O}^{2}(x)$. Seven different correlation schemes are simulated: **...**

In Table 1, we report the mean integrated squared bias (Bias^{2}), the mean integrated variance (VAR) and the mean integrated squared error (MISE) of ${\widetilde{\sigma}}^{2}(x)$ and ${\widehat{\sigma}}^{2}(x)$, with and without the smoothness assumption on the gene effect *α _{g}*. From the left panel of Table 1, we can see that when the smoothness assumption is valid, the naive estimator ${\widetilde{\sigma}}^{2}(x)$ performs somewhat better, as expected, since it exploits the additional smoothness.

Mean integrated squared bias (Bias^{2}), mean integrated variance (VAR), mean integrated squared error (MISE) over 100 simulations for the variance estimators ${\widetilde{\sigma}}^{2}(x)$ and ${\widehat{\sigma}}^{2}(x)$. Two different gene effect functions *α*(·) **...**

To see how the variance estimators behave, we plot in Figure 1 the typical estimates ${\widetilde{\sigma}}^{2}(x)$ and ${\widehat{\sigma}}^{2}(x)$ with the median ISE value among the 100 simulations. The solid line is the true variance function, while the dotted and dashed lines represent ${\widetilde{\sigma}}^{2}(x)$ and ${\widehat{\sigma}}^{2}(x)$, respectively. On the left panel of Figure 1, we can see that ${\widetilde{\sigma}}^{2}(x)$ outperforms ${\widehat{\sigma}}^{2}(x)$ when the smoothness assumption is valid. The region where the biases occur has already been explained above. However, ${\widetilde{\sigma}}^{2}(x)$ generates substantial bias when the naive nonparametric regression model does not hold, whereas our nonparametric estimator ${\widehat{\sigma}}^{2}(x)$ corrects the bias very well.

In this example, we consider the setting of §3.1 in which the smoothness assumption on the gene effect *α _{g}* is not valid. For comparison purposes only, we add an oracle estimator ${\widehat{\sigma}}_{O}^{2}(x)$ in which the parameters *ρ*, *σ*_{1} and *σ*_{2} are assumed known.

It is worth noticing that the performances of ${\widehat{\sigma}}_{O}^{2}(x)$ and ${\widehat{\sigma}}_{A}^{2}(x)$ are almost always the same, which indicates that our algorithm for estimating *ρ*, *σ*_{1} and *σ*_{2} is very accurate. To see this more clearly, the squared bias, variance and MSE of the estimators of *ρ*, *σ*_{1} and *σ*_{2} used in ${\widehat{\sigma}}_{A}^{2}(x)$ under the seven correlation settings are reported in Table 3. Here the true values of *σ*_{1} and *σ*_{2} are 0.4217 and 0.1857, respectively. For example, when *ρ* = 0.8, the bias of $\widehat{\rho}$ is less than 0.002, which is acceptable because the convergence threshold in the algorithm is set to 0.001.

Squared bias, variance and MSE of $\widehat{\rho}$, ${\widehat{\sigma}}_{1}$ and ${\widehat{\sigma}}_{2}$ in the estimate
${\widehat{\sigma}}_{A}^{2}(x)$. All quantities are multiplied by 10^{6}.

In Figure 2, we render the estimates ${\widehat{\sigma}}^{2}(x)$ and ${\widehat{\sigma}}_{A}^{2}(x)$ with the median ISE under four different correlation settings: *ρ* = −0.4, *ρ* = 0, *ρ* = 0.6 and *ρ* = 0.8. We omit the other correlation schemes since they all show similar performance. The solid lines represent the true variance function; the dotted and dashed lines are for ${\widehat{\sigma}}^{2}(x)$ and ${\widehat{\sigma}}_{A}^{2}(x)$, respectively. For the case *ρ* = 0, the two estimators are indistinguishable. When *ρ* < 0, ${\widehat{\sigma}}^{2}(x)$ overestimates the genewise variance function, whereas when *ρ* > 0, it underestimates it.

Our real data example comes from the MicroArray Quality Control (MAQC) project (Patterson et al., 2006). The main purpose of the original study was to compare the reproducibility, sensitivity and specificity of microarray measurements across different platforms (i.e., one-color and two-color) and testing sites. The MAQC project uses two RNA samples: Stratagene Universal Human Reference total RNA and Ambion Human Brain Reference total RNA. The two samples were assayed on three kinds of arrays: Agilent, CapitalBio and TeleChem, with data collected at five sites. Our study focuses only on the Agilent arrays. At each site, 10 two-color Agilent microarrays were assayed, 5 of them dye-swapped, totalling 30 microarrays.

In the first application, we revisit the validation test considered in Fan and Niu (2007). For the purpose of the validation tests, we use the gProcessedSignal and rProcessedSignal values from the Agilent Feature Extraction software as input. We follow the preprocessing scheme described in Patterson et al. (2006) and obtain 22,144 genes from a total of 41,675 non-control genes. Among those, 19 genes, each with 10 replications, are used for the validation tests. Under the null hypothesis of no experimental biases, a reasonable model is

$${Y}_{gi}={\alpha}_{g}+{\epsilon}_{gi},\phantom{\rule{0.38889em}{0ex}}{\epsilon}_{gi}\sim N(0,{\sigma}_{g}^{2}),\phantom{\rule{0.38889em}{0ex}}i=1,\cdots ,I,g=1,\cdots ,G.$$

(20)

We use the notation *G* to denote the number of genes that have *I* replications. For our data, *G* = 19 and *I* = 10. Note that *G* can be different from *N*, the total number of different genes. The validation test statistics in Fan and Niu (2007) include weighted statistics

$${T}_{1}=\sum _{g=1}^{G}\left\{\sum _{i=1}^{I}{({Y}_{gi}-{\overline{Y}}_{g})}^{2}/{\sigma}_{g}^{2}\right\},\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}{T}_{2}=\sum _{g=1}^{G}\left\{\sum _{i=1}^{I}\mid {Y}_{gi}-{\overline{Y}}_{g}\mid /{\sigma}_{g}\right\},$$

and unweighted test statistics

$$\begin{array}{l}{T}_{3}=\left\{\sum _{g=1}^{G}\sum _{i=1}^{I}{({Y}_{gi}-{\overline{Y}}_{g})}^{2}-(I-1)\sum _{g=1}^{G}{\sigma}_{g}^{2}\right\}{\{2(I-1)\sum _{g=1}^{G}{\sigma}_{g}^{4}\}}^{-1/2},\\ {T}_{4}=\left\{\sum _{g=1}^{G}\sum _{i=1}^{I}\mid {Y}_{gi}-{\overline{Y}}_{g}\mid -{\lambda}_{I}\sum _{g=1}^{G}{\sigma}_{g}\right\}/\left\{{\kappa}_{I}{(\sum _{g=1}^{G}{\sigma}_{g}^{2})}^{1/2}\right\},\end{array}$$

where
${\lambda}_{I}=\sqrt{2I(I-1)/\pi}$ and
${\kappa}_{I}^{2}=var({\sum}_{i=1}^{I}\mid {\epsilon}_{gi}-{\overline{\epsilon}}_{g}\mid /{\sigma}_{g})$. Under the null hypothesis, the test statistic *T*_{1} is *χ*^{2} distributed with (*I* − 1)*G* degrees of freedom, and *T*_{2}, *T*_{3} and *T*_{4} are all asymptotically normally distributed. As a result, the corresponding *p*-values can easily be computed.
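As an illustration, the unweighted statistic *T*₃ and its two-sided normal *p*-value can be computed as follows (the array layout and helper name are our conventions; *T*₁, *T*₂ and *T*₄ follow the same pattern with their respective reference distributions):

```python
import numpy as np
from math import erfc, sqrt

def validation_T3(Y, sigma2_g):
    """Unweighted statistic T3 for replicated log-ratios Y of shape (G, I),
    given genewise variances sigma2_g of shape (G,).
    Returns (T3, two-sided normal p-value)."""
    G, I = Y.shape
    resid2 = ((Y - Y.mean(axis=1, keepdims=True))**2).sum()   # sum_g sum_i (Y_gi - Ybar_g)^2
    T3 = (resid2 - (I - 1) * sigma2_g.sum()) / sqrt(2 * (I - 1) * (sigma2_g**2).sum())
    p = erfc(abs(T3) / sqrt(2.0))   # = 2 * Phi(-|T3|)
    return T3, p
```

The centering and scaling follow from the fact that, under the null, each $\sum_i (Y_{gi} - \overline{Y}_g)^2 / \sigma_g^2$ is $\chi^2_{I-1}$ with mean *I* − 1 and variance 2(*I* − 1).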

Here, we apply the same statistics *T*_{1}, *T*_{2}, *T*_{3} and *T*_{4} but we replace the pooled sample variance estimator by the aggregated local linear estimator

$${\widehat{\sigma}}_{g}^{2}=\sum _{i=1}^{I}{\widehat{\sigma}}_{A}^{2}({X}_{gi})\widehat{f}({X}_{gi})/\sum _{i=1}^{I}\widehat{f}({X}_{gi}),$$

where $\widehat{f}$ is the estimated density function of *X _{gi}*. The difference between the new variance estimator and the simple pooled variance estimator is that we consider the genewise variance as a nonparametric function of the intensity level. The latter estimator may drag small variances of certain arrays to much higher levels by averaging, resulting in larger estimated genewise variances and hence smaller test statistics or bigger *p*-values.

In the analysis here, we first consider all thirty arrays. The estimated correlation among replicated genes is $\widehat{\rho}$ = 0.69. The *p*-values based on the newly estimated genewise variance are depicted in Table 4. As explained in Fan and Niu (2007), *T*_{4} is the most stable of the four tests. It turns out that none of the arrays needs further normalization, in agreement with Fan and Niu (2007). Furthermore, we separate the analysis into two groups: the first group uses the 15 arrays without dye-swap, with estimated correlation $\widehat{\rho}$ = 0.66, and the second group uses the 15 arrays with dye-swap, with estimated correlation $\widehat{\rho}$ = 0.34. The *p*-values are summarized in Table 5. The results show that arrays AGL-2-D3 and AGL-2-D5 need further normalization at the 5% significance level. The difference is due to the decreased estimate of *ρ* for the dye-swap arrays, as the *p*-values are sensitive to the genewise variance. We also repeated the analysis with the data separated into six groups: with and without dye swap, crossed with the three experimental sites. Due to the small sample sizes, the six estimates of *ρ* range from 0.08 to 0.74, and we again find that array AGL-2-D3 needs further normalization.

Comparison of p-values for T_{1}, ···, T_{4} for MAQC project data considering all 30 arrays together

To detect the differentially expressed genes, we follow the filtering instructions of Patterson et al. (2006) and retain 19,802 genes out of 41,000 unique non-control genes; here *I* = 1. The dye-swap results were averaged before applying the one-sample *t*-test. Thus at each site we have five microarrays.

For each site, significant genes are selected based on these 5 dye-swapped averaged arrays. For all *N* = 19,802 genes, there are no within-array replications. However, model (2) is still reasonable, with *i* now indexing the array. Hence, the “within-array correlation” becomes the “between-array correlation,” which is reasonably assumed to be *ρ* = 0.

In our nonparametric estimation, all 19,802 genes are used to estimate the variance function, giving us good reason to believe that the estimator
${\widehat{\sigma}}_{A}^{2}(x)$ is close to the inherent true variance function *σ*^{2}(*x*).

We applied both the *t*-test and the *z*-test to each gene to see whether the logarithm of the expression ratio is zero, using the five arrays collected at each site. The numbers of differentially expressed genes detected by the two tests under three fold changes (FC) and four significance levels are given in Table 6. Large numbers of genes are identified as differentially expressed, which is expected when comparing a brain sample with a tissue pool sample (Patterson et al., 2006). We can see clearly that the *z*-test associated with our new variance estimator
${\widehat{\sigma}}_{A}^{2}(x)$ leads to more differentially expressed genes. For example, at site 1 with *α* = 0.001, among genes with fold change at least 2, the *t*-test picks 8231 genes whereas the *z*-test selects 8875 genes. This gives an empirical power increase of (8875 − 8231)/19802 ≈ 3.25% in the group with observed fold change at least 2.
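To make the comparison concrete, the following sketch applies the two tests gene by gene to hypothetical data with five arrays per gene. The *t*-test uses the per-gene sample standard deviation (4 degrees of freedom), while the *z*-test plugs in the variance-function value, taken as known here for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
N, n = 2000, 5                                   # hypothetical genes and arrays per gene
sigma = 0.5
mu = np.where(rng.random(N) < 0.3, 1.0, 0.0)     # 30% truly differentially expressed
Y = rng.normal(mu[:, None], sigma, size=(N, n))

# t-test: per-gene sample SD, n-1 degrees of freedom
t_stat = Y.mean(axis=1) / (Y.std(axis=1, ddof=1) / np.sqrt(n))
p_t = 2 * stats.t.sf(np.abs(t_stat), df=n - 1)

# z-test: plug in the variance-function value (known here by construction)
z_stat = Y.mean(axis=1) / (sigma / np.sqrt(n))
p_z = 2 * stats.norm.sf(np.abs(z_stat))

alpha = 0.001
n_t, n_z = int((p_t < alpha).sum()), int((p_z < alpha).sum())
empirical_power_increase = (n_z - n_t) / N
```

With only five arrays, the *t* critical value (about 8.61 at *α* = 0.001, two-sided) is far larger than the normal critical value (about 3.29), which is why the *z*-test detects substantially more genes.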

To verify the accuracy of our variance estimation in the *z*-test, we compare the empirical power increase with the expected theoretical power increase. The expected theoretical power increase is computed as

$$\text{ave}\{{\text{P}}_{z}({\mu}_{g}/{\sigma}_{g})-{\text{P}}_{{t}_{n-1}}({\mu}_{g}/{\sigma}_{g})\},$$

(21)

taking the average of power increases across all *μ _{g}* ≠ 0. However, since the true *μ _{g}* and *σ _{g}* are not available, we replace them by their estimates.

There are two things worth noticing here. First, for the expected theoretical power increase, we use the sample mean ${\overline{Y}}_{g}$ as an estimate of *μ _{g}*. Second, *σ _{g}* is replaced by the aggregated estimator ${\widehat{\sigma}}_{A}({X}_{gi})$.
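Formula (21) can be evaluated numerically. The sketch below computes the two-sided *z*- and *t*-test powers at standardized effect size *δ* = *μ _{g}*/*σ _{g}* with *n* = 5 arrays, using the noncentral *t* distribution for the *t*-test; the effect sizes chosen are hypothetical.

```python
import numpy as np
from scipy import stats

def power_z(delta, n, alpha):
    """Two-sided z-test power at standardized effect size delta = mu/sigma."""
    zc = stats.norm.ppf(1 - alpha / 2)
    nc = np.sqrt(n) * delta
    return stats.norm.sf(zc - nc) + stats.norm.cdf(-zc - nc)

def power_t(delta, n, alpha):
    """Two-sided one-sample t-test power via the noncentral t distribution."""
    df = n - 1
    tc = stats.t.ppf(1 - alpha / 2, df)
    nc = np.sqrt(n) * delta
    return stats.nct.sf(tc, df, nc) + stats.nct.cdf(-tc, df, nc)

# expected theoretical power increase, averaged over hypothetical effect sizes
deltas = np.array([0.5, 1.0, 1.5, 2.0])
increase = np.mean(power_z(deltas, n=5, alpha=0.001) - power_t(deltas, n=5, alpha=0.001))
```

In practice one averages over the estimated *μ _{g}*/*σ _{g}* across the genes, as in (21), rather than over a fixed grid.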

We also apply the SIMEX and permutation SIMEX methods of Wang and Carroll (2008) to the MAQC data to illustrate their utility. As mentioned in the introduction, their model is not really intended for the analysis of two-color microarray data. If we only use the information on the log-ratios (*Y*), the model is very hard to interpret, and one might question why the information on *X* (observed intensity levels) is not used at all. Nevertheless, we apply the SIMEX methods of Wang and Carroll (2008) to the log-ratios *Y* alone and produce tables analogous to Tables 6 and 7.

From the results, we make the following observations. First, the numbers for the *z*-test in Tables 8 and 9 are approximately the same at all significance levels. In fact, the *p*-values are so small that the counts at different significance levels barely differ. This indicates that both SIMEX and permutation SIMEX tend to estimate the genewise variance to be very small, making the test statistics uniformly large. In contrast, our method estimates the genewise variance moderately, so the counts differ across significance levels. Second, in our implementation, SIMEX and permutation SIMEX are computationally expensive (more than one hour) while our method takes only a few minutes. Third, Tables 10 and 11 show that the expected theoretical power increases and the empirical ones are reasonably close, in line with our method.

Comparison of Expected Theoretical and Empirical Power Difference using SIMEX method (In percentage)

The estimation of the genewise variance function is motivated by downstream analyses of microarray data such as validation tests and selecting statistically differentially expressed genes. The methodology proposed here is novel in using both across-array and within-array replications. It does not require any specific assumptions on *α _{g}*, such as sparsity or smoothness, and hence reduces the bias of conventional nonparametric estimators. Although the number of nuisance parameters is proportional to the sample size, we can estimate the quantity of main interest (the variance function) consistently. By greatly increasing the degrees of freedom, both the validation tests and the tests for selecting differentially expressed genes gain power.

Our proposed methodology has a wide range of applications. In addition to microarray data analysis with within-array replications, it can also be applied to the case without within-array replications, as long as model (2) is reasonable. Our two-way nonparametric model is a natural extension of the Neyman-Scott problem and is therefore applicable wherever the Neyman-Scott problem is.

There are possible extensions. For example, the SIMEX idea can be applied to our model to take measurement error into account. We can also adapt our methods to a prior correlation structure among replications other than the identical-correlation assumption.

The authors thank the Editor, the Associate Editor and two referees, whose comments have greatly improved the scope and presentation of the paper.

The following regularity conditions are imposed for the technical proofs:

- The regression function *σ*^{2}(*x*) has a bounded and continuous second derivative.
- The kernel function *K* is a bounded symmetric density function with a bounded support.
- *h* → 0, *Nh* → ∞.
- E[*σ*^{8}(*X*)] exists and the marginal density *f _{X}*(·) is continuous.

We need the following conditional variance-covariance matrix of the random vector **Z*** _{g}* in our asymptotic study.

Let **Ω** be the variance-covariance matrix of **Z**_{g} conditioning on all data **X**. Then the diagonal and off-diagonal elements are, respectively:

$${\mathrm{\Omega}}_{ii}=2{\sigma}^{4}({X}_{gi})+\frac{2}{{(I-1)}^{2}{(I-2)}^{2}}\sum _{k\ne l}{\sigma}^{2}({X}_{gk}){\sigma}^{2}({X}_{gl})+\frac{4(I-3)}{(I-1){(I-2)}^{2}}{\sigma}^{2}({X}_{gi})\sum _{j\ne i}{\sigma}^{2}({X}_{gj}),\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}i=1,\cdots ,I,$$

(22)

$${\mathrm{\Omega}}_{ij}=\frac{4}{{(I-1)}^{2}}{\sigma}^{2}({X}_{gi}){\sigma}^{2}({X}_{gj})+\frac{2}{{(I-1)}^{2}{(I-2)}^{2}}\sum _{\begin{array}{c}k\ne l\\ k,l\ne i,j\end{array}}{\sigma}^{2}({X}_{gk}){\sigma}^{2}({X}_{gl})-\frac{4}{{(I-1)}^{2}(I-2)}\sum _{k\ne i,j}{\sigma}^{2}({X}_{gk})({\sigma}^{2}({X}_{gi})+{\sigma}^{2}({X}_{gj})),\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}i,j=1,\cdots ,I,i\ne j.$$

(23)

Let **A** be the variance-covariance matrix of **r**_{g} conditioning on all data **X**. Then the diagonal elements are given by

$$\begin{array}{l}{A}_{ii}=var[{({Y}_{gi}-{\overline{Y}}_{g})}^{2}\mid \mathit{X}]\\ =\frac{2{(I-1)}^{4}}{{I}^{4}}{\sigma}^{4}({X}_{gi})+\frac{4{(I-1)}^{2}}{{I}^{4}}\sum _{k\ne i}{\sigma}^{2}({X}_{gi}){\sigma}^{2}({X}_{gk})+\frac{2}{{I}^{4}}\sum _{k\ne i}{\sigma}^{4}({X}_{gk})+\frac{4}{{I}^{4}}\sum _{l,k\ne i,l<k}{\sigma}^{2}({X}_{gl}){\sigma}^{2}({X}_{gk}),\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}i=1,\cdots ,I,\end{array}$$

(24)

and the off-diagonal elements are given by

$$\begin{array}{l}{A}_{ij}=cov\{[{({Y}_{gi}-{\overline{Y}}_{g})}^{2},{({Y}_{gj}-{\overline{Y}}_{g})}^{2}]\mid \mathit{X}\}\\ =\frac{2{(I-1)}^{2}}{{I}^{4}}[{\sigma}^{4}({X}_{gi})+{\sigma}^{4}({X}_{gj})]+\frac{4{(I-1)}^{2}}{{I}^{4}}{\sigma}^{2}({X}_{gi}){\sigma}^{2}({X}_{gj})-\frac{4(I-1)}{{I}^{4}}\sum _{k\ne i,j}{\sigma}^{2}({X}_{gk})({\sigma}^{2}({X}_{gi})+{\sigma}^{2}({X}_{gj}))+\frac{4}{{I}^{4}}\sum _{k,l\ne i,j;l<k}{\sigma}^{2}({X}_{gl}){\sigma}^{2}({X}_{gk})+\frac{2}{{I}^{4}}\sum _{k\ne i,j}{\sigma}^{4}({X}_{gk}).\end{array}$$

(25)

Using **Ω** = **BAB*** ^{T}*, we can obtain the result by direct computation.
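As a sanity check of formulas (24) and (25) (our own check, not part of the paper), one can compare them with the exact Gaussian identities var(*U*^{2}) = 2 var(*U*)^{2} and cov(*U*^{2}, *V*^{2}) = 2 cov(*U*, *V*)^{2} for independent mean-zero normal errors with variances *σ*^{2}(*X _{gi}*); the variance values below are arbitrary.

```python
import numpy as np
from itertools import combinations

v = np.array([0.7, 1.3, 2.1, 0.4])   # arbitrary values of sigma^2(X_gi), I = 4
I = len(v)

# Formula (24): A_ii for i = 0
i = 0
A_ii = (2 * (I - 1) ** 4 / I ** 4 * v[i] ** 2
        + 4 * (I - 1) ** 2 / I ** 4 * sum(v[i] * v[k] for k in range(I) if k != i)
        + 2 / I ** 4 * sum(v[k] ** 2 for k in range(I) if k != i)
        + 4 / I ** 4 * sum(v[l] * v[k]
                           for l, k in combinations([m for m in range(I) if m != i], 2)))

# Exact Gaussian value: Y_i - Ybar is normal, so var((Y_i - Ybar)^2) = 2 var(Y_i - Ybar)^2
s = ((I - 1) ** 2 * v[i] + sum(v[k] for k in range(I) if k != i)) / I ** 2

# Formula (25): A_ij for (i, j) = (0, 1)
j = 1
rest = [m for m in range(I) if m not in (i, j)]
A_ij = (2 * (I - 1) ** 2 / I ** 4 * (v[i] ** 2 + v[j] ** 2)
        + 4 * (I - 1) ** 2 / I ** 4 * v[i] * v[j]
        - 4 * (I - 1) / I ** 4 * sum(v[k] for k in rest) * (v[i] + v[j])
        + 4 / I ** 4 * sum(v[l] * v[k] for l, k in combinations(rest, 2))
        + 2 / I ** 4 * sum(v[k] ** 2 for k in rest))

# Exact Gaussian value: cov(U^2, V^2) = 2 cov(U, V)^2 with U = Y_i - Ybar, V = Y_j - Ybar
c = (v.sum() - I * (v[i] + v[j])) / I ** 2
```

Both formulas agree with the closed Gaussian forms term by term, which can also be verified algebraically by expanding 2 var(*Y _{i}* − *Ȳ*)^{2} and 2 cov(*Y _{i}* − *Ȳ*, *Y _{j}* − *Ȳ*)^{2}.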

The proofs of Theorems 1 and 3 follow a similar idea. Since Theorem 1 involves fewer coefficients, we present the proof of Theorem 1 and explain the differences for Theorem 3.

First of all, the bias of ${\widehat{\eta}}_{i}^{2}(x)$ comes from the local linear approximation. Since ${\{({X}_{gi},{Z}_{gi})\}}_{g=1}^{N}$ is an i.i.d. sequence, by (4) and the results in Fan and Gijbels (1996), it follows that

$$E\{{\widehat{\eta}}_{i}^{2}(x)\mid \mathit{X}\}={\sigma}^{2}(x)+b(x)+{o}_{P}({h}^{2}).$$

Similarly, the asymptotic variance of ${\widehat{\eta}}_{i}^{2}(x)$ follows from Fan and Gijbels (1996).

We now consider the off-diagonal elements of the matrix var[**η̂** | **X**]; specifically, we show that

$$cov[({\widehat{\eta}}_{i}^{2}(x),{\widehat{\eta}}_{j}^{2}(x))\mid \mathit{X}]={V}_{2}+{o}_{P}(1/N).$$

(26)

Recalling that ${\widehat{\eta}}_{i}^{2}(x)={\sum}_{g=1}^{N}{W}_{N,i}(({X}_{gi}-x)/h){Z}_{gi}$, we have

$$cov[({\widehat{\eta}}_{i}^{2}(x),{\widehat{\eta}}_{j}^{2}(x))\mid \mathit{X}]=\sum _{g=1}^{N}{W}_{N,i}\left(\frac{{X}_{gi}-x}{h}\right){W}_{N,j}\left(\frac{{X}_{gj}-x}{h}\right)cov[({Z}_{gi},{Z}_{gj})\mid \mathit{X}].$$

The equality follows from the fact that cov[(*Z _{gi}*, *Z _{g′j}*) | **X**] = 0 for *g* ≠ *g*′, since observations are independent across genes. Letting ${R}_{N,g}=N\cdot {W}_{N,j}(({X}_{gj}-x)/h)\phantom{\rule{0.16667em}{0ex}}cov[({Z}_{gi},{Z}_{gj})\mid \mathit{X}]$, we obtain

$$N\xb7cov[({\widehat{\eta}}_{i}^{2}(x),{\widehat{\eta}}_{j}^{2}(x))\mid \mathit{X}]=\sum _{g=1}^{N}{W}_{N,i}\left(\frac{{X}_{gi}-x}{h}\right){R}_{N,g}.$$

(27)

The right-hand side of (27) can be seen as a local linear smoother of the synthetic data
${\{({X}_{gi},{R}_{N,g})\}}_{g=1}^{N}$. Although *R _{N,g}* involves the sample size *N*, the standard argument for local linear smoothers still applies, yielding

$$N\xb7cov[({\widehat{\eta}}_{i}^{2}(x),{\widehat{\eta}}_{j}^{2}(x))\mid \mathit{X}]=\text{E}[{R}_{N,g}\mid {X}_{gi}=x]+{o}_{P}(1).$$

To calculate E[*R _{N,g}* | *X _{gi}* = *x*], we compute

$$\begin{array}{l}\text{E}[{R}_{N,g}\mid {X}_{gi}=x]=\text{E}\left[N\xb7\frac{1}{{\mathit{Nhf}}_{X}(x)}{hK}_{h}({X}_{gj}-x){\mathrm{\Omega}}_{ij}\mid {X}_{gi}=x\right](1+{o}_{P}(1))\\ ={({f}_{X}(x))}^{-1}\int K(u){\mathrm{\Omega}}_{ij}{\mid}_{{X}_{gi}=x}(x+hu,\mathbf{s}){f}_{X}(x+hu)\mathit{dud}\mathbf{s}+{o}_{P}(1)\\ ={NV}_{2}+{o}_{P}(1),\end{array}$$

where **s** represents all the integrating variables corresponding to *X _{gk}*, *k* ≠ *i*, *j*. This establishes (26).

To prove the multivariate asymptotic normality

$${\mathbf{\sum}}^{-1/2}(\mathit{\eta}-({\sigma}^{2}(x)+b(x)+{o}_{P}({h}^{2}))\mathbf{e})\stackrel{D}{\to}N(0,{\mathbf{I}}_{I}),$$

(28)

we employ the Cramér–Wold device: for any unit vector **a** = (*a*_{1}, ···, *a _{I}*)^{T}, it suffices to show that

$${F}^{\ast}\triangleq {\{{\mathbf{a}}^{T}\mathbf{\sum}\mathbf{a}\}}^{-1/2}\left\{\sum _{i=1}^{I}{a}_{i}\sum _{g=1}^{N}{W}_{N,i}\left(\frac{{X}_{gi}-x}{h}\right)({Z}_{gi}-{\sigma}^{2}({X}_{gi}))\right\}\stackrel{D}{\to}N(0,1).$$

Denote ${Q}_{g,i}={W}_{N,i}(({X}_{gi}-x)/h)({Z}_{gi}-{\sigma}^{2}({X}_{gi}))$ and ${\stackrel{\sim}{Q}}_{g}={\sum}_{i=1}^{I}{a}_{i}{Q}_{g,i}$. By Lyapunov's central limit theorem, it suffices to verify that

$$\underset{N\to \infty}{lim}\frac{{\sum}_{g=1}^{N}\text{E}[{\mid {\stackrel{\sim}{Q}}_{g}\mid}^{4}\mid \mathit{X}]}{{({\sum}_{g=1}^{N}\text{E}[{\mid {\stackrel{\sim}{Q}}_{g}\mid}^{2}\mid \mathit{X}])}^{2}}=0.$$

To facilitate the presentation, we first note that, for each fixed *i*, the sequence
${\{{Q}_{g,i}\}}_{g=1}^{N}$ is i.i.d. and satisfies Lyapunov's condition. Denote
${\delta}_{N,i}^{2}={\sum}_{g=1}^{N}\text{E}[{\mid {Q}_{g,i}\mid}^{2}\mid \mathit{X}]$ and recall that
${\delta}_{N,i}^{2}=var[{\widehat{\eta}}_{i}^{2}(x)\mid \mathit{X}]={O}_{P}({(Nh)}^{-1})$. Let *c** be a generic constant which may vary from line to line. We have the approximation

$$\begin{array}{l}\sum _{g=1}^{N}\text{E}[{\mid {Q}_{g,i}\mid}^{4}\mid \mathit{X}]={c}^{\ast}{N}^{-3}\text{E}[{K}_{h}^{4}({X}_{gi}-x){({Z}_{gi}-{\sigma}^{2}({X}_{gi}))}^{4}\mid \mathit{X}](1+{o}_{P}(1))\\ ={O}_{P}({(Nh)}^{-3}).\end{array}$$

Therefore ${\sum}_{g=1}^{N}\text{E}[{\mid {Q}_{g,i}\mid}^{4}\mid \mathit{X}]=o({\delta}_{N,i}^{4})$. By the marginal Lyapunov conditions, we have the following inequality

$$\sum _{g=1}^{N}\text{E}[{\stackrel{\sim}{Q}}_{g}^{4}\mid \mathit{X}]\le {c}^{\ast}\sum _{i=1}^{I}\sum _{g=1}^{N}\text{E}[{\mid {Q}_{g,i}\mid}^{4}\mid \mathit{X}]={c}^{\ast}I\xb7{o}_{P}({(Nh)}^{-2})={o}_{P}({(Nh)}^{-2}).$$

For the denominator, we have the following arguments

$$\begin{array}{l}\sum _{g=1}^{N}\text{E}[{\mid {\stackrel{\sim}{Q}}_{g}\mid}^{2}\mid \mathit{X}]=\sum _{i}{a}_{i}^{2}\sum _{g=1}^{N}\text{E}[{Q}_{g,i}^{2}\mid \mathit{X}]+\sum _{i\ne j}{a}_{i}{a}_{j}\sum _{g=1}^{N}\text{E}[{Q}_{g,i}{Q}_{g,j}\mid \mathit{X}]\\ =\sum _{i}{a}_{i}^{2}var[{\widehat{\eta}}_{i}^{2}(x)\mid \mathit{X}]+\sum _{i\ne j}{a}_{i}{a}_{j}cov[({\widehat{\eta}}_{i}^{2}(x),{\widehat{\eta}}_{j}^{2}(x))\mid \mathit{X}]\\ \stackrel{\ast}{=}{O}_{P}({(Nh)}^{-1})+{O}_{P}({N}^{-1})\\ ={O}_{P}({(Nh)}^{-1}).\end{array}$$

Note that the second to last equality holds by the asymptotic conditional variance-covariance matrix **Σ**. Therefore Lyapunov’s condition is justified. That completes the proof.

First of all, for each given *g*,

$$\text{E}{s}_{B,g}^{2}=Ivar({\overline{Y}}_{gj})={\sigma}_{2}+\rho (I-1){\sigma}_{1}^{2}.$$

Note that by (8), we have

$$\begin{array}{l}E{({Y}_{\mathit{gij}}-{\overline{Y}}_{gj})}^{2}={I}^{-2}[I(I-1){\sigma}_{2}+\rho (I-1)(I-2){\sigma}_{1}^{2}-2{(I-1)}^{2}\rho {\sigma}_{1}^{2}]\\ ={I}^{-1}(I-1)({\sigma}_{2}-\rho {\sigma}_{1}^{2}).\end{array}$$

Thus, for all *g*, we have

$$\text{E}{s}_{W,g}^{2}={\sigma}_{2}-\rho {\sigma}_{1}^{2}.$$

Since {${s}_{B,g}^{2}$} and {${s}_{W,g}^{2}$} are i.i.d. sequences across the *N* genes, by the central limit theorem, we have

$$\begin{array}{l}\frac{1}{N}\sum _{g=1}^{N}{s}_{B,g}^{2}={\sigma}_{2}+\rho (I-1){\sigma}_{1}^{2}+{O}_{P}({N}^{-1/2}),\\ \frac{1}{N}\sum _{g=1}^{N}{s}_{W,g}^{2}={\sigma}_{2}-\rho {\sigma}_{1}^{2}+{O}_{P}({N}^{-1/2}).\end{array}$$

Therefore,

$$\begin{array}{l}{\widehat{\rho}}_{0}=\frac{{\sigma}_{2}+\rho (I-1){\sigma}_{1}^{2}-{\sigma}_{2}+\rho {\sigma}_{1}^{2}+{O}_{P}({N}^{-1/2})}{{\sigma}_{2}+\rho (I-1){\sigma}_{1}^{2}+(I-1)({\sigma}_{2}-\rho {\sigma}_{1}^{2})+{O}_{P}({N}^{-1/2})}\\ =\rho {\sigma}_{1}^{2}/{\sigma}_{2}+{O}_{P}({N}^{-1/2}).\end{array}$$
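The moment calculations above suggest a quick simulation check (our own sketch; the definitions of ${s}_{B,g}^{2}$ and ${s}_{W,g}^{2}$ below are plausible versions consistent with the expectations just derived, and the homoscedastic setting *σ*_{1}^{2} = *σ*_{2} = *σ*^{2} is an assumption), under which ${\widehat{\rho}}_{0}$ should converge to *ρ*.

```python
import numpy as np

rng = np.random.default_rng(3)
N, J, I, rho, sigma = 20000, 10, 4, 0.4, 1.0   # genes, arrays, replications, correlation

# Equicorrelated within-array errors: Y_gij = sigma*(sqrt(rho)*Z0_gj + sqrt(1-rho)*Z_gij)
Z0 = rng.normal(size=(N, J, 1))
Z = rng.normal(size=(N, J, I))
Y = sigma * (np.sqrt(rho) * Z0 + np.sqrt(1 - rho) * Z)

Ybar = Y.mean(axis=2)                            # per-array means Ybar_gj
s2B = (I * Ybar.var(axis=1, ddof=1)).mean()      # E s2_B = sigma2 + rho*(I-1)*sigma1^2
s2W = Y.var(axis=2, ddof=1).mean()               # E s2_W = sigma2 - rho*sigma1^2

# Moment estimator from the display above; equals rho*sigma1^2/sigma2 = rho here
rho_hat0 = (s2B - s2W) / (s2B + (I - 1) * s2W)
```

With *N* = 20,000 genes the *O _{P}*(*N*^{−1/2}) terms are negligible and the estimate lands very close to the true *ρ* = 0.4.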

Note that

$$var[{\widehat{\eta}}_{A}^{2}(x)\mid \mathit{X}]=\sum _{g=1}^{N}\sum _{i=1}^{I}{W}_{N}^{2}\left(\frac{{X}_{gi}-x}{h}\right)var[{Z}_{gi}\mid \mathit{X}]+\sum _{g=1}^{N}\sum _{i\ne j}^{I}{W}_{N}\left(\frac{{X}_{gi}-x}{h}\right){W}_{N}\left(\frac{{X}_{gj}-x}{h}\right)cov[({Z}_{gi},{Z}_{gj})\mid \mathit{X}].$$

Following steps similar to those in the proof of Theorem 1, one can verify that
$var[{\widehat{\eta}}_{A}^{2}(x)\mid \mathit{X}]={V}_{1}^{\prime}/I+(1-1/I){V}_{2}^{\prime}+{o}_{P}({(Nh)}^{-1})$, where the coefficients *C*_{2}, ···, *C*_{4}, *D*_{0}, ···, *D*_{4} are as follows:

$$\begin{array}{l}{C}_{2}=\frac{4(1+{\rho}^{2}){\sigma}_{2}+[4\rho (I-2)+4{\rho}^{2}(2I-3)]{\sigma}_{1}^{2}}{I-1},\\ {C}_{3}=-\frac{8{\rho}^{2}(I-3){\sigma}_{1}^{3}+8({\rho}^{2}+\rho ){\sigma}_{1}{\sigma}_{2}}{I-1},\\ {C}_{4}=\frac{2}{(I-1)(I-2)}\left\{(1+{\rho}^{2}){\sigma}_{2}^{2}+2({\rho}^{2}+\rho )(I-3){\sigma}_{1}^{2}{\sigma}_{2}+(I-3)(I-4){\rho}^{2}{\sigma}_{1}^{4}\right\},\\ {D}_{0}=2\left({\rho}^{2}-\frac{4\rho}{I-1}+\frac{2(1+{\rho}^{2})}{{(I-1)}^{2}}\right),\\ {D}_{1}=\frac{8}{{(I-1)}^{2}}\left\{(2I-4)\rho -({I}^{2}-4I+5){\rho}^{2}\right\}{\sigma}_{1},\\ {D}_{2}=\frac{4}{{(I-1)}^{2}(I-2)}\left\{{(I-3)}^{2}{\rho}^{2}+({(I-2)}^{2}+1)\rho -2(I-2)\right\}{\sigma}_{2}+\frac{4(I-3)}{{(I-1)}^{2}(I-2)}\left\{(3(I-2)(I-3)+2){\rho}^{2}-2(I-2)\rho \right\}{\sigma}_{1}^{2},\\ {D}_{3}=-\frac{8{(I-3)}^{2}}{{(I-1)}^{2}(I-2)}\left\{({\rho}^{2}+\rho ){\sigma}_{1}{\sigma}_{2}+(I-4){\rho}^{2}{\sigma}_{1}^{3}\right\},\\ {D}_{4}=\frac{4}{{(I-1)}^{2}{(I-2)}^{2}}\left\{(1+{\rho}^{2})\left(\begin{array}{c}I-2\\ 2\end{array}\right){\sigma}_{2}^{2}+6({\rho}^{2}+\rho )\left(\begin{array}{c}I-2\\ 3\end{array}\right){\sigma}_{1}^{2}{\sigma}_{2}+12{\rho}^{2}\left(\begin{array}{c}I-2\\ 4\end{array}\right){\sigma}_{1}^{4}\right\}.\end{array}$$

^{*}The financial support from NSF grant DMS-0714554 and NIH grant R01-GM072611 is gratefully acknowledged.

Jianqing Fan, Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544.

Yang Feng, Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544.

Yue S. Niu, Department of Mathematics, University of Arizona, 617 N. Santa Rita Ave., P.O. Box 210089, Tucson, AZ 85721-0089.

- Carroll RJ, Wang Y. Nonparametric variance estimation in the analysis of microarray data: a measurement error approach. Biometrika. 2008, to appear.
- Cui X, Hwang JT, Qiu J. Improved statistical tests for differential gene expression by shrinking variance components estimates. Biostatistics. 2005;6:59–75.
- Fan J, Gijbels I. Local Polynomial Modelling and Its Applications. Chapman & Hall; 1996.
- Fan J, Niu Y. Selection and validation of normalization methods for c-DNA microarrays using within-array replications. Bioinformatics. 2007;23:2391–2398.
- Fan J, Peng H, Huang T. Semilinear high-dimensional model for normalization of microarray data: a theoretical analysis and partial consistency (with discussion). Journal of the American Statistical Association. 2005;100:781–813.
- Fan J, Ren Y. Statistical analysis of DNA microarray data in cancer research. Clinical Cancer Research. 2007;12:4469–4473.
- Fan J, Tam P, Vande Woude G, Ren Y. Normalization and analysis of cDNA microarrays using within-array replications applied to neuroblastoma cell response to a cytokine. Proceedings of the National Academy of Sciences, USA. 2004;101(5):1135–1140.
- Huang J, Wang D, Zhang C. A two-way semi-linear model for normalization and significant analysis of cDNA microarray data. Journal of the American Statistical Association. 2005;100:814–829.
- Kamb A, Ramaswami A. A simple method for statistical analysis of intensity differences in microarray-derived gene expression data. BMC Biotechnology. 2001:1–8.
- Neyman J, Scott E. Consistent estimates based on partially consistent observations. Econometrica. 1948;16:1–32.
- Patterson T, et al. Performance comparison of one-color and two-color platforms within the MicroArray Quality Control (MAQC) project. Nature Biotechnology. 2006;24:1140–1150.
- Ruppert D, Wand MP, Holst U, Hössjer O. Local polynomial variance function estimation. Technometrics. 1997;39:262–273.
- Smyth G, Michaud J, Scott H. Use of within-array replicate spots for assessing differential expression in microarray experiments. Bioinformatics. 2005;21:2067–2075.
- Storey JD, Tibshirani R. Statistical significance for genome-wide studies. Proceedings of the National Academy of Sciences, USA. 2003;100:9440–9445.
- Tong T, Wang Y. Optimal shrinkage estimation of variances with applications to microarray data analysis. Journal of the American Statistical Association. 2007;102:113–122.
- Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences, USA. 2001;98:5116–5121.
- Wang Y, Ma Y, Carroll RJ. Variance estimation in the analysis of microarray data. Journal of the Royal Statistical Society, Ser. B. 2008, to appear.