
Article sections

- Abstract
- 1. Introduction
- 2. Preliminaries
- 3. Heterogeneous Data: Aggregation of Commonality
- 4. Examples
- 5. Application to Homogeneous Data: Divide-and-Conquer Approach
- 6. Numerical Experiment
- 7. Proof of Main Results
- Supplementary Material
- References


Ann Stat. Author manuscript; available in PMC 2017 April 18.

Published in final edited form as:

PMCID: PMC5394596

NIHMSID: NIHMS751904

We consider a partially linear framework for modelling massive heterogeneous data. The major goal is to extract common features across all sub-populations while exploring heterogeneity of each sub-population. In particular, we propose an aggregation-type estimator for the commonality parameter that possesses the (non-asymptotic) minimax optimal bound and asymptotic distribution as if there were no heterogeneity. This oracular result holds when the number of sub-populations does not grow too fast. A plug-in estimator for the heterogeneity parameter is further constructed, and shown to possess the asymptotic distribution as if the commonality information were available. We also test the heterogeneity among a large number of sub-populations. All the above results require regularizing each sub-estimation as though it had the entire sample size. Our general theory applies to the divide-and-conquer approach that is often used to deal with massive homogeneous data. A technical by-product of this paper is the statistical inference for general kernel ridge regression. Thorough numerical results are also provided to back up our theory.

In this paper, we propose a partially linear regression framework for
modelling massive heterogeneous data. Let ${\left\{({Y}_{i},{\mathit{X}}_{i},{Z}_{i})\right\}}_{i=1}^{N}$ be samples from an underlying distribution that may change with
*N*. We assume that there exist *s* independent
sub-populations, and the data from the *j*th sub-population follows a
partially linear model:

$$Y={\mathit{X}}^{T}{\beta}_{0}^{\left(j\right)}+{f}_{0}\left(Z\right)+\epsilon ,$$

(1.1)

where *ε* has zero mean and known variance *σ*^{2}. In the above model, *Y* depends on *X* through a linear function that may vary across sub-populations, and depends on *Z* through a nonparametric function *f*_{0} that is common to all sub-populations.
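
To fix ideas, model (1.1) can be simulated directly; the choices of *f*_{0}, the slope vectors, and all sample sizes below are illustrative stand-ins, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_heterogeneous(s=4, n=200, p=2, sigma=0.5):
    """Draw s independent sub-populations from model (1.1):
    Y = X^T beta0_j + f0(Z) + eps, with f0 common and beta0_j varying."""
    f0 = lambda z: np.sin(2 * np.pi * z)   # illustrative common function
    beta0 = rng.normal(size=(s, p))        # illustrative heterogeneous slopes
    data = []
    for j in range(s):
        X = rng.normal(size=(n, p))
        Z = rng.uniform(size=n)
        eps = sigma * rng.normal(size=n)   # zero mean, known variance sigma^2
        data.append((X @ beta0[j] + f0(Z) + eps, X, Z))
    return data, beta0, f0

data, beta0, f0 = simulate_heterogeneous()
```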

The model (1.1) is motivated by the following scenario: different labs conduct the same experiment on the relationship between a response variable *Y* (e.g., heart disease) and a set of predictors *Z*, *X*_{1}, *X*_{2},...,*X _{p}*. It is known from biological knowledge that the dependence of *Y* on *Z* is common across all labs, while the linear effects of *X*_{1},...,*X _{p}* may differ from lab to lab.

Statistical modelling for massive data has attracted a flurry of recent
research. For homogeneous data, the statistical studies of the divide-and-conquer
method currently focus on either parametric inferences, e.g., Bag of Little
Bootstraps (Kleiner et al., 2012), and
parallel MCMC computing (Wang and Dunson,
2013), or nonparametric minimaxity (Zhang
et al., 2013). The other relevant work includes high dimensional linear
models with variable selection (Chen and Xie,
2012) and structured perceptron (McDonald
et al., 2010). Heterogeneous data are often handled by fitting mixture
models (Aitkin and Rubin, 1985; McLachlan and Peel, 2004; Figueiredo and Jain, 2002), time varying coefficient models
(Hastie and Tibshirani, 1993; Fan and Zhang, 1999) or multitask regression
(Huang and Zhang, 2010; Nardi and Rinaldo, 2008; Obozinski et al., 2008). The recent high dimensional work
includes Städler et al. (2010); Meinshausen and Bühlmann (2014).
However, as far as we are aware, *semi-nonparametric inference* for
massive homogeneous/heterogeneous data still remains untouched.

In this paper, our primary goal is to extract common features across all sub-populations while exploring the heterogeneity of each sub-population. Specifically, we employ a simple aggregation procedure, which averages commonality estimators across all sub-populations, and then construct a plug-in estimator for each heterogeneity parameter based on the combined estimator for commonality. The secondary goal is to apply the divide-and-conquer method to a sub-population whose sample size is too large to be processed on a single computer. The above purposes are achieved by estimating our statistical model (1.1) with the kernel ridge regression (KRR) method. The KRR framework is known to be very flexible and well supported by the general reproducing kernel Hilbert space (RKHS) theory (Mendelson, 2002; Steinwart et al., 2009; Zhang, 2005). In particular, the partial smoothing spline model (Wahba, 1990) can be viewed as a special case. An important technical contribution of this paper is that a (point-wise) limit distribution of the KRR estimate is established by generalizing the smoothing spline inference results in Cheng and Shang (2013). This theoretical innovation makes our work go beyond the existing statistical studies of KRR estimation in large datasets, which mainly focus on nonparametric minimaxity, e.g., Zhang et al. (2013); Bach (2012); Raskutti et al. (2014).

Our theoretical studies are mostly concerned with the so-called
“oracle rule” for massive data. Specifically, we define the
“oracle estimate” for commonality (heterogeneity) as the one computed when all the heterogeneity information is given (the commonality information is given in each sub-population), i.e., ${\beta}_{0}^{\left(j\right)}$'s are known (*f*_{0} is known). We claim
that a commonality estimator satisfies the oracle rule if it possesses the same
minimax optimality and asymptotic distribution as the “oracle
estimate” defined above. A major contribution of this paper is to derive the
largest possible diverging rate of *s* under which our combined
estimator for commonality satisfies the oracle rule. In other words, our aggregation
procedure can “filter out” the heterogeneity in data when
*s* does not grow too fast with *N*.
Interestingly, we have to set a lower bound on *s* for our
heterogeneity estimate to possess the asymptotic distribution as if the commonality
information were available, i.e., oracle rule. Our second contribution is to test
the heterogeneity among a large number of sub-populations by employing the recent
Gaussian approximation theory (Chernozhukov et al.,
2013). The above results directly apply to the divide-and-conquer
approach that deals with the sub-population with a huge sample size. In this case,
the “oracle estimate” is defined as those computed based on the entire
(homogeneous) data in those sub-populations. A rather different goal here is to
explore the most computationally efficient way to split the whole sample while
performing the best possible statistical inference. Specifically, we derive the
largest possible number of splits under which the averaged estimators for both
components enjoy the same statistical properties as the oracle estimators.

In both homogeneous and heterogeneous setting above, we note that the upper
bounds established for *s* increase with the smoothness of
*f*_{0}. Hence, our aggregation procedure favors smoother
regression functions in the sense that more sub-populations/splits are allowed in
the massive data. On the other hand, we have to admit that our upper and lower bound
results for *s* are only sufficient conditions although empirical
results show that our bounds are quite sharp. Another interesting finding is that even though the semi-nonparametric estimation is applied to only a fraction of the entire data, it is nonetheless essential to regularize each sub-estimation as if it had the entire sample.

In the end, we highlight two key technical challenges: (i) nontrivial
interaction between the parametric and nonparametric components in the
*semi-nonparametric estimation*. In particular, we observe a
“bias propagation” phenomenon: the bias introduced by the penalization
of the nonparametric component propagates to the parametric component, and the
resulting parametric bias in turn propagates back to the nonparametric component. To
analyze this complicated propagation mechanism, we extend the existing RKHS theory
to an enlarged partially linear function space by defining a novel inner product
under which the expectation of the Hessian of the objective function becomes
identity. (ii) double asymptotics framework in terms of diverging *s*
and *N*. In this challenging regime, we develop more refined concentration inequalities characterizing the concentration property of aggregated empirical processes. These delicate theoretical analyses show that an average of *s* asymptotic linear expansions is still a valid one as $s\wedge N\to \infty $.

The rest of the paper is organized as follows: Section 2 briefly introduces the general RKHS theory and discusses its extension to an enlarged partially linear function space. Section 3 describes our aggregation procedure, and studies the “oracle” property of this procedure from both asymptotic and non-asymptotic perspectives. The efficiency boosting of heterogeneity estimators and heterogeneity testing results are also presented in this section. Section 4 applies our general theory to various examples with different smoothness. Section 5 is devoted to the analysis of divide-and-conquer algorithms for homogeneous data. Section 6 presents some numerical experiments. All the technical details are deferred to Section 7 or the online supplementary material.

In this section, we briefly introduce the general RKHS theory, and then extend it to a partially linear function space. Below is a generic definition of RKHS (Berlinet and Thomas-Agnan, 2004):

Denote by $\mathcal{F}(\mathcal{S},\mathbb{R})$ a vector space of functions from a general set $\mathcal{S}$ to $\mathbb{R}$. We say that ${\mathcal{H}}^{\ast}$ is a reproducing kernel Hilbert space (RKHS) on $\mathcal{S}$, provided that:

- (i) ${\mathcal{H}}^{\ast}$ is a vector subspace of $\mathcal{F}(\mathcal{S},\mathbb{R})$;
- (ii) ${\mathcal{H}}^{\ast}$ is endowed with an inner product, denoted as ${\langle \cdot ,\cdot \rangle}_{{\mathcal{H}}^{\ast}}$, under which it becomes a Hilbert space;
- (iii) for every $y\in \mathcal{S}$, the linear evaluation functional defined by ${E}_{y}\left(f\right)=f\left(y\right)$ is bounded.

If ${\mathcal{H}}^{\ast}$ is an RKHS, then by the Riesz representation theorem, for every $y\in \mathcal{S}$ there exists a unique vector ${K}_{y}^{\ast}\in {\mathcal{H}}^{\ast}$ such that for every $f\in {\mathcal{H}}^{\ast},\phantom{\rule{thickmathspace}{0ex}}f\left(y\right)={\langle f,{K}_{y}^{\ast}\rangle}_{{\mathcal{H}}^{\ast}}$. The reproducing kernel for ${\mathcal{H}}^{\ast}$ is defined as ${K}^{\ast}(x,y)={K}_{y}^{\ast}\left(x\right)$.

Denote $U\u2254(\mathit{X},Z)\in \mathcal{X}\times \mathcal{Z}\subset {\mathbb{R}}^{p}\times \mathbb{R}$, and ${\mathbb{P}}_{U}$ as the distribution of *U* (${\mathbb{P}}_{X}$ and ${\mathbb{P}}_{Z}$ are defined similarly). According to Definition 2.1, if $\mathcal{S}=\mathcal{Z}$ and $\mathcal{F}(\mathcal{Z},\mathbb{R})={L}_{2}\left({\mathbb{P}}_{Z}\right)$, then we can define a RKHS $\mathcal{H}\subset {L}_{2}\left({\mathbb{P}}_{Z}\right)$ (endowed with a proper inner product ${\langle \cdot ,\cdot \rangle}_{\mathcal{H}}$), in which the true function *f*_{0} is
believed to lie. The corresponding kernel for $\mathcal{H}$ is denoted by *K* such that for any $z\in \mathcal{Z}$: $f\left(z\right)={\langle f,{K}_{z}\rangle}_{\mathcal{H}}$. By Mercer theorem, this kernel function has the following
unique eigen-decomposition:

$$K({z}_{1},{z}_{2})=\sum _{\ell =1}^{\infty}{\mu}_{\ell}{\varphi}_{\ell}\left({z}_{1}\right){\varphi}_{\ell}\left({z}_{2}\right),$$

where *μ*_{1} ≥
*μ*_{2} ≥ ... > 0 are
eigenvalues and ${\left\{{\varphi}_{\ell}\right\}}_{\ell =1}^{\infty}$ are an orthonormal basis in ${L}_{2}\left({\mathbb{P}}_{Z}\right)$. Let ${\left\{{\theta}_{\ell}\right\}}_{\ell =1}^{\infty}$ be the Fourier coefficient of *f*_{0}
under the basis ${\left\{{\varphi}_{\ell}\right\}}_{\ell =1}^{\infty}$ (given the ${L}_{2}\left({\mathbb{P}}_{Z}\right)$-inner product ${\langle \cdot ,\cdot \rangle}_{{L}_{2}\left({\mathbb{P}}_{Z}\right)}$). Mercer theorem together with the reproducing property
implies that ${\langle {\varphi}_{i},{\varphi}_{j}\rangle}_{\mathcal{H}}={\delta}_{ij}\u2215{\mu}_{i}$, where *δ _{ij}* is the
Kronecker's delta. The smoothness of the functions in RKHS can be characterized
by the decaying rate of ${\left\{{\mu}_{\ell}\right\}}_{\ell =1}^{\infty}$. Below, we present three different decaying rates together
with the corresponding kernel functions.

The kernel has finite rank *r* if ${\mu}_{\ell}=0$ for all $\ell >r$. For example, the linear kernel $K({\mathit{z}}_{1},{\mathit{z}}_{2})={\langle {\mathit{z}}_{1},{\mathit{z}}_{2}\rangle}_{{\mathbb{R}}^{d}}$ has rank *d*, and generates a
*d*-dimensional linear function space. The eigenfunctions
are given by ${\varphi}_{\ell}\left(\mathit{z}\right)={z}_{\ell}$ for $\ell =1,\dots ,d$. The polynomial kernel $K({z}_{1},{z}_{2})={(1+{z}_{1}{z}_{2})}^{d}$ has rank *d* + 1, and generates a space of polynomial functions with degree at most *d*. The eigenfunctions are given by ${\varphi}_{\ell}\left(z\right)={z}^{\ell -1}$ for $\ell =1,\dots ,d+1$.
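
The rank claim can be checked numerically: the Gram matrix of a rank-(*d* + 1) kernel has exactly *d* + 1 nonzero eigenvalues, however many evaluation points are used. A small sketch (the grid and numerical tolerance are implementation choices):

```python
import numpy as np

d = 3
poly = lambda z1, z2: (1.0 + z1 * z2) ** d        # polynomial kernel, rank d + 1
z = np.linspace(-1.0, 1.0, 50)
K = poly(z[:, None], z[None, :])                  # 50 x 50 Gram matrix

# Mercer: only d + 1 eigenvalues are nonzero, no matter how many points we use.
rank = np.linalg.matrix_rank(K, tol=1e-8)
```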

The kernel has eigenvalues that satisfy ${\mu}_{\ell}\asymp {c}_{1}\phantom{\rule{thickmathspace}{0ex}}\mathrm{exp}(-{c}_{2}{\ell}^{p})$ for some *c*_{1}, *c*_{2} > 0. An example is the Gaussian kernel
*K*(*z*_{1},
*z*_{2}) =
exp(–|*z*_{1} –
*z*_{2}|^{2}). The eigenfunctions are
given in Sollich and Williams (2005) as

$${\varphi}_{\ell}\left(x\right)={(\sqrt{5}\u22154)}^{1\u22154}{({2}^{\ell -2}(\ell -1)!)}^{-1\u22152}{e}^{-(\sqrt{5}-1){x}^{2}\u22154}{H}_{\ell -1}\left({(\sqrt{5}\u22152)}^{1\u22152}x\right),$$

(2.1)

for $\ell =1,2,\cdots $, where ${H}_{\ell}(\cdot )$ is the $\ell \text{-th}$ Hermite polynomial.
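
Formula (2.1) can be transcribed directly, e.g., with NumPy's physicists' Hermite polynomials; the transcription below is ours, and only the formula itself comes from the text.

```python
import numpy as np
from math import factorial, sqrt
from numpy.polynomial.hermite import hermval

def phi(ell, x):
    """Eigenfunction (2.1) of the Gaussian kernel, ell = 1, 2, ...
    H_{ell-1} is the (physicists') Hermite polynomial."""
    c = np.zeros(ell)
    c[ell - 1] = 1.0                                   # picks out H_{ell-1}
    norm = (sqrt(5.0) / 4.0) ** 0.25 / sqrt(2.0 ** (ell - 2) * factorial(ell - 1))
    return (norm * np.exp(-(sqrt(5.0) - 1.0) * x ** 2 / 4.0)
            * hermval(sqrt(sqrt(5.0) / 2.0) * x, c))
```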

The kernel has eigenvalues that satisfy ${\mu}_{\ell}\asymp {\ell}^{-2\nu}$ for some *ν* ≥ 1/2. Examples include the kernels underlying Sobolev spaces and Besov spaces (Birman and Solomjak, 1967). In
particular, the eigenfunctions of a *ν*-th order
periodic Sobolev space are trigonometric functions as specified in Section
4.3. The corresponding Sobolev kernels are given in Gu (2013).

In this paper, we consider the following penalized estimation:

$$({\widehat{\beta}}^{\u2020},{\widehat{f}}^{\u2020})=\underset{(\beta ,f)\in \mathcal{A}}{\mathrm{argmin}}\left\{\frac{1}{n}\sum _{i=1}^{n}{({Y}_{i}-{\mathit{X}}_{i}^{T}\beta -f\left({Z}_{i}\right))}^{2}+\lambda {\Vert f\Vert}_{\mathcal{H}}^{2}\right\},$$

(2.2)

where *λ* > 0 is a
regularization parameter and $\mathcal{A}$ is defined as the parameter space ${\mathbb{R}}^{p}\times \mathcal{H}$. For simplicity, we do not distinguish $m=(\beta ,f)\in \mathcal{A}$ from its associated function $m\in \mathcal{M}\u2254\{{\mathit{x}}^{T}\beta +f\left(z\right):(\beta ,f)\in \mathcal{A}\phantom{\rule{1em}{0ex}}\text{and}\phantom{\rule{1em}{0ex}}(\mathit{x},z)\in \mathcal{X}\times \mathcal{Z}\}$ throughout the paper. We call $({\widehat{\beta}}^{\u2020},{\widehat{f}}^{\u2020})$ the partially linear kernel ridge regression (KRR) estimate
in comparison with the nonparametric KRR estimate in Shawe-Taylor and Cristianini (2004). In particular,
when $\mathcal{H}$ is a *ν*-th order Sobolev space
endowed with ${\langle f,\stackrel{~}{f}\rangle}_{\mathcal{H}}\u2254{\int}_{\mathcal{Z}}{f}^{\left(\nu \right)}\left(z\right){\stackrel{~}{f}}^{\left(\nu \right)}\left(z\right)dz$, $({\widehat{\beta}}^{\u2020},{\widehat{f}}^{\u2020})$ becomes the commonly used partial smoothing spline
estimate.
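
One way to compute (2.2) in practice is via the representer theorem: writing $f=\sum_i \alpha_i K(Z_i,\cdot)$ and profiling out *f* yields the closed form below. This finite-dimensional reduction is a standard derivation, not spelled out in the text; the kernel, bandwidth, and tuning values in the usage are illustrative.

```python
import numpy as np

def plkrr(Y, X, Z, kernel, lam):
    """Partially linear kernel ridge regression, i.e., problem (2.2).

    Sketch via the representer theorem: f = sum_i alpha_i K(Z_i, .). Profiling
    out f leaves a generalized least-squares problem for beta with the
    annihilator I - S, where S = K (K + n*lam*I)^{-1} is the KRR smoother.
    """
    n = len(Y)
    K = kernel(Z[:, None], Z[None, :])
    S = K @ np.linalg.inv(K + n * lam * np.eye(n))
    A = np.eye(n) - S
    beta = np.linalg.solve(X.T @ A @ X, X.T @ A @ Y)
    alpha = np.linalg.solve(K + n * lam * np.eye(n), Y - X @ beta)
    f_hat = lambda z: kernel(z[:, None], Z[None, :]) @ alpha   # evaluate f at new z
    return beta, f_hat
```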

We next illustrate that $\mathcal{A}$ can be viewed as a partially linear extension of $\mathcal{H}$ in the sense that it shares some nice reproducing properties as this RKHS $\mathcal{H}$ under the following inner product:

$${\langle m,\stackrel{~}{m}\rangle}_{\mathcal{A}}={\langle {\mathit{X}}^{T}\beta +f\left(Z\right),{\mathit{X}}^{T}\stackrel{~}{\beta}+\stackrel{~}{f}\left(Z\right)\rangle}_{{L}_{2}\left({\mathbb{P}}_{U}\right)}+\lambda {\langle f,\stackrel{~}{f}\rangle}_{\mathcal{H}},$$

(2.3)

where $m=(\beta ,f)\in \mathcal{A}$ and $\stackrel{~}{m}=(\stackrel{~}{\beta},\stackrel{~}{f})\in \mathcal{A}$. Similar to the kernel function *K _{z}*, we can construct a linear operator ${R}_{u}(\cdot )\in \mathcal{A}$ such that ${\langle {R}_{u},m\rangle}_{\mathcal{A}}=m\left(u\right)$ for any $u\in \mathcal{X}\times \mathcal{Z}$. Also, construct another linear operator ${P}_{\lambda}:\mathcal{A}\mapsto \mathcal{A}$ such that ${\langle {P}_{\lambda}m,\stackrel{~}{m}\rangle}_{\mathcal{A}}=\lambda {\langle f,\stackrel{~}{f}\rangle}_{\mathcal{H}}$ for any $m,\stackrel{~}{m}\in \mathcal{A}$.

We next present a proposition illustrating the rationale behind the definition of ${\langle \cdot ,\cdot \rangle}_{\mathcal{A}}$. Denote $\otimes $ as the outer product on $\mathcal{A}$. Hence, ${\mathbb{E}}_{U}[{R}_{U}\otimes {R}_{U}]+{P}_{\lambda}$ is an operator from $\mathcal{A}$ to $\mathcal{A}$.

${\mathbb{E}}_{U}[{R}_{U}\otimes {R}_{U}]+{P}_{\lambda}=id$, where *id* is an identity operator on $\mathcal{A}$.

For any $m=(\beta ,f)\in \mathcal{A}$ and $\stackrel{~}{m}=(\stackrel{~}{\beta},\stackrel{~}{f})\in \mathcal{A}$, we have

$$\begin{array}{cc}\hfill {\langle ({\mathbb{E}}_{U}[{R}_{U}\otimes {R}_{U}]+{P}_{\lambda})m,\stackrel{~}{m}\rangle}_{\mathcal{A}}=& {\langle {\mathbb{E}}_{U}[{R}_{U}\otimes {R}_{U}]m,\stackrel{~}{m}\rangle}_{\mathcal{A}}+{\langle {P}_{\lambda}m,\stackrel{~}{m}\rangle}_{\mathcal{A}}\hfill \\ \hfill =& {\mathbb{E}}_{U}\left[m\left(U\right)\stackrel{~}{m}\left(U\right)\right]+\lambda {\langle f,\stackrel{~}{f}\rangle}_{\mathcal{H}}\hfill \\ \hfill =& {\langle m,\stackrel{~}{m}\rangle}_{\mathcal{A}}.\hfill \end{array}$$

Since the choice of $(m,\stackrel{~}{m})$ is arbitrary, we conclude our proof.

As will be seen in the subsequent analysis, e.g., in Theorem 3.3, the operator $\mathbb{E}[{R}_{U}\otimes {R}_{U}]+{P}_{\lambda}$ is essentially the expectation of the Hessian (w.r.t. the Fréchet derivative) of the objective function minimized in (2.2). Proposition 2.2 shows that inverting this Hessian operator is trivial when the inner product is designed as in (2.3). Consequently, the theoretical analysis of $\widehat{m}=(\widehat{\beta},\widehat{f})$ based on the first-order optimality condition becomes much more transparent.
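
Proposition 2.2 can be made concrete in a finite-dimensional special case. The setup below is ours, not from the paper: take *p* = 1 and the rank-one RKHS of linear functions *f*(*z*) = *θz* (so ‖*f*‖ in $\mathcal{H}$ equals |*θ*|). Then $\mathcal{A}\cong {\mathbb{R}}^{2}$, the $\mathcal{A}$-inner product has Gram matrix *G* = *M* + *λD*, and the coordinate representations of ${\mathbb{E}}_{U}[{R}_{U}\otimes {R}_{U}]$ and ${P}_{\lambda}$ are ${G}^{-1}M$ and $\lambda {G}^{-1}D$, which sum to the identity exactly because of how ${\langle \cdot ,\cdot \rangle}_{\mathcal{A}}$ is built.

```python
import numpy as np

# Illustrative finite-dimensional setup (not from the paper): p = 1 and the
# rank-one RKHS of linear functions f(z) = theta * z with ||f||_H = |theta|,
# so m = (beta, theta) has coordinates in R^2 and m(u) = x*beta + z*theta.
rng = np.random.default_rng(1)
lam = 0.3
U = rng.normal(size=(50000, 2))            # columns (X, Z); any joint law works
M = U.T @ U / len(U)                       # Monte Carlo estimate of E[(X,Z)(X,Z)^T]
D = np.diag([0.0, 1.0])                    # the penalty only touches theta

G = M + lam * D                            # Gram matrix of <.,.>_A in this basis
E_RR = np.linalg.inv(G) @ M                # coordinates of E_U[R_U (outer) R_U]
P_lam = lam * np.linalg.inv(G) @ D         # coordinates of P_lambda
identity = E_RR + P_lam                    # = G^{-1}(M + lam*D) = I, as claimed
```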

To facilitate the construction of *R _{u}* and *P _{λ}*, we endow $\mathcal{H}$ with the following inner product:
$${\langle f,\stackrel{~}{f}\rangle}_{\mathcal{C}}={\langle f,\stackrel{~}{f}\rangle}_{{L}_{2}\left({\mathbb{P}}_{Z}\right)}+\lambda {\langle f,\stackrel{~}{f}\rangle}_{\mathcal{H}},$$

(2.4)

for any $f,\stackrel{~}{f}\in \mathcal{H}$. Under (2.4), $\mathcal{H}$ is still a RKHS as the evaluation functional is bounded by Lemma A.1. We denote the new kernel function as $\stackrel{~}{K}(\cdot ,\cdot )$, and define a positive definite self-adjoint operator ${W}_{\lambda}:\mathcal{H}\mapsto \mathcal{H}$:

$${\langle {W}_{\lambda}f,\stackrel{~}{f}\rangle}_{\mathcal{C}}=\lambda {\langle f,\stackrel{~}{f}\rangle}_{\mathcal{H}}\phantom{\rule{1em}{0ex}}\text{for any}\phantom{\rule{1em}{0ex}}f,\stackrel{~}{f}\in \mathcal{H},$$

(2.5)

whose existence is proven in Lemma A.2. Since ${\Vert f\Vert}_{{L}_{2}\left({\mathbb{P}}_{Z}\right)}^{2}={\sum}_{i=1}^{\infty}{\theta}_{i}^{2}$ and ${\Vert f\Vert}_{\mathcal{H}}^{2}={\sum}_{i=1}^{\infty}{\theta}_{i}^{2}\u2215{\mu}_{i}$, we now have ${\Vert f\Vert}_{\mathcal{C}}^{2}={\sum}_{i=1}^{\infty}{\theta}_{i}^{2}(1+\lambda \u2215{\mu}_{i})$ by (2.4).
We next define two crucial quantities needed in the construction: ${B}_{k}\u2254\mathbb{E}[{X}_{k}\mid Z]$ and its Riesz representer ${A}_{k}\in \mathcal{H}$ satisfying ${\langle {A}_{k},f\rangle}_{\mathcal{C}}={\langle {B}_{k},f\rangle}_{{L}_{2}\left({\mathbb{P}}_{Z}\right)}$ for all $f\in \mathcal{H}$. Here, we implicitly assume *B _{k}* is square integrable. The existence of *A _{k}* follows from the Riesz representation theorem, since the linear functional ${\mathcal{B}}_{k}f\u2254{\langle {B}_{k},f\rangle}_{{L}_{2}\left({\mathbb{P}}_{Z}\right)}$ is bounded:

$$\mid {\mathcal{B}}_{k}f\mid =\mid {\langle {B}_{k},f\rangle}_{{L}_{2}\left({\mathbb{P}}_{Z}\right)}\mid \le {\Vert {B}_{k}\Vert}_{{L}_{2}\left({\mathbb{P}}_{Z}\right)}{\Vert f\Vert}_{{L}_{2}\left({\mathbb{P}}_{Z}\right)}\le {\Vert {B}_{k}\Vert}_{{L}_{2}\left({\mathbb{P}}_{Z}\right)}{\Vert f\Vert}_{\mathcal{C}}.$$

We are now ready to construct *R _{u}* and *P _{λ}*.

For any *u* = (*x*, *z*), *R _{u}* can be expressed as ${R}_{u}=({L}_{u},{N}_{u})\in \mathcal{A}$, where

$${L}_{u}={(\Omega +{\Sigma}_{\lambda})}^{-1}(\mathit{x}-\mathit{A}\left(z\right))\phantom{\rule{thickmathspace}{0ex}}\text{and}\phantom{\rule{thickmathspace}{0ex}}{N}_{u}={\stackrel{~}{K}}_{z}-{\mathit{A}}^{T}{L}_{u},$$

Moreover, for any $m=(\beta ,f)\in \mathcal{A}$, *P _{λ}m* can be expressed as ${P}_{\lambda}m=({L}_{\lambda}f,{N}_{\lambda}f)\in \mathcal{A}$, where

$${L}_{\lambda}f=-{(\Omega +{\Sigma}_{\lambda})}^{-1}{\langle \mathit{B},{W}_{\lambda}f\rangle}_{{L}_{2}\left({\mathbb{P}}_{Z}\right)}\phantom{\rule{thickmathspace}{0ex}}\text{and}\phantom{\rule{thickmathspace}{0ex}}{N}_{\lambda}f={W}_{\lambda}f-{\mathit{A}}^{T}{L}_{\lambda}f.$$

Denote ${\Vert \cdot \Vert}_{2}$ and ${\Vert \cdot \Vert}_{\infty}$ as the Euclidean *L*_{2} and infinity norms in ${\mathbb{R}}^{p}$, respectively. For any function $f:\mathcal{Z}\mapsto \mathbb{R}$, let ${\Vert f\Vert}_{\mathrm{sup}}={\mathrm{sup}}_{z\in \mathcal{Z}}\mid f\left(z\right)\mid $. We use ${\Vert \cdot \Vert}$ to denote the spectral norm of matrices. For positive sequences *a _{n}* and *b _{n}*, we write ${a}_{n}\lesssim {b}_{n}$ if ${a}_{n}\u2215{b}_{n}$ is bounded. Define the entropy integral of a function class $\mathcal{F}$ as

$$\omega (\mathcal{F},\delta )={\int}_{0}^{\delta}\sqrt{\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}\mathcal{N}(\mathcal{F},\Vert \cdot \Vert {}_{\mathrm{sup}},\u03f5)}d\u03f5,$$

where $\mathcal{N}(\mathcal{F},{\Vert \cdot \Vert}_{\mathrm{sup}},\u03f5)$ is an *ϵ*-covering number of $\mathcal{F}$ w.r.t. the supremum norm. Define the following sets of functions: ${\mathcal{F}}_{1}=\{f\left(\mathit{x}\right)={\mathit{x}}^{T}\beta :{\Vert f\Vert}_{\mathrm{sup}}\le 1\phantom{\rule{thickmathspace}{0ex}}\text{for all}\phantom{\rule{thickmathspace}{0ex}}\mathit{x}\in \mathcal{X},\beta \in {\mathbb{R}}^{p}\}$, ${\mathcal{F}}_{2}=\{f\in \mathcal{H}:{\Vert f\Vert}_{\mathrm{sup}}\le 1,{\Vert f\Vert}_{\mathcal{H}}\le {h}^{1\u22152}{\lambda}^{-1\u22152}\}$, $\mathcal{F}\u2254\{f={f}_{1}+{f}_{2}:{f}_{1}\in {\mathcal{F}}_{1},{f}_{2}\in {\mathcal{F}}_{2},{\Vert f\Vert}_{\mathrm{sup}}\le 1\u22152\}$.

In this section, we start by describing our aggregation procedure and model assumptions in Section 3.1. The main theoretical results are presented in Sections 3.2–3.4, showing that our combined estimate for commonality enjoys the “oracle property”. To be more specific, we show that it possesses the same (non-asymptotic) minimax optimal bound (in terms of mean-squared error) and asymptotic distribution as the “oracle estimate” ${\widehat{f}}_{or}$ computed when all the heterogeneity information is available:

$${\widehat{f}}_{or}=\underset{f\in \mathcal{H}}{\mathrm{argmin}}\left\{\frac{1}{N}\sum _{j=1}^{s}\sum _{i\in {L}_{j}}{({Y}_{i}-{\left({\beta}_{0}^{\left(j\right)}\right)}^{T}{\mathit{X}}_{i}-f\left({Z}_{i}\right))}^{2}+\lambda {\Vert f\Vert}_{\mathcal{H}}^{2}\right\}.$$

(3.1)

The above nice properties hold when the number of sub-populations
does not grow too fast and the smoothing parameter is chosen according to the entire
sample size *N*. Based on this combined estimator, we further
construct a plug-in estimator for each heterogeneity parameter ${\beta}_{0}^{\left(j\right)}$, which possesses the asymptotic distribution as if the commonality
were known, in Section 3.5. Interestingly, this oracular result holds when the number of sub-populations is not too small. In the end, Section 3.6 tests the
possible heterogeneity among a large number of sub-populations.

The heterogeneous data setup and averaging procedure are described below:

1. Observe data (*X _{i}, Z _{i}, Y _{i}*) with the known label of the sub-population each sample belongs to.

2. On the *j*-th sub-population, obtain the following
penalized estimator:

$$({\widehat{\beta}}_{n,\lambda}^{\left(j\right)},{\widehat{f}}_{n,\lambda}^{\left(j\right)})=\underset{(\beta ,f)\in \mathcal{A}}{\mathrm{argmin}}\left\{\frac{1}{n}\sum _{i\in {L}_{j}}{({\mathit{X}}_{i}^{T}\beta +f\left({Z}_{i}\right)-{Y}_{i})}^{2}+\lambda {\Vert f\Vert}_{\mathcal{H}}^{2}\right\}.$$

(3.2)

3. Obtain the final nonparametric estimate^{1} for commonality by averaging:

$${\stackrel{\u2012}{f}}_{N,\lambda}=\frac{1}{s}\sum _{j=1}^{s}{\widehat{f}}_{n,\lambda}^{\left(j\right)}.$$

(3.3)

We point out that ${\widehat{\beta}}_{n,\lambda}^{\left(j\right)}$ is not our final estimate for heterogeneity. In fact, it can be further improved based on ${\stackrel{\u2012}{f}}_{N,\lambda}$; see Section 3.5.

For simplicity, we will drop the subscripts (*n, λ*) and (*N, λ*) in the notation defined in (3.2) and (3.3) throughout the rest of this paper. The main assumptions of this section are stated below.

(i) *ε _{i}*'s are i.i.d. sub-Gaussian random variables independent of the designs; (ii) ${B}_{k}\in {L}_{2}\left({\mathbb{P}}_{Z}\right)$ for all *k* = 1, …, *p*; (iii) Ω is positive definite.

Conditions in Assumption 3.1 are fairly standard in the literature.
For example, the positive definiteness of Ω is needed for obtaining
semiparametric efficient estimation; see Mammen and van de Geer (1997). Note that we do not require the
independence between *X* and *Z*.

We assume that there exist 0 < *c _{ϕ}* < ∞ and 0 < *c _{x}* < ∞ such that ${\mathrm{sup}}_{\ell \ge 1}{\Vert {\varphi}_{\ell}\Vert}_{\mathrm{sup}}\le {c}_{\varphi}$ and ${\Vert \mathit{X}\Vert}_{\infty}\le {c}_{x}$ almost surely.

Assumption 3.2 is commonly assumed in kernel ridge regression
literature (Zhang et al., 2013; Lafferty and Lebanon, 2005; Guo, 2002). In the case of finite rank
kernel, e.g., linear and polynomial kernels, the eigenfunctions are
uniformly bounded as long as $\mathcal{Z}$ has finite support.
As for the exponentially decaying kernels such as Gaussian kernel, we prove
in Section 4.2 that the eigenfunctions given in (2.1) are uniformly bounded by
1.336. Lastly, for the polynomially decaying kernels, Proposition 2.2 in
Shang and Cheng (2013) showed
that the eigenfunctions induced from a *ν*-th order
Sobolev space (under a proper inner product ${\langle \cdot ,\cdot \rangle}_{\mathcal{H}}$) are uniformly bounded under mild smoothness conditions
for the density of *Z*.

For each $k=1,\dots ,p,\phantom{\rule{thickmathspace}{0ex}}{B}_{k}(\cdot )\in \mathcal{H}$. This is equivalent to

$$\sum _{\ell =1}^{\infty}{\mu}_{\ell}^{-1}{\langle {B}_{k},{\varphi}_{\ell}\rangle}_{{L}_{2}\left({\mathbb{P}}_{Z}\right)}^{2}<\infty .$$

Assumption 3.3 requires the conditional expectation of *X _{k}* given *Z*, i.e., *B _{k}*, to be smooth enough to lie in $\mathcal{H}$.

The primary goal of this section is to evaluate the estimation quality
of the combined estimate from a *non-asymptotic* point of view.
Specifically, we derive a finite sample upper bound for the mean-squared error $\mathrm{MSE}\left(\stackrel{\u2012}{f}\right)\u2254\mathbb{E}\left[{\Vert \stackrel{\u2012}{f}-{f}_{0}\Vert}_{{L}_{2}\left({\mathbb{P}}_{Z}\right)}^{2}\right]$. When *s* does not grow too fast, we show that
$\mathrm{MSE}\left(\stackrel{\u2012}{f}\right)$ is of the order $O({\left(Nh\right)}^{-1}+\lambda )$, from which the aggregation effect on *f* can be clearly seen. If *λ* is chosen in the order of ${\left(Nh\right)}^{-1}$, the mean-squared error attains the (un-improvable) minimax optimal rate. As a by-product, we establish a *non-asymptotic* upper bound for the mean-squared error of ${\widehat{\beta}}^{\left(j\right)}$, i.e., $\mathrm{MSE}\left({\widehat{\beta}}^{\left(j\right)}\right)\u2254\mathbb{E}\left[{\Vert {\widehat{\beta}}^{\left(j\right)}-{\beta}_{0}^{\left(j\right)}\Vert}_{2}^{2}\right]$. The results in this section together with Theorem 3.4 in Section 3.4 determine an upper bound of *s* under which $\stackrel{\u2012}{f}$ enjoys the same statistical properties (minimax optimality and asymptotic distribution) as the oracle estimate ${\widehat{f}}_{or}$.

Define *τ*_{min}(Ω) as the minimum
eigenvalue of Ω and $\mathrm{Tr}\left(K\right)\u2254{\sum}_{\ell =1}^{\infty}{\mu}_{\ell}$ as the trace of *K*. Moreover, let ${C}_{1}^{\prime}=2{\tau}_{\mathrm{min}}^{-2}\left(\Omega \right)({c}_{x}^{2}p+{c}_{\varphi}^{2}\phantom{\rule{thickmathspace}{0ex}}\mathrm{Tr}\left(K\right){\sum}_{k=1}^{p}{\Vert {B}_{k}\Vert}_{\mathcal{H}}^{2})$, ${C}_{1}=2{c}_{\varphi}^{2}(1+{C}_{1}^{\prime}{\sum}_{k=1}^{p}{\Vert {B}_{k}\Vert}_{{L}_{2}\left({\mathbb{P}}_{Z}\right)}^{2})$, ${C}_{2}^{\prime}={\tau}_{\mathrm{min}}^{-2}\left(\Omega \right){\Vert {f}_{0}\Vert}_{\mathcal{H}}^{2}{\sum}_{k=1}^{p}{\Vert {B}_{k}\Vert}_{\mathcal{H}}^{2}$ and ${C}_{2}=2{C}_{2}^{\prime}{\sum}_{k=1}^{p}{\Vert {B}_{k}\Vert}_{{L}_{2}\left({\mathbb{P}}_{Z}\right)}^{2}$.

Suppose that Assumptions 3.1–3.3 hold. If $s=o\left(N{h}^{2}{(\omega (\mathcal{F},1)+\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}N)}^{-2}\right)$, then we have

$$\mathrm{MSE}\left(\stackrel{\u2012}{f}\right)\le {C}_{1}{\sigma}^{2}{\left(Nh\right)}^{-1}+2{\Vert {f}_{0}\Vert}_{\mathcal{H}}^{2}\lambda +{C}_{2}{\lambda}^{2}+{s}^{-1}a(n,s,h,\lambda ,\omega ),$$

(3.4)

where $a(n,s,h,\lambda ,\omega )=C{h}^{-1}{n}^{-1}{r}_{n,s}^{2}(\omega {(\mathcal{F},1)}^{2}+1)+C{h}^{-2}{\lambda}^{-1}n\phantom{\rule{thinmathspace}{0ex}}\mathrm{exp}(-c\phantom{\rule{thinmathspace}{0ex}}{\mathrm{log}}^{2}N)$, *r _{n,s}* =
(

Typically, we require an upper bound for *s* so that the
third term in the R.H.S. of (3.4) can be dominated by the first two terms, which correspond to
variance and bias, respectively. Hence, we choose $\lambda \asymp {\left(Nh\right)}^{-1}$ to attain the optimal *bias-variance
trade-off*. The resulting rate coincides with the minimax optimal rate
of the oracle estimate in different RKHS; see Section 4. This can be viewed as a
non-asymptotic version of the “oracle property” of $\stackrel{\u2012}{f}$. In comparison with the nonparametric KRR result
in Zhang et al. (2013), we realize that
adding one parametric component does not affect the finite sample upper bound
(3.4).
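
To see the bias-variance trade-off concretely, suppose (as for the *ν*-th order Sobolev kernels discussed in Section 4; this scaling is our assumption here) that $h\asymp {\lambda}^{1\u2215\left(2\nu \right)}$. Then the first two terms of (3.4) behave like ${(N{\lambda}^{1\u2215\left(2\nu \right)})}^{-1}+\lambda $, and minimizing over *λ* recovers ${\lambda}^{\ast}\asymp {N}^{-2\nu \u2215(2\nu +1)}$ together with the familiar minimax rate ${N}^{-2\nu \u2215(2\nu +1)}$. A quick numerical check of this calculation:

```python
import numpy as np

# First two terms of (3.4): variance ~ (N h)^{-1} and bias ~ lambda.  Assuming
# h ~ lambda^{1/(2 nu)} (as for nu-th order Sobolev kernels, cf. Section 4),
# the bound behaves like (N lambda^{1/(2 nu)})^{-1} + lambda.
nu, N = 2.0, 10 ** 6
lambdas = np.logspace(-8, -1, 400)
bound = 1.0 / (N * lambdas ** (1.0 / (2.0 * nu))) + lambdas
lam_star = lambdas[np.argmin(bound)]              # grid minimizer of the bound

predicted = N ** (-2.0 * nu / (2.0 * nu + 1.0))   # theoretical lambda* ~ N^{-4/5}
```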

As a by-product, we obtain a *non-asymptotic* upper bound
for $\mathrm{MSE}\left({\widehat{\beta}}^{\left(j\right)}\right)$. This result is new, and also of independent interest.

Suppose that Assumptions 3.1 – 3.3 hold. Then we have

$$\mathrm{MSE}\left({\widehat{\beta}}^{\left(j\right)}\right)\le {C}_{1}^{\prime}{\sigma}^{2}{n}^{-1}+{C}_{2}^{\prime}{\lambda}^{2}+a(n,s,h,\lambda ,\omega ),$$

(3.5)

where *a*(*n, s, h, λ,
ω*) is defined in Theorem 3.1.

Again, the first term and second term in the R.H.S. of (3.5) correspond to the
variance and bias, respectively. In particular, the second term comes from
the bias propagation effect to be discussed in Section 3.4. By choosing
*λ* = *o*(*n*^{−1/2}), we can obtain the optimal rate of $\mathrm{MSE}\left({\widehat{\beta}}^{\left(j\right)}\right)$, i.e., *O*(*n*^{−1}), but may lose the minimax optimality of $\mathrm{MSE}\left(\stackrel{\u2012}{f}\right)$ in most cases.

In this section, we derive a preliminary result on the joint limit distribution of $({\widehat{\beta}}^{\left(j\right)},\stackrel{\u2012}{f}\left({z}_{0}\right))$ at any ${z}_{0}\in \mathcal{Z}$. A key issue with this result is that the centering is not at the true values. However, we still choose to present it here since we will observe an interesting phenomenon when removing the bias in Section 3.4.

Suppose that Assumptions 3.1 and 3.2 hold, and that as $N\to \infty ,h{\Vert {\stackrel{~}{K}}_{{z}_{0}}\Vert}_{{L}_{2}\left({\mathbb{P}}_{Z}\right)}^{2}\to {\sigma}_{{z}_{0}}^{2}$, ${h}^{1\u22152}\left({W}_{\lambda}\mathit{A}\right)\left({z}_{0}\right)\to {\alpha}_{{z}_{0}}\in {\mathbb{R}}^{p}$, and ${h}^{1\u22152}\mathit{A}\left({z}_{0}\right)\to -{\gamma}_{{z}_{0}}\in {\mathbb{R}}^{p}$. Suppose the following conditions are satisfied

$$s=o\left(N{h}^{2}{(\omega (\mathcal{F},1)+\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}N)}^{-2}\right),$$

(3.6)

$$s{\left(Nh\right)}^{-1}\phantom{\rule{thinmathspace}{0ex}}{\mathrm{log}}^{4}\phantom{\rule{thinmathspace}{0ex}}N+\lambda =o\left({h}^{2}{(\omega (\mathcal{F},1)+\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}N)}^{-2}\phantom{\rule{thinmathspace}{0ex}}{\mathrm{log}}^{-2}\left(N\right)\right).$$

(3.7)

Denote $({\beta}_{0}^{\left(j\right)\ast},{f}_{0}^{\ast})$ as $(id-{P}_{\lambda}){m}_{0}^{\left(j\right)}$, where ${m}_{0}^{\left(j\right)}=({\beta}_{0}^{\left(j\right)},{f}_{0})$. We have for any ${z}_{0}\in \mathcal{Z}$ and *j* = 1,...,*s*,

- (i) if *s* → ∞, then

$$\left(\begin{array}{c}\hfill \sqrt{n}({\widehat{\beta}}^{\left(j\right)}-{\beta}_{0}^{\left(j\right)\ast})\hfill \\ \hfill \sqrt{Nh}(\stackrel{\u2012}{f}\left({z}_{0}\right)-{f}_{0}^{\ast}\left({z}_{0}\right))\hfill \end{array}\right)\u21ddN\left(0,{\sigma}^{2}\left(\begin{array}{cc}\hfill {\Omega}^{-1}\hfill & \hfill 0\hfill \\ \hfill 0\hfill & \hfill {\Sigma}_{22}\hfill \end{array}\right)\right),$$

(3.8)

where ${\Sigma}_{22}={\sigma}_{{z}_{0}}^{2}+2{\gamma}_{{z}_{0}}^{T}{\Omega}^{-1}{\alpha}_{{z}_{0}}+{\gamma}_{{z}_{0}}^{T}{\Omega}^{-1}{\gamma}_{{z}_{0}}$;

- (ii) if *s* is fixed, then

$$\left(\begin{array}{c}\hfill \sqrt{n}({\widehat{\beta}}^{\left(j\right)}-{\beta}_{0}^{\left(j\right)\ast})\hfill \\ \hfill \sqrt{Nh}(\stackrel{\u2012}{f}\left({z}_{0}\right)-{f}_{0}^{\ast}\left({z}_{0}\right))\hfill \end{array}\right)\u21ddN\left(0,{\sigma}^{2}\left(\begin{array}{cc}\hfill {\Omega}^{-1}\hfill & \hfill {s}^{-1\u22152}{\Sigma}_{12}\hfill \\ \hfill {s}^{-1\u22152}{\Sigma}_{21}\hfill & \hfill {\Sigma}_{22}\hfill \end{array}\right)\right),$$

(3.9)

where ${\Sigma}_{21}={\Sigma}_{12}^{T}$ and ${\Sigma}_{12}={\Omega}^{-1}({\alpha}_{{z}_{0}}+{\gamma}_{{z}_{0}})$.

Part (i) of Theorem 3.3 says that $\sqrt{n}{\widehat{\beta}}^{\left(j\right)}$ and $\sqrt{Nh}\stackrel{\u2012}{f}\left({z}_{0}\right)$ are asymptotically independent as *s* → ∞. This is not surprising: only the samples in one sub-population (of size *n*) contribute to the estimation of the heterogeneity component, while the entire sample (of size *N*) contributes to the commonality. As *n/N* = *s*^{−1} → 0, the former data become asymptotically ignorable relative to the latter, and so do the two estimators. The estimation bias ${P}_{\lambda}{m}_{0}^{\left(j\right)}$ can be removed by placing a smoothness condition on *B _{k}*, i.e., Assumption 3.3. Interestingly, given this additional condition, the bias can be removed even when *s* diverges; see Theorem 3.4.

In this section, we first analyze the source of the estimation bias observed in the joint asymptotics of Theorem 3.3. This analysis reveals a bias propagation phenomenon, which intuitively explains how Assumption 3.3 removes the estimation bias. More importantly, we show that $\stackrel{\u2012}{f}$ shares exactly the same asymptotic distribution as the oracle estimate ${\widehat{f}}_{or}$, i.e., the oracle rule, when *s* does not grow too fast.

Our study of the propagation mechanism is motivated by the following simple observation. Denote by $\mathbb{X}\in {\mathbb{R}}^{n\times p}$ and $\mathbb{Z}\in {\mathbb{R}}^{n}$ the designs based on the samples from the *j*th sub-population, and let ${\epsilon}^{\left(j\right)}={\left[{\u03f5}_{i}\right]}_{i\in {L}_{j}}\in {\mathbb{R}}^{n}$. The first-order optimality condition (w.r.t. *β*) gives

$${\widehat{\beta}}^{\left(j\right)}-{\beta}_{0}^{\left(j\right)}={\left({\mathbb{X}}^{T}\mathbb{X}\right)}^{-1}{\mathbb{X}}^{T}{\epsilon}^{\left(j\right)}-{\left({\mathbb{X}}^{T}\mathbb{X}\right)}^{-1}{\mathbb{X}}^{T}({\widehat{f}}^{\left(j\right)}\left(\mathbb{Z}\right)-{f}_{0}\left(\mathbb{Z}\right)),$$

(3.10)

where ${f}_{0}\left(\mathbb{Z}\right)$ is an *n*-dimensional vector with entries *f*_{0}(*Z _{i}*) for $i\in {L}_{j}$. Define the parametric and nonparametric biases of ${f}_{0}$ as
$$\text{parametric bias}:{L}_{\lambda}{f}_{0}=-{(\Omega +{\Sigma}_{\lambda})}^{-1}{\langle \mathit{B},{W}_{\lambda}{f}_{0}\rangle}_{{L}_{2}\left({\mathbb{P}}_{Z}\right)},$$

(3.11)

$$\text{nonparametric bias}:{N}_{\lambda}{f}_{0}={W}_{\lambda}{f}_{0}-{\mathit{A}}^{T}{L}_{\lambda}{f}_{0}$$

(3.12)

according to Proposition 2.3. The first term in (3.12) explains the bias introduced by penalization; see (2.5). This bias propagates to the parametric component through *B*, as illustrated in (3.11). The parametric bias ${L}_{\lambda}{f}_{0}$ vanishes once the smoothness condition on *B _{k}* (Assumption 3.3) is imposed.

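The identity (3.10) holds exactly for any fixed nonparametric estimate, which can be checked by a quick simulation. The sketch below is illustrative only: the toy data-generating process and the stand-in for ${\widehat{f}}^{\left(j\right)}$ are assumptions, not the paper's construction.

```python
import numpy as np

# Numerical check of the first-order optimality identity (3.10):
# beta_hat - beta_0 = (X^T X)^{-1} X^T eps - (X^T X)^{-1} X^T (f_hat(Z) - f_0(Z)),
# where beta_hat minimizes the least squares criterion with f_hat held fixed.
rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
Z = rng.uniform(size=n)
beta0 = np.array([1.0, -2.0, 0.5])
f0_Z = np.sin(2 * np.pi * Z)              # true nonparametric component f_0(Z)
eps = rng.normal(scale=0.1, size=n)
Y = X @ beta0 + f0_Z + eps

f_hat_Z = np.cos(2 * np.pi * Z)           # an arbitrary fixed estimate f_hat(Z)
beta_hat, *_ = np.linalg.lstsq(X, Y - f_hat_Z, rcond=None)

XtX_inv_Xt = np.linalg.solve(X.T @ X, X.T)
rhs = XtX_inv_Xt @ eps - XtX_inv_Xt @ (f_hat_Z - f0_Z)
assert np.allclose(beta_hat - beta0, rhs)  # identity (3.10) holds exactly
```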
We summarize the above discussions in the following theorem:

Suppose Assumption 3.3 and the conditions in Theorem 3.3 hold. If we choose $\lambda =o({\left(Nh\right)}^{-1\u22152}\wedge {n}^{-1\u22152})$, then

- (i) if *s* → ∞, then

$$\left(\begin{array}{c}\hfill \sqrt{n}({\widehat{\beta}}^{\left(j\right)}-{\beta}_{0}^{\left(j\right)})\hfill \\ \hfill \sqrt{Nh}(\stackrel{\u2012}{f}\left({z}_{0}\right)-{f}_{0}\left({z}_{0}\right)-{W}_{\lambda}{f}_{0}\left({z}_{0}\right))\hfill \end{array}\right)\u21ddN\left(0,{\sigma}^{2}\left(\begin{array}{cc}\hfill {\Omega}^{-1}\hfill & \hfill 0\hfill \\ \hfill 0\hfill & \hfill {\Sigma}_{22}^{\ast}\hfill \end{array}\right)\right),$$

(3.13)

where ${\Sigma}_{22}^{\ast}={\sigma}_{{z}_{0}}^{2}+{\gamma}_{{z}_{0}}^{T}{\Omega}^{-1}{\gamma}_{{z}_{0}}$;

- (ii) if *s* is fixed, then

$$\left(\begin{array}{c}\hfill \sqrt{n}({\widehat{\beta}}^{\left(j\right)}-{\beta}_{0}^{\left(j\right)})\hfill \\ \hfill \sqrt{Nh}(\stackrel{\u2012}{f}\left({z}_{0}\right)-{f}_{0}\left({z}_{0}\right)-{W}_{\lambda}{f}_{0}\left({z}_{0}\right))\hfill \end{array}\right)\u21ddN\left(0,{\sigma}^{2}\left(\begin{array}{cc}\hfill {\Omega}^{-1}\hfill & \hfill {s}^{-1\u22152}{\Sigma}_{21}^{\ast}\hfill \\ \hfill {s}^{-1\u22152}{\Sigma}_{12}^{\ast}\hfill & \hfill {\Sigma}_{22}^{\ast}\hfill \end{array}\right)\right),$$

(3.14)

where ${\Sigma}_{12}^{\ast}={\Sigma}_{21}^{\ast T}={\Omega}^{-1}{\gamma}_{{z}_{0}}$ and ${\Sigma}_{22}^{\ast}$ is the same as in (i).

Moreover, if *h* → 0, then ${\Sigma}_{12}^{\ast}={\Sigma}_{21}^{\ast}=0$ and ${\Sigma}_{22}^{\ast}={\sigma}_{{z}_{0}}^{2}$ in (i) and (ii).

The nonparametric estimation bias ${W}_{\lambda}{f}_{0}\left({z}_{0}\right)$ can be removed by undersmoothing, i.e., letting *h* → 0; see the examples in Section 4, where undersmoothing is implicitly assumed for exactly this purpose.

By examining the proof for case (ii) of Theorem 3.4 (and taking *s* = 1), we know that the oracle estimate ${\widehat{f}}_{or}$ defined in (3.1) attains the same asymptotic distribution as that of $\stackrel{\u2012}{f}$ in (3.13).
In Section 4, we apply Theorem 3.4 to several examples, and find that even though the minimization (3.2) is based on only one fraction of the entire sample, it is nonetheless essential to regularize each sub-estimation as if it had the entire sample. In other words, *λ* should be chosen at an order determined by *N* rather than *n*. A similar phenomenon also arises in analyzing the minimax optimality of each sub-estimation; see Section 3.2.

Cheng and Shang (2013) have recently uncovered a *joint asymptotics phenomenon* in partial smoothing spline models: the parametric estimate and the (point-wise) nonparametric smoothing spline estimate become asymptotically independent after the parametric bias is removed. This corresponds to a special case of Part (ii) of Theorem 3.4 for polynomially decaying kernels with *s* = 1 and *h* → 0. Therefore, case (ii) in Theorem 3.4 generalizes this new phenomenon to partially linear kernel ridge regression models. When $h\nrightarrow 0$, e.g., $h\asymp {r}^{-1}$ for a finite rank kernel, the semi-nonparametric estimation in consideration essentially reduces to a parametric one. Hence, it is not surprising that the asymptotic dependence remains.

Theorem 3.4 implies that $\sqrt{n}({\widehat{\beta}}^{\left(j\right)}-{\beta}_{0}^{\left(j\right)})\u21ddN(0,{\sigma}^{2}{\Omega}^{-1})$ when *λ* = *o*(*n*^{−1/2}). When the error follows a Gaussian distribution, it is well known that ${\widehat{\beta}}^{\left(j\right)}$ achieves the semiparametric efficiency bound (Kosorok, 2007). Hence, the semiparametric efficient estimate can be obtained by applying the kernel ridge method. However, we can further improve its estimation efficiency to a parametric level by taking advantage of $\stackrel{\u2012}{f}$ (built on the whole sample). This is one important feature of massive data: strength-borrowing.

The previous sections show that the combined estimate $\stackrel{\u2012}{f}$ achieves the “oracle property” in both asymptotic and non-asymptotic senses when *s* does not grow too fast and *λ* is chosen according to the entire sample size. In this section, we employ $\stackrel{\u2012}{f}$ to boost the estimation efficiency of ${\widehat{\beta}}^{\left(j\right)}$ from the semiparametric level to the parametric level. This leads to our final estimate for heterogeneity, i.e., ${\stackrel{\u02c7}{\beta}}^{\left(j\right)}$ defined in (3.15). More importantly, ${\stackrel{\u02c7}{\beta}}^{\left(j\right)}$ possesses the limit distribution as if the commonality in each sub-population were known, and hence satisfies the “oracle rule”. This interesting efficiency boosting phenomenon is empirically verified in Section 6.

Specifically, we define the following improved estimator for
**β**_{0}:

$${\stackrel{\u02c7}{\beta}}^{\left(j\right)}=\underset{\beta \in {\mathbb{R}}^{p}}{\mathrm{argmin}}\frac{1}{n}\sum _{i\in {L}_{j}}{({Y}_{i}-{\mathit{X}}_{i}^{T}\beta -\stackrel{\u2012}{f}\left({Z}_{i}\right))}^{2}.$$

(3.15)
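Computationally, (3.15) is an ordinary least squares fit of the residuals ${Y}_{i}-\stackrel{\u2012}{f}\left({Z}_{i}\right)$ on ${\mathit{X}}_{i}$. A minimal sketch follows; the stand-in for $\stackrel{\u2012}{f}$ and all toy values are assumptions for illustration (in the paper $\stackrel{\u2012}{f}$ is the aggregated kernel ridge estimate).

```python
import numpy as np

# Improved estimator (3.15): re-fit the parametric part on sub-population j
# after plugging in the aggregated nonparametric estimate f_bar.
rng = np.random.default_rng(1)
n, p = 500, 2
X = rng.normal(size=(n, p))
Z = rng.uniform(size=n)
beta0_j = np.array([0.7, -1.2])
f0 = lambda z: np.sin(2 * np.pi * z)
Y = X @ beta0_j + f0(Z) + rng.normal(scale=0.1, size=n)

f_bar = f0    # stand-in: pretend aggregation recovered f_0 exactly
# beta_check^{(j)} = argmin_beta (1/n) sum_i (Y_i - X_i^T beta - f_bar(Z_i))^2
beta_check, *_ = np.linalg.lstsq(X, Y - f_bar(Z), rcond=None)
assert np.max(np.abs(beta_check - beta0_j)) < 0.05
```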

Theorem 3.5 below shows that ${\stackrel{\u02c7}{\beta}}^{\left(j\right)}$ achieves the parametric efficiency bound as if the nonparametric component *f* were known. This is not surprising given that the nonparametric estimate $\stackrel{\u2012}{f}$ possesses a faster convergence rate after aggregation. What is truly interesting is that we need to set a lower bound for *s*, i.e., (3.16), so the homogeneous data setting is trivially excluded. This lower bound requirement slows down the convergence rate of ${\stackrel{\u02c7}{\beta}}^{\left(j\right)}$, i.e., √*n*, such that $\stackrel{\u2012}{f}$ can be treated as if it were known.

Suppose Assumptions 3.1 and 3.2 hold. If *s* satisfies

$${s}^{-1}=o\left({h}^{2}\phantom{\rule{thinmathspace}{0ex}}{\mathrm{log}}^{-4}\phantom{\rule{thinmathspace}{0ex}}N\right),$$

(3.16)

$$s=o\left(N{h}^{2}{(\omega (\mathcal{F},1)+\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}N)}^{-2}\right),$$

(3.17)

$$s{\left(Nh\right)}^{-1}\phantom{\rule{thinmathspace}{0ex}}{\mathrm{log}}^{4}\phantom{\rule{thinmathspace}{0ex}}N+\lambda =o\left({h}^{2}{(\omega (\mathcal{F},1)+\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}N)}^{-2}\phantom{\rule{thinmathspace}{0ex}}{\mathrm{log}}^{-2}\phantom{\rule{thinmathspace}{0ex}}N\right),$$

(3.18)

and we choose *λ* =
*o*((*Nh*)^{−1}), then we
have

$$\sqrt{n}({\stackrel{\u02c7}{\beta}}^{\left(j\right)}-{\beta}_{0}^{\left(j\right)})\u21ddN(0,{\sigma}^{2}{\Sigma}^{-1}),$$

where $\Sigma =\mathbb{E}\left[\mathit{X}{\mathit{X}}^{T}\right]$.

Recall that $\Sigma =\mathbb{E}\left[\mathit{X}{\mathit{X}}^{T}\right]$ dominates Ω in the positive semi-definite order, so the asymptotic variance ${\sigma}^{2}{\Sigma}^{-1}$ of ${\stackrel{\u02c7}{\beta}}^{\left(j\right)}$ is no larger than the variance ${\sigma}^{2}{\Omega}^{-1}$ of ${\widehat{\beta}}^{\left(j\right)}$ in Theorem 3.4; this quantifies the efficiency boosting.

The heterogeneity across different sub-populations is a crucial feature of massive data. Nevertheless, some sub-populations may still share the same underlying distribution. In this section, we consider testing for heterogeneity among sub-populations. We start from a simple pairwise test, and then extend it to a more challenging simultaneous test that can be applied to a large number of sub-populations.

Consider a general class of pairwise heterogeneity testing:

$${H}_{0}:Q({\beta}_{0}^{\left(j\right)}-{\beta}_{0}^{\left(k\right)})=0\phantom{\rule{1em}{0ex}}\text{for}\phantom{\rule{1em}{0ex}}j\ne k,$$

(3.19)

where $Q={({Q}_{1}^{T},\dots ,{Q}_{q}^{T})}^{T}$ is a *q × p* matrix with *q ≤ p*. The general formulation (3.19) can test whether the whole vector or a sub-vector of ${\beta}_{0}^{\left(j\right)}$ is equal to that of ${\beta}_{0}^{\left(k\right)}$. A test statistic can be constructed based on either $\widehat{\beta}$ or its improved version $\stackrel{\u02c7}{\beta}$. Let ${C}_{\alpha}\subset {\mathbb{R}}^{q}$ be a confidence region satisfying $\mathbb{P}(\mathit{b}\in {C}_{\alpha})=1-\alpha $ for $\mathit{b}\sim N(0,{I}_{q})$. Define the tests

$$\begin{array}{cc}\hfill {\Psi}_{1}=& I\{Q({\widehat{\beta}}^{\left(j\right)}-{\widehat{\beta}}^{\left(k\right)})\notin \sqrt{2\u2215n}\sigma {\left(Q{\Omega}^{-1}{Q}^{T}\right)}^{1\u22152}{C}_{\alpha}\},\hfill \\ \hfill {\Psi}_{2}=& I\{Q({\stackrel{\u02c7}{\beta}}^{\left(j\right)}-{\stackrel{\u02c7}{\beta}}^{\left(k\right)})\notin \sqrt{2\u2215n}\sigma {\left(Q{\Sigma}^{-1}{Q}^{T}\right)}^{1\u22152}{C}_{\alpha}\}.\hfill \end{array}$$

The consistency of the above tests is guaranteed by Theorem 3.6 below. In addition, we note that the power of the latter test is larger than that of the former; see the analysis below Theorem 3.6. The price we need to pay for this larger power is a lower bound on *s*.

Suppose that the conditions in Theorem 3.4 are satisfied. Under the null hypothesis specified in (3.19), we have

$$\sqrt{n}Q({\widehat{\beta}}^{\left(j\right)}-{\widehat{\beta}}^{\left(k\right)})\u21ddN(0,2{\sigma}^{2}Q{\Omega}^{-1}{Q}^{T}).$$

Moreover, under the conditions in Theorem 3.5, we have

$$\sqrt{n}Q({\stackrel{\u02c7}{\beta}}^{\left(j\right)}-{\stackrel{\u02c7}{\beta}}^{\left(k\right)})\u21ddN(0,2{\sigma}^{2}Q{\Sigma}^{-1}{Q}^{T}),$$

where $\Sigma =\mathbb{E}\left[\mathit{X}{\mathit{X}}^{T}\right]$.

The larger power of **Ψ**_{2} is due to the
smaller asymptotic variance of ${\stackrel{\u02c7}{\beta}}^{\left(j\right)}$, and can be deduced from the following power function. For
simplicity, we consider ${H}_{0}:{\beta}_{01}^{\left(j\right)}-{\beta}_{01}^{\left(k\right)}=0$, i.e., *Q* = (1, 0, ..., 0). In this
case, we have ${\Psi}_{1}=I\{\mid {\widehat{\beta}}_{1}^{\left(j\right)}-{\widehat{\beta}}_{1}^{\left(k\right)}\mid >\sqrt{2}\sigma {\left[{\Omega}^{-1}\right]}_{11}^{1\u22152}{z}_{\alpha \u22152}\u2215\sqrt{n}\}$, and ${\Psi}_{2}=I\{\mid {\stackrel{\u02c7}{\beta}}_{1}^{\left(j\right)}-{\stackrel{\u02c7}{\beta}}_{1}^{\left(k\right)}\mid >\sqrt{2}\sigma {\left[{\Sigma}^{-1}\right]}_{11}^{1\u22152}{z}_{\alpha \u22152}\u2215\sqrt{n}\}$. The (asymptotic) power function under the alternative
that ${\beta}_{01}^{\left(j\right)}-{\beta}_{01}^{\left(k\right)}={\beta}^{\ast}$ for some non-zero *β** is

$$\text{Power}\left({\beta}^{\ast}\right)=1-\mathbb{P}\left(W\in \left[-\frac{{\beta}^{\ast}\sqrt{n}}{{\sigma}^{\ast}}\pm {z}_{\alpha \u22152}\right]\right),$$

where *W* ~ *N*(0, 1)
and *σ** is $\sqrt{2}\sigma {\left[{\Omega}^{-1}\right]}_{11}^{1\u22152}$ for **Ψ**_{1} and $\sqrt{2}\sigma {\left[{\Sigma}^{-1}\right]}_{11}^{1\u22152}$ for **Ψ**_{2}. Hence, a smaller
*σ** gives rise to a larger power, and
**Ψ**_{2} is more powerful than
**Ψ**_{1}. Please see Section 6 for empirical
support for this power comparison.
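The power function above is available in closed form, so the comparison between **Ψ**_{1} and **Ψ**_{2} can be illustrated directly. The sample size and *σ** values below are toy assumptions for illustration.

```python
import math

# Power(beta_star) = 1 - P(W in [-beta_star*sqrt(n)/sigma_star +/- z_{alpha/2}]),
# with W ~ N(0, 1); a smaller sigma_star gives uniformly larger power.
def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def power(beta_star, n, sigma_star, z_half=1.959964):  # alpha = 0.05
    center = -beta_star * math.sqrt(n) / sigma_star
    return 1.0 - (normal_cdf(center + z_half) - normal_cdf(center - z_half))

n = 400
# Psi_2 uses a smaller sigma_star than Psi_1, hence has larger power:
assert power(0.2, n, sigma_star=1.0) > power(0.2, n, sigma_star=1.5)
# Under the null (beta_star = 0) the rejection probability equals alpha:
assert abs(power(0.0, n, sigma_star=1.0) - 0.05) < 1e-6
```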

We next consider a simultaneous test that applies to a large number of sub-populations:

$${H}_{0}:{\beta}^{\left(j\right)}={\stackrel{~}{\beta}}^{\left(j\right)}\phantom{\rule{1em}{0ex}}\text{for all}\phantom{\rule{1em}{0ex}}j\in \mathcal{G},$$

(3.20)

where $\mathcal{G}\subset \{1,2,\dots ,s\}$, versus the alternative:

$${H}_{1}:{\beta}^{\left(j\right)}\ne {\stackrel{~}{\beta}}^{\left(j\right)}\phantom{\rule{1em}{0ex}}\text{for some}\phantom{\rule{1em}{0ex}}j\in \mathcal{G}.$$

(3.21)

The above ${\stackrel{~}{\beta}}^{\left(j\right)}$'s are pre-specified for each $j\in \mathcal{G}$. If all ${\stackrel{~}{\beta}}^{\left(j\right)}$'s are the same, then (3.20) becomes a type of heterogeneity test for the group of sub-populations indexed by $\mathcal{G}$. Here we allow $\mid \mathcal{G}\mid $ to be as large as *s*, and thus it can increase with *n*. Let ${\widehat{\Sigma}}^{\left(j\right)}$ be the sample covariance matrix of ** X** for the *j*-th sub-population. Consider the test statistic

$${T}_{\mathcal{G}}\u2254\underset{j\in \mathcal{G},1\le k\le p}{\mathrm{max}}\sqrt{n}({\stackrel{\u02c7}{\beta}}_{k}^{\left(j\right)}-{\stackrel{~}{\beta}}_{k}^{\left(j\right)}).$$

We approximate the distribution of the above test statistic using multiplier bootstrap. Define the following quantity:

$${W}_{\mathcal{G}}\u2254\underset{j\in \mathcal{G},1\le k\le p}{\mathrm{max}}\frac{1}{\sqrt{n}}\sum _{i\in {L}_{j}}{\left({\widehat{\Sigma}}^{\left(j\right)}\right)}_{k}^{-1}{\mathit{X}}_{i}{e}_{i},$$

where the *e _{i}*'s are i.i.d. *N*(0, 1) multipliers, independent of the data. Let ${c}_{\mathcal{G}}\left(\alpha \right)$ denote the (1 − *α*)-quantile of ${W}_{\mathcal{G}}$ conditional on the data.
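Operationally, the multiplier bootstrap draws many realizations of ${W}_{\mathcal{G}}$ and takes an empirical quantile as the critical value. The sketch below uses synthetic designs as stand-ins; all toy values are assumptions for illustration.

```python
import numpy as np

# Multiplier bootstrap for the max-type statistic: recompute W_G over many
# draws of i.i.d. N(0, 1) multipliers e_i and take the (1 - alpha)-quantile.
rng = np.random.default_rng(2)
s, n, p = 5, 200, 3
X = [rng.normal(size=(n, p)) for _ in range(s)]      # design per sub-population
Sigma_inv = [np.linalg.inv(x.T @ x / n) for x in X]  # (hat Sigma^{(j)})^{-1}

def draw_W(rng):
    vals = []
    for j in range(s):
        e = rng.normal(size=n)                       # multipliers e_i ~ N(0, 1)
        vals.append(np.max(Sigma_inv[j] @ (X[j].T @ e) / np.sqrt(n)))
    return max(vals)

B, alpha = 2000, 0.05
W_draws = np.array([draw_W(rng) for _ in range(B)])
c_alpha = np.quantile(W_draws, 1 - alpha)            # bootstrap critical value
assert c_alpha > 0
```

The null is rejected when the observed ${T}_{\mathcal{G}}$ exceeds `c_alpha`.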

Suppose Assumptions 3.1 and 3.2 hold. In addition, suppose (3.17) and (3.18) in Theorem 3.5 hold.
For any $\mathcal{G}\subset \{1,2,\dots ,s\}$ with $\mid \mathcal{G}\mid =d$, if (i) $s\gtrsim {h}^{-2}\phantom{\rule{thinmathspace}{0ex}}\mathrm{log}\left(pd\right)\phantom{\rule{thinmathspace}{0ex}}{\mathrm{log}}^{4}\phantom{\rule{thinmathspace}{0ex}}N$, (ii) ${\left(\mathrm{log}\left(pdn\right)\right)}^{7}\u2215n\le {C}_{1}{n}^{-{c}_{1}}$ for some constants *c*_{1},
*C*_{1} > 0, and (iii) ${p}^{2}\phantom{\rule{thinmathspace}{0ex}}\mathrm{log}\left(pd\right)\u2215\sqrt{n}=o\left(1\right)$, then under *H*_{0} and choosing
*λ* =
*o*((*Nh*)^{−1}), we have

$$\underset{\alpha \in (0,1)}{\mathrm{sup}}\mid \mathbb{P}\left({T}_{\mathcal{G}}>{c}_{\mathcal{G}}\left(\alpha \right)\right)-\alpha \mid =o\left(1\right).$$

We can perform heterogeneity testing even without specifying the ${\stackrel{~}{\beta}}^{\left(j\right)}$'s. This can be done by simply reformulating the null hypothesis as follows (for simplicity we set $\mathcal{G}=\left[s\right]$): *H*_{0} : *α*^{(j)} = **0** for all $j\in \left[s-1\right]$, where *α*^{(j)} = *β*^{(j)} – *β*^{(j+1)} for *j* = 1,...,*s* – 1. The test statistic is ${T}_{\mathcal{G}}^{\prime}={\mathrm{max}}_{1\le j\le s-1}\phantom{\rule{thinmathspace}{0ex}}{\mathrm{max}}_{1\le k\le p}\sqrt{n}({\stackrel{\u02c7}{\beta}}_{k}^{\left(j\right)}-{\stackrel{\u02c7}{\beta}}_{k}^{(j+1)})$. The bootstrap quantity is defined as

$${W}_{\mathcal{G}}^{\prime}\u2254\underset{1\le j\le s-1,1\le k\le p}{\mathrm{max}}\frac{1}{\sqrt{n}}\sum _{i\in {L}_{j}}{\left({\widehat{\Sigma}}^{\left(j\right)}\right)}_{k}^{-1}{\mathit{X}}_{i}{e}_{i}-\frac{1}{\sqrt{n}}\sum _{i\in {L}_{j+1}}{\left({\widehat{\Sigma}}^{(j+1)}\right)}_{k}^{-1}{\mathit{X}}_{i}{e}_{i}.$$

The proof is similar to that of Theorem 3.7 and is omitted.

In this section, we consider three specific classes of RKHS with different
smoothness, characterized by the decaying rate of the eigenvalues: finite rank,
exponential decay and polynomial decay. In particular, we give explicit upper bounds
for *s* under which the combined estimate enjoys the oracle property,
and also explicit lower bounds for obtaining efficiency boosting studied in Section
3.5. Interestingly, we find that the upper bound for *s* increases
for RKHS with faster decaying eigenvalues. Hence, our aggregation procedure favors
smoother regression functions in the sense that more sub-populations are allowed to
be included in the observations. The choice of *λ* is also
explicitly characterized in terms of the entire sample size and the decaying rate of
eigenvalues. In all three examples, undersmoothing is implicitly assumed for
removing the nonparametric estimation bias. Our bounds on *s* and
*λ* here are not the most general ones. Rather, we present
the bounds that have less complicated forms but are still sufficient to deliver
theoretical insights.

The RKHS with finite rank kernels includes linear functions, polynomial
functions, and, more generally, functional classes with finite dictionaries. In
this case, the effective dimension is simply proportional to the rank
*r*. Hence, $h\asymp {r}^{-1}$. Combining this fact with Theorem 3.4, we get the following
corollary for finite rank kernels:

Suppose Assumptions 3.1–3.3 hold and *s*
→ ∞. For any ${z}_{0}\in \mathcal{Z}$, if *λ* =
*o*(*N*^{−1/2}),
log(*λ*^{−1}) =
*o*(*N*^{2}log^{−12}*N*)
and $s=o\left({\scriptstyle \frac{N}{\sqrt{\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{\lambda}^{-1}}\phantom{\rule{thinmathspace}{0ex}}{\mathrm{log}}^{6}\phantom{\rule{thinmathspace}{0ex}}N}}\right)$, then

$$\left(\begin{array}{c}\hfill \sqrt{n}({\widehat{\beta}}^{\left(j\right)}-{\beta}_{0}^{\left(j\right)})\hfill \\ \hfill \sqrt{N}(\stackrel{\u2012}{f}\left({z}_{0}\right)-{f}_{0}\left({z}_{0}\right))\hfill \end{array}\right)\u21ddN\left(0,{\sigma}^{2}\left(\begin{array}{cc}\hfill {\Omega}^{-1}\hfill & \hfill 0\hfill \\ \hfill 0\hfill & \hfill {\Sigma}_{22}^{\ast}\hfill \end{array}\right)\right),$$

where ${\Sigma}_{22}^{\ast}={\sum}_{\ell =1}^{r}{\varphi}_{\ell}{\left({z}_{0}\right)}^{2}+{\gamma}_{{z}_{0}}^{T}{\Omega}^{-1}{\gamma}_{{z}_{0}}$ and ${\gamma}_{{z}_{0}}={\sum}_{\ell =1}^{r}\langle \mathit{B},{\varphi}_{\ell}\rangle {}_{{L}_{2}\left({\mathbb{P}}_{Z}\right)}{\varphi}_{\ell}\left({z}_{0}\right)$.

From the above corollary, we can easily tell that the upper bound for *s* can be as large as *o*(*N* log^{−7} *N*) by choosing a sufficiently large *λ*. Hence, *s* can be chosen nearly as large as *N*. As for the lower bound on *s* for boosting the efficiency, we have $s\gtrsim {r}^{2}\phantom{\rule{thinmathspace}{0ex}}{\mathrm{log}}^{4}\phantom{\rule{thinmathspace}{0ex}}N$ by plugging $h\asymp {r}^{-1}$ into (3.16). This lower bound is clearly smaller than the upper bound, so efficiency boosting is feasible.

Corollary 4.2 below specifies conditions on *s* and *λ* under which $\stackrel{\u2012}{f}$ achieves the nonparametric minimax rate.

Suppose that Assumptions 3.1–3.3 hold. When *λ* =
*r/N* and *s* =
*o*(*N* log^{−5}
*N*), we have

$$\mathbb{E}\left[{\Vert \stackrel{\u2012}{f}-{f}_{0}\Vert}_{{L}_{2}\left({\mathbb{P}}_{Z}\right)}^{2}\right]\le Cr\u2215N,$$

for some constant *C*.

We next consider the RKHS for which the kernel has exponentially
decaying eigenvalues, i.e., ${\mu}_{\ell}=\mathrm{exp}(-\alpha {\ell}^{p})$ for some *α* > 0. In this case,
we have $h\asymp {\left(\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{\lambda}^{-1}\right)}^{-1\u2215p}$ by explicit calculations.

Suppose Assumption 3.1 – 3.3 hold, and for any ${z}_{0}\in \mathcal{Z},\phantom{\rule{thickmathspace}{0ex}}{f}_{0}\in \mathcal{H}$ satisfies ${\sum}_{\ell =1}^{\infty}\mid {\varphi}_{\ell}\left({z}_{0}\right){\langle {f}_{0},{\varphi}_{\ell}\rangle}_{\mathcal{H}}\mid <\infty $. If $\lambda =o({N}^{-1\u22152}\phantom{\rule{thinmathspace}{0ex}}{\mathrm{log}}^{1\u2215\left(2p\right)}N\wedge {n}^{-1\u22152})$, $\mathrm{log}\left({\lambda}^{-1}\right)=o\left({N}^{p\u2215(p+4)}{\mathrm{log}}^{-6p\u2215(p+4)}N\right)$ and $s=o\left({\scriptstyle \frac{N}{{\mathrm{log}}^{6}\phantom{\rule{thinmathspace}{0ex}}N\phantom{\rule{thinmathspace}{0ex}}{\mathrm{log}}^{(p+4)\u2215p}{\lambda}^{-1}}}\right)$, then

$$\left(\begin{array}{c}\hfill \sqrt{n}({\widehat{\beta}}^{\left(j\right)}-{\beta}_{0}^{\left(j\right)})\hfill \\ \hfill \sqrt{Nh}(\stackrel{\u2012}{f}\left({z}_{0}\right)-{f}_{0}\left({z}_{0}\right))\hfill \end{array}\right)\u21ddN\left(0,{\sigma}^{2}\left(\begin{array}{cc}\hfill {\Omega}^{-1}\hfill & \hfill 0\hfill \\ \hfill 0\hfill & \hfill {\sigma}_{{z}_{0}}^{2}\hfill \end{array}\right)\right),$$

where ${\sigma}_{{z}_{0}}^{2}={\mathrm{lim}}_{N\to \infty}h{\sum}_{\ell =1}^{\infty}{\scriptstyle \frac{{\varphi}_{\ell}^{2}\left({z}_{0}\right)}{{(1+\lambda \u2215{\mu}_{\ell})}^{2}}}$.

Corollary 4.3 implies that the confidence interval for *f*_{0}(*z*_{0}) shrinks at the rate (*Nh*)^{−1/2}. This motivates us to choose *λ* (equivalently *h*) as large as possible. Plugging such a *λ* into the upper bound of
*s* yields $s=o\left(N\phantom{\rule{thinmathspace}{0ex}}{\mathrm{log}}^{-(7p+4)\u2215p}N\right)$. For example, when *p* =
1(*p* = 2), the upper bound is $s=o\left(N\phantom{\rule{thinmathspace}{0ex}}{\mathrm{log}}^{-11}\phantom{\rule{thinmathspace}{0ex}}N\right)(s=o\left(N\phantom{\rule{thinmathspace}{0ex}}{\mathrm{log}}^{-9}\phantom{\rule{thinmathspace}{0ex}}N\right))$. Note that this upper bound for *s* only
differs from that for the finite rank kernel up to some logarithmic term. This is mainly because an RKHS with exponentially decaying eigenvalues has an effective dimension of order (log *N*)^{1/p} (for the above *λ*). Again, by (3.16) we get the lower bound $s\gtrsim {\left(\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{\lambda}^{-1}\right)}^{2\u2215p}\phantom{\rule{thinmathspace}{0ex}}{\mathrm{log}}^{4}\phantom{\rule{thinmathspace}{0ex}}N$. When $\lambda \asymp {N}^{-1\u22152}\phantom{\rule{thinmathspace}{0ex}}{\mathrm{log}}^{1\u2215\left(2p\right)}N\wedge {n}^{-1\u22152}$, this is approximately $s\gtrsim {\mathrm{log}}^{(4p+2)\u2215p}N$.

As a concrete example, we consider the Gaussian kernel
*K*(*z*_{1},
*z*_{2}) = exp
(–|*z*_{1} –
*z*_{2}|^{2}/2). The eigenfunctions are
given in (2.1), and the
eigenvalues are exponentially decaying, as ${\mu}_{\ell}={\eta}^{2\ell +1}$, where *η* = (√5 − 1)/2. According to
Krasikov (2004), we can get that

$${c}_{\varphi}=\underset{\ell \in \mathbb{N}}{\mathrm{sup}}{\Vert {\varphi}_{\ell}\Vert}_{\mathrm{sup}}\le \frac{2{e}^{15\u22158}{(\sqrt{5}\u22154)}^{1\u22152}}{3\sqrt{2\pi}{2}^{1\u22156}}\le 1.336.$$

Thus, Assumption 3.2 is satisfied. We next give an upper bound of ${\sigma}_{{z}_{0}}^{2}$ in Corollary 4.3 as follows:

$${\sigma}_{{z}_{0}}^{2}\le \underset{N\to \infty}{\mathrm{lim}}{c}_{\varphi}^{2}h\sum _{\ell =0}^{\infty}{(1+\lambda {\eta}^{-1}\phantom{\rule{thinmathspace}{0ex}}\mathrm{exp}(-2\left(\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}\eta \right)\ell ))}^{-2}=2{c}_{\varphi}^{2}\phantom{\rule{thinmathspace}{0ex}}\mathrm{log}(1\u2215\eta )\le 1.7178,$$

where the equality follows from Lemma C.1 in Appendix C with *t* = 2. Hence, a (conservative) 100(1–*α*)% confidence interval for *f*_{0}(*z*_{0}) is given by $\stackrel{\u2012}{f}\left({z}_{0}\right)\pm 1.3106\phantom{\rule{thinmathspace}{0ex}}\sigma {z}_{\alpha \u22152}\u2215\sqrt{Nh}$.
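The numerical constants above can be verified directly, assuming only the bound $c_{\varphi} \le 1.336$ and *η* = (√5 − 1)/2 from the text:

```python
import math

# Check: 2 * c_phi^2 * log(1/eta) <= 1.7178, and sqrt(1.7178) ~= 1.3106 is
# the multiplier in the conservative confidence interval.
c_phi = 1.336
eta = (math.sqrt(5) - 1) / 2
bound = 2 * c_phi**2 * math.log(1 / eta)
assert bound <= 1.7179
assert abs(math.sqrt(bound) - 1.3106) < 1e-3
```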

Suppose that Assumptions 3.1–3.3 hold. By choosing
*λ* = log^{1/p}
*N/N* and *s* =
*o*(*N*
log^{–(5p+3)/p}
*N*), we have

$$\mathbb{E}\left[{\Vert \stackrel{\u2012}{f}-{f}_{0}\Vert}_{{L}_{2}\left({\mathbb{P}}_{Z}\right)}^{2}\right]\le C{\left(\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}N\right)}^{1\u2215p}\u2215N.$$

We know that the above rate is minimax optimal according to Zhang et al. (2013). Note that the upper bound for *s* required here is similar to that for obtaining the joint limiting distribution in Corollary 4.3.

We now consider the RKHS for which the kernel has polynomially decaying
eigenvalues, i.e., ${\mu}_{\ell}=c{\ell}^{-2\nu}$ for some *ν* > 1/2. Hence, we can
explicitly calculate that *h* =
*λ*^{1/(2ν)}. The
resulting penalized estimate is called a “partial smoothing
spline” in the statistics literature; see Gu (2013); Wang (2011).

Suppose Assumption 3.1 – 3.3 hold, and ${\sum}_{\ell =1}^{\infty}\mid {\varphi}_{\ell}\left({z}_{0}\right){\langle {f}_{0},{\varphi}_{\ell}\rangle}_{\mathcal{H}}\mid <\infty $ for any ${z}_{0}\in \mathcal{Z}$ and ${f}_{0}\in \mathcal{H}$. For any $\nu >1+\sqrt{3}\u22152\approx 1.866,\phantom{\rule{thickmathspace}{0ex}}\text{if}\phantom{\rule{thickmathspace}{0ex}}\lambda \asymp {N}^{-d}$ for some ${\scriptstyle \frac{2\nu}{4\nu +1}}<d<{\scriptstyle \frac{4{\nu}^{2}}{10\nu -1}},\phantom{\rule{thickmathspace}{0ex}}\lambda =o\left({n}^{-1\u22152}\right)$ and $s=o\left({\lambda}^{{\scriptstyle \frac{10\nu -1}{4{\nu}^{2}}}}N\phantom{\rule{thinmathspace}{0ex}}{\mathrm{log}}^{-6}\phantom{\rule{thinmathspace}{0ex}}N\right)$, then

$$\left(\begin{array}{c}\hfill \sqrt{n}({\widehat{\beta}}^{\left(j\right)}-{\beta}_{0}^{\left(j\right)})\hfill \\ \hfill \sqrt{Nh}(\stackrel{\u2012}{f}\left({z}_{0}\right)-{f}_{0}\left({z}_{0}\right))\hfill \end{array}\right)\u21ddN\left(0,{\sigma}^{2}\left(\begin{array}{cc}\hfill {\Omega}^{-1}\hfill & \hfill 0\hfill \\ \hfill 0\hfill & \hfill {\sigma}_{{z}_{0}}^{2}\hfill \end{array}\right)\right).$$

where ${\sigma}_{{z}_{0}}^{2}={\mathrm{lim}}_{N\to \infty}\phantom{\rule{thinmathspace}{0ex}}h{\sum}_{\ell =1}^{\infty}{\scriptstyle \frac{{\varphi}_{\ell}^{2}\left({z}_{0}\right)}{{(1+\lambda \u2215{\mu}_{\ell})}^{2}}}$.

Similarly, we choose $\lambda \asymp {N}^{-{\scriptstyle \frac{2\nu}{4\nu +1}}}\wedge {n}^{-1\u22152}$ to get the fastest shrinking rate of the confidence
interval. Plugging the above *λ* into the upper bound
for *s*, we get

$$s=o\left({N}^{\frac{8{\nu}^{2}-8\nu +1}{2\nu (4\nu +1)}}{\mathrm{log}}^{-6}\phantom{\rule{thinmathspace}{0ex}}N\wedge N{\left(\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}N\right)}^{-\frac{48{\nu}^{2}}{8{\nu}^{2}+10\nu +1}}\right).$$

When *N* is large, the above bound reduces to $s=o\left({N}^{{\scriptstyle \frac{8{\nu}^{2}-8\nu +1}{2\nu (4\nu +1)}}}{\mathrm{log}}^{-6}N\right)$. We notice that the upper bound for *s*
increases as *ν* increases, indicating that the
aggregation procedure favors smoother functions. As an example, for the case
that *ν* = 2, we have the upper bound for
*s* =
*o*(*N*^{17/36}
log^{−6}
*N*) ≈
*o*(*N*^{0.47}
log^{−6}
*N*). Again, we obtain the lower bound $s\gtrsim {\lambda}^{-1\u2215\nu}\phantom{\rule{thinmathspace}{0ex}}{\mathrm{log}}^{4}N$ by plugging $h\asymp {\lambda}^{{\scriptstyle \frac{1}{2\nu}}}$ into (3.16). When $\lambda \asymp {N}^{-{\scriptstyle \frac{2\nu}{4\nu +1}}}$, we get $s\gtrsim {N}^{{\scriptstyle \frac{2}{4\nu +1}}}{\mathrm{log}}^{4}N$. For *ν* = 2, this is approximately $s\gtrsim {N}^{0.22}\phantom{\rule{thinmathspace}{0ex}}{\mathrm{log}}^{4}\phantom{\rule{thinmathspace}{0ex}}N$.
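The exponent bookkeeping for the polynomially decaying case can be checked directly:

```python
from fractions import Fraction

# Upper-bound exponent (8*nu^2 - 8*nu + 1) / (2*nu*(4*nu + 1)); at nu = 2
# this is 17/36 ~= 0.47, matching s = o(N^{17/36} log^{-6} N) in the text.
def upper_exponent(nu):
    return Fraction(8 * nu**2 - 8 * nu + 1, 2 * nu * (4 * nu + 1))

assert upper_exponent(2) == Fraction(17, 36)
assert float(upper_exponent(2)) < 0.5      # strictly below the root-N regime
```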

As a concrete example, we consider the periodic Sobolev space ${H}_{0}^{\nu}[0,1]$ with the following eigenfunctions:

$${\varphi}_{\ell}\left(x\right)=\{\begin{array}{cc}\hfill 1,\hfill & \ell =0,\hfill \\ \hfill \sqrt{2}\phantom{\rule{thinmathspace}{0ex}}\mathrm{cos}\left(\ell \pi x\right),\hfill & \ell =2k\phantom{\rule{thickmathspace}{0ex}}\text{for}\phantom{\rule{thickmathspace}{0ex}}k=1,2,\dots ,\hfill \\ \hfill \sqrt{2}\phantom{\rule{thinmathspace}{0ex}}\mathrm{sin}\left((\ell +1)\pi x\right),\hfill & \ell =2k-1\phantom{\rule{thickmathspace}{0ex}}\text{for}\phantom{\rule{thickmathspace}{0ex}}k=1,2,\dots ,\hfill \end{array}\phantom{\}}$$

(4.1)

and eigenvalues

$${\mu}_{\ell}=\{\begin{array}{cc}\hfill \infty ,\hfill & \ell =0,\hfill \\ \hfill {\left(\ell \pi \right)}^{-2\nu},\hfill & \ell =2k\phantom{\rule{thickmathspace}{0ex}}\text{for}\phantom{\rule{thickmathspace}{0ex}}k=1,2,\dots ,\hfill \\ \hfill {\left((\ell +1)\pi \right)}^{-2\nu},\hfill & \ell =2k-1\phantom{\rule{thickmathspace}{0ex}}\text{for}\phantom{\rule{thickmathspace}{0ex}}k=1,2,\dots .\hfill \end{array}\phantom{\}}$$

(4.2)

Hence, Assumption 3.2 trivially holds. Under the above eigensystem, the following lemma gives an explicit expression of ${\sigma}_{{z}_{0}}^{2}$.

Under the eigen-system defined by (4.1) and (4.2), we can explicitly calculate:

$${\sigma}_{{z}_{0}}^{2}=\underset{N\to \infty}{\mathrm{lim}}h\sum _{\ell =1}^{\infty}\frac{{\varphi}_{\ell}^{2}\left({z}_{0}\right)}{{(1+\lambda \u2215{\mu}_{\ell})}^{2}}={\int}_{0}^{\infty}\frac{1}{{(1+{x}^{2\nu})}^{2}}dx=\frac{\pi}{2\nu \phantom{\rule{thinmathspace}{0ex}}\mathrm{sin}(\pi \u2215\left(2\nu \right))}.$$

Therefore, by Corollary 4.5, we have that when $\lambda \asymp {N}^{-{\scriptstyle \frac{2\nu}{4\nu +1}}}$ and $s=o\left({N}^{{\scriptstyle \frac{8{\nu}^{2}-8\nu +1}{2\nu (4\nu +1)}}}\phantom{\rule{thinmathspace}{0ex}}{\mathrm{log}}^{-6}\phantom{\rule{thinmathspace}{0ex}}N\right)$,

$$\left(\begin{array}{c}\hfill \sqrt{n}({\widehat{\beta}}^{\left(j\right)}-{\beta}_{0}^{\left(j\right)})\hfill \\ \hfill \sqrt{Nh}(\stackrel{\u2012}{f}\left({z}_{0}\right)-{f}_{0}\left({z}_{0}\right))\hfill \end{array}\right)\u21ddN\left(0,{\sigma}^{2}\left(\begin{array}{cc}\hfill {\Omega}^{-1}\hfill & \hfill 0\hfill \\ \hfill 0\hfill & \hfill {\sigma}_{{z}_{0}}^{2}\hfill \end{array}\right)\right).$$

(4.3)

where ${\sigma}_{{z}_{0}}^{2}$ is given in Lemma 4.1. When *ν* = 2, $\lambda \asymp {N}^{-4\u22159}$ and the upper bound for *s* is *o*(*N*^{17/36} log^{−6} *N*).

Suppose that Assumption 3.1 - 3.3 hold. If we choose $\lambda ={N}^{-{\scriptstyle \frac{2\nu}{2\nu +1}}}$, and $s=o\left({N}^{{\scriptstyle \frac{4{\nu}^{2}-4\nu +1}{4{\nu}^{2}+2\nu}}}\phantom{\rule{thinmathspace}{0ex}}{\mathrm{log}}^{-4}\phantom{\rule{thinmathspace}{0ex}}N\right)$, the combined estimator achieves optimal rate of convergence, i.e.,

$$\mathbb{E}\left[{\Vert \stackrel{\u2012}{f}-{f}_{0}\Vert}_{{L}_{2}\left({\mathbb{P}}_{Z}\right)}^{2}\right]\le C{N}^{-\frac{2\nu}{2\nu +1}}.$$

(4.4)

The above rate is known to be minimax optimal for the class of functions in consideration (Stone, 1985).

In this section, we apply the divide-and-conquer approach, which is commonly used to deal with massive homogeneous data, to some sub-populations that have huge sample sizes. A general goal of this section is to explore the most computationally efficient way to split the sample in those sub-populations while preserving the best possible statistical inference. Specifically, we want to derive the largest possible number of splits under which the averaged estimators for both components enjoy the same statistical performances as the “oracle” estimator that is computed based on the entire sample. Without loss of generality, we assume the entire sample to be homogeneous by setting all ${\beta}_{0}^{\left(j\right)}$'s to be equal throughout this section.

The divide-and-conquer method *randomly* splits the massive
data into *s* mutually exclusive subsamples. For simplicity, we
assume all the subsamples share the same sample size, denoted as *n*.
Hence, *N* = *n × s*. With a slight abuse of
notation, we define the divide-and-conquer estimators as ${\widehat{\beta}}^{\left(j\right)}$ and ${\widehat{f}}^{\left(j\right)}$ when they are based on the *j*-th subsample. Thus,
the averaged estimator is defined as

$$\stackrel{\u2012}{\beta}=(1\u2215s)\sum _{j=1}^{s}{\widehat{\beta}}^{\left(j\right)}\phantom{\rule{1em}{0ex}}\text{and}\phantom{\rule{1em}{0ex}}\stackrel{\u2012}{f}(\cdot )=(1\u2215s)\sum _{j=1}^{s}{\widehat{f}}^{\left(j\right)}(\cdot ).$$
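
A minimal simulation sketch of this procedure follows (nonparametric part only). The Gaussian kernel, bandwidth, and tuning values are illustrative assumptions, not the paper's choices; note that *λ* is set according to the entire sample size *N*, as the theory requires.

```python
import numpy as np

# Divide-and-conquer kernel ridge regression: split into s subsamples,
# fit one KRR per subsample, then average the s fitted functions.
rng = np.random.default_rng(3)
N, s = 600, 3
n = N // s
Z = rng.uniform(size=N)
f0 = lambda z: np.sin(2 * np.pi * z)
Y = f0(Z) + rng.normal(scale=0.2, size=N)

def krr_fit(z_tr, y_tr, lam, bw=0.1):
    m = len(z_tr)
    K = np.exp(-((z_tr[:, None] - z_tr[None, :]) ** 2) / (2 * bw**2))
    alpha = np.linalg.solve(K + m * lam * np.eye(m), y_tr)
    return lambda z: np.exp(-((z[:, None] - z_tr[None, :]) ** 2) / (2 * bw**2)) @ alpha

lam = 1.0 / N   # regularize as if each sub-fit saw all N samples
fits = [krr_fit(Z[j*n:(j+1)*n], Y[j*n:(j+1)*n], lam) for j in range(s)]
f_bar = lambda z: sum(f(z) for f in fits) / s

grid = np.linspace(0.05, 0.95, 50)
mse = np.mean((f_bar(grid) - f0(grid)) ** 2)
assert mse < 0.05   # the averaged fit recovers f_0 well on this toy example
```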

Compared with the oracle estimator, the aggregation procedure reduces the computational complexity, in terms of sample size, from the entire sample size *N* to the sub-sample size *N/s*. In the case of kernel ridge regression, the complexity of the oracle estimator is *O*(*N*^{3}), while our aggregation procedure (run on a single machine) reduces it to *O*(*N*^{3}*/s*^{2}). Propositions 5.1 and 5.2 below state conditions under which the divide-and-conquer estimators maintain the same statistical properties as the oracle estimate, i.e., the so-called oracle property.

Our first contribution is a non-asymptotic upper bound for $\mathrm{MSE}\left(\stackrel{\u2012}{f}\right)$.

Suppose that the conditions in Theorem 3.1 hold. Then the divide-and-conquer estimator satisfies

$$\mathrm{MSE}\left(\stackrel{\u2012}{f}\right)\le {C}_{1}{\sigma}^{2}{\left(Nh\right)}^{-1}+2{\Vert {f}_{0}\Vert}_{\mathcal{H}}^{2}\lambda +{C}_{2}{\lambda}^{2}+{s}^{-1}a(n,s,h,\lambda ,\omega ),$$

(5.1)

where the function *a*(*n, s, h, λ, ω*) and the constants *C*_{1} and *C*_{2} are defined in Theorem 3.1.

Our second contribution is on the joint asymptotic distribution under
the same conditions for (*s, λ*) required in the
heterogeneous data setting.

Suppose that the conditions in Theorem 3.4 hold. If we choose
*λ* =
*o*(*N*^{−1/2}), then

$$\left(\begin{array}{c}\hfill \sqrt{N}(\stackrel{\u2012}{\beta}-{\beta}_{0})\hfill \\ \hfill \sqrt{Nh}(\stackrel{\u2012}{f}\left({z}_{0}\right)-{f}_{0}\left({z}_{0}\right)-{W}_{\lambda}{f}_{0}\left({z}_{0}\right))\hfill \end{array}\right)\u21ddN\left(0,\left(\begin{array}{cc}\hfill {\sigma}^{2}{\Omega}^{-2}\hfill & \hfill {\Sigma}_{12}^{\ast}\hfill \\ \hfill {\Sigma}_{21}^{\ast}\hfill & \hfill {\Sigma}_{22}^{\ast}\hfill \end{array}\right)\right),$$

where ${\Sigma}_{12}^{\ast}={\Sigma}_{21}^{\ast T}={\sigma}^{2}{\Omega}^{-1}{\gamma}_{{z}_{0}}$ and ${\Sigma}_{22}^{\ast}={\sigma}^{2}({\sigma}_{{z}_{0}}^{2}+{\gamma}_{{z}_{0}}^{T}{\Omega}^{-1}{\gamma}_{{z}_{0}})$. Moreover, if *h* → 0, then
*γ*_{z0} =
**0**. In this case, ${\Sigma}_{12}^{\ast}={\Sigma}_{21}^{\ast T}=0$ and ${\Sigma}_{22}^{\ast}={\sigma}^{2}{\sigma}_{{z}_{0}}^{2}$.

The conclusion of Proposition 5.2 holds regardless of whether *s* is fixed or diverges (as long as the condition on *s* in Theorem 3.4 is satisfied).

In view of Propositions 5.1 and 5.2, we note that the above upper bound
and joint asymptotic distribution are exactly the same as those for the oracle
estimate, i.e., *s* = 1.

In this section, we empirically examine the impact of the number of sub-populations on the statistical inference built on $({\widehat{\beta}}^{\left(j\right)},\stackrel{\u2012}{f})$. As will be seen, the simulation results strongly support our general theory.

Specifically, we consider the partial smoothing spline models in Section 4.3. In the simulation setup, we let *ε* ~ *N*(0,1), *p* = 1 and *ν* = 2 (cubic spline). Moreover, *Z* ~ Uniform(−1, 1) and *X* = (*W* + *Z*)/2, where *W* ~ Uniform(−1, 1), such that *X* and *Z* are dependent. It is easy to show that Ω = *E*[(*X* – *E*[*X*|*Z*])^{2}] = 1/12 and Σ = *E*[*X*^{2}] = 1/6. To design the heterogeneous data setting, we let ${\beta}_{0}^{\left(j\right)}=j$ for *j* = 1,2,...,*s* on the *j*-th sub-population. The nonparametric function *f*_{0}(*z*), which is common across all sub-populations, is set to 0.6*b*_{30,17}(*z*) + 0.4*b*_{3,11}(*z*), where *b*_{α1,α2} is the density function of *Beta*(*α*_{1}, *α*_{2}).
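This simulation design can be reproduced with a short sketch (the function names are ours; the Beta densities are evaluated directly and vanish outside (0, 1)):

```python
import math
import numpy as np

def beta_pdf(z, a, b):
    """Density of Beta(a, b); zero outside (0, 1)."""
    z = np.asarray(z, dtype=float)
    out = np.zeros_like(z)
    inside = (z > 0) & (z < 1)
    const = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    zi = z[inside]
    out[inside] = const * zi ** (a - 1) * (1 - zi) ** (b - 1)
    return out

def f0(z):
    # common nonparametric component: 0.6*b_{30,17}(z) + 0.4*b_{3,11}(z)
    return 0.6 * beta_pdf(z, 30, 17) + 0.4 * beta_pdf(z, 3, 11)

def simulate_subpopulation(j, n, rng):
    """Draw n observations from the j-th sub-population, with beta_0^{(j)} = j."""
    z = rng.uniform(-1, 1, n)
    w = rng.uniform(-1, 1, n)
    x = (w + z) / 2.0            # X depends on Z by construction
    y = j * x + f0(z) + rng.standard_normal(n)
    return y, x, z
```

With this design one can check numerically that *E*[*X*²] ≈ 1/6, matching the Σ computed above.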

We start from the 95% predictive interval (at
(*x*_{0}, *z*_{0})) implied by the
joint asymptotic distribution (4.3):

$$\left[{\widehat{Y}}^{\left(j\right)}\pm 1.96\sigma \sqrt{{x}_{0}^{T}{\Omega}^{-1}{x}_{0}\u2215n+{\sigma}_{{z}_{0}}^{2}\u2215\left(Nh\right)+1}\right],$$

where ${\widehat{Y}}^{\left(j\right)}={x}_{0}^{T}{\widehat{\beta}}^{\left(j\right)}+\stackrel{\u2012}{f}\left({z}_{0}\right)$ is the predicted response. The unknown error variance *σ*^{2} is estimated by ${\left({\widehat{\sigma}}^{\left(j\right)}\right)}^{2}={\sum}_{i\in {L}_{j}}{({Y}_{i}-{X}_{i}^{T}{\widehat{\beta}}^{\left(j\right)}-{\widehat{f}}^{\left(j\right)}\left({Z}_{i}\right))}^{2}\u2215(n-\mathrm{Tr}\left(A\left(\lambda \right)\right))$, where *A*(*λ*) denotes the smoothing matrix, followed by the aggregation ${\stackrel{\u2012}{\sigma}}^{2}=(1\u2215s){\sum}_{j=1}^{s}{\left({\widehat{\sigma}}^{\left(j\right)}\right)}^{2}$. In the simulations, we fix *x*_{0} = 0.5
and choose *z*_{0} = 0.25, 0.5, 0.75 and 0.95. The coverage
probability is calculated based on 200 repetitions. As for *N* and
*s*, we set *N* = 256, 512, 1024, 2048, 4096, and
choose *s* = 2^{0},
2^{1},...,2^{t–3} when
*N* = 2^{t}. The simulation results are summarized in
Figure 1. We notice an interesting phase
transition from Figure 1: when
*s* ≤ *s** where *s**
≈ *N*^{0.45}, the coverage probability is
approximately 95%; when *s* ≥ *s**, the
coverage probability drastically decreases. This empirical observation is strongly supported by our theory developed in Section 4.3, where *s** ≈ *N*^{0.45} log^{−6}*N* for *ν* = 2.
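For completeness, a small sketch of the predictive-interval computation from the display above, taking the plug-in quantities *Ω*^{−1}, σ̄ and ${\sigma}_{{z}_{0}}^{2}$ as given inputs (all names are ours):

```python
import numpy as np

def predictive_interval(y_hat, x0, Omega_inv, sigma_bar, sigma_z0_sq, n, N, h):
    """95% predictive interval: y_hat +/- 1.96*sigma*sqrt(x0' Omega^{-1} x0 / n
    + sigma_{z0}^2 / (N h) + 1); the trailing 1 accounts for the noise of a
    new observation."""
    x0 = np.atleast_1d(x0)
    half = 1.96 * sigma_bar * np.sqrt(x0 @ Omega_inv @ x0 / n
                                      + sigma_z0_sq / (N * h) + 1.0)
    return y_hat - half, y_hat + half
```

Because of the constant term under the square root, the interval length stays bounded below by 2 × 1.96σ̄ but shrinks toward that floor as *n* and *Nh* grow.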

We next compute the mean-squared errors of $\stackrel{\u2012}{f}$ under
different choices of *N* and *s* in Figure 2. It is demonstrated that the increasing
trends of MSE as *s* increases are very similar for different
*N*. More importantly, all the MSE curves suddenly blow up when
*s* ≈ *N*^{0.4}. This is also close
to our theoretical result that the transition point is around
*N*^{0.45} log^{−6}
*N*.

We next empirically verify the efficiency boosting theory developed in Section 3.5. Based on ${\widehat{\beta}}^{\left(j\right)}$ and ${\stackrel{\u02c7}{\beta}}^{\left(j\right)}$, we construct the following two types of 95% confidence intervals for ${\beta}_{0}^{\left(j\right)}$.

$$\begin{array}{cc}\hfill {\mathrm{CI}}_{1}=& \left[{\widehat{\beta}}^{\left(j\right)}\pm 1.96{\Omega}^{-1\u22152}{n}^{-1\u22152}\stackrel{\u2012}{\sigma}\right],\hfill \\ \hfill {\mathrm{CI}}_{2}=& \left[{\stackrel{\u02c7}{\beta}}^{\left(j\right)}\pm 1.96{\Sigma}^{-1\u22152}{n}^{-1\u22152}\stackrel{\u2012}{\sigma}\right].\hfill \end{array}$$

Obviously, CI_{2} is shorter than CI_{1}. However,
Theorem 3.5 shows that CI_{2} is valid only when *s*
satisfies both an upper bound and a lower bound. This theoretical condition is
empirically verified in Figure 3 which exhibits
the validity range of CI_{2} in terms of *s*. In Figure 4, we further compare CI_{2} and
CI_{1} in terms of their coverage probabilities and lengths. This figure
shows that when *s* is in a proper range, the coverage probabilities
of CI_{1} and CI_{2} are similar, while CI_{2} is
significantly shorter.
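In the scalar case (*p* = 1) with the simulation values Ω = 1/12 and Σ = 1/6, the length advantage of CI_{2} can be computed directly (a sketch with hypothetical helper names; since Σ = 2Ω here, CI_{2} is √2 ≈ 1.41 times shorter):

```python
def ci_lengths(n, sigma_bar, Omega=1 / 12, Sigma=1 / 6):
    """Lengths of CI_1 (based on Omega^{-1/2}) and CI_2 (based on Sigma^{-1/2})
    in the scalar (p = 1) case; Omega and Sigma default to the values 1/12 and
    1/6 computed for the simulation design above."""
    half1 = 1.96 * Omega ** (-0.5) * n ** (-0.5) * sigma_bar
    half2 = 1.96 * Sigma ** (-0.5) * n ** (-0.5) * sigma_bar
    return 2 * half1, 2 * half2
```

The ratio of the two lengths is (Σ/Ω)^{1/2}, independent of *n* and σ̄.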

Lastly, we consider the heterogeneity testing. In Figure 5, we compare tests **Ψ**_{1} and
**Ψ**_{2} under different choices of *N*
and *s* ≥ 2. Specifically, Figure 5 (i) compares the nominal levels, while Figure 5 (ii) - (iv) compare the powers under various
alternative hypotheses ${H}_{1}:{\beta}_{0}^{\left(j\right)}-{\beta}_{0}^{\left(k\right)}=\Delta $, where **Δ** = 0.5, 1, 1.5. It is clearly seen
that both tests are consistent, and their powers increase as **Δ**
or *N* increases. In addition, we observe that
**Ψ**_{2} has uniformly larger powers than
**Ψ**_{1}.

In this section, we present the main proofs of Theorems 3.1, 3.3 and 3.4 in the main text.

We start by analyzing the minimization problem (3.2) on each sub-population.
Recall that *m* = (*β*, *f*), so that the objective function in (3.2) can be rewritten as

$$\begin{array}{cc}\hfill \frac{1}{n}\sum _{i\in {L}_{j}}{({Y}_{i}-{\mathit{X}}_{i}^{T}\beta -f\left({Z}_{i}\right))}^{2}+\lambda {\Vert f\Vert}_{\mathcal{H}}^{2}=& \frac{1}{n}\sum _{i\in {L}_{j}}{({Y}_{i}-m\left({U}_{i}\right))}^{2}+{\langle {P}_{\lambda}m,m\rangle}_{\mathcal{A}}\hfill \\ \hfill =& \frac{1}{n}\sum _{i\in {L}_{j}}{({Y}_{i}-{\langle R{U}_{i},m\rangle}_{\mathcal{A}})}^{2}+{\langle {P}_{\lambda}m,m\rangle}_{\mathcal{A}}\hfill \end{array}$$

The first order optimality condition (w.r.t. Fréchet derivative) gives

$$\frac{1}{n}\sum _{i\in {L}_{j}}{R}_{{U}_{i}}({\widehat{m}}^{\left(j\right)}\left({U}_{i}\right)-{Y}_{i})+{P}_{\lambda}{\widehat{m}}^{\left(j\right)}=0,$$

where ${\widehat{m}}^{\left(j\right)}=({\widehat{\beta}}^{\left(j\right)},{\widehat{f}}^{\left(j\right)})$. This implies that

$$-\frac{1}{n}\sum _{i\in {L}_{j}}{R}_{{U}_{i}}{\epsilon}_{i}+\frac{1}{n}\sum _{i\in {L}_{j}}{R}_{{U}_{i}}({\widehat{m}}^{\left(j\right)}\left({U}_{i}\right)-{m}_{0}^{\left(j\right)}\left({U}_{i}\right))+{P}_{\lambda}{\widehat{m}}^{\left(j\right)}=0,$$

where ${m}_{0}^{\left(j\right)}=({\beta}_{0}^{\left(j\right)},{f}_{0})$. Define $\Delta {m}^{\left(j\right)}\u2254{\widehat{m}}^{\left(j\right)}-{m}_{0}^{\left(j\right)}$. Adding ${\mathbb{E}}_{U}\left[{R}_{U}\Delta {m}^{\left(j\right)}\left(U\right)\right]$ on both sides of the above equation, we have

$${\mathbb{E}}_{U}\left[{R}_{U}\Delta {m}^{\left(j\right)}\left(U\right)\right]+{P}_{\lambda}\Delta {m}^{\left(j\right)}=\frac{1}{n}\sum _{i\in {L}_{j}}{R}_{{U}_{i}}{\epsilon}_{i}-{P}_{\lambda}{m}_{0}^{\left(j\right)}-\frac{1}{n}\sum _{i\in {L}_{j}}({R}_{{U}_{i}}\Delta {m}^{\left(j\right)}\left({U}_{i}\right)-{\mathbb{E}}_{U}\left[{R}_{U}\Delta {m}^{\left(j\right)}\left(U\right)\right]).$$

(7.1)

The L.H.S. of (7.1) can be rewritten as

$$\begin{array}{cc}\hfill {\mathbb{E}}_{U}\left[{R}_{U}\Delta {m}^{\left(j\right)}\left(U\right)\right]+{P}_{\lambda}\Delta {m}^{\left(j\right)}=& {\mathbb{E}}_{U}\left[{R}_{U}{\langle {R}_{U},\Delta {m}^{\left(j\right)}\rangle}_{\mathcal{A}}\right]+{P}_{\lambda}\Delta {m}^{\left(j\right)}\hfill \\ \hfill =& ({\mathbb{E}}_{U}[{R}_{U}\otimes {R}_{U}]+{P}_{\lambda})\Delta {m}^{\left(j\right)}\hfill \\ \hfill =& \Delta {m}^{\left(j\right)},\hfill \end{array}$$

where the last equality follows from Proposition 2.2. Then (7.1) becomes

$${\widehat{m}}^{\left(j\right)}-{m}_{0}^{\left(j\right)}=\frac{1}{n}\sum _{i\in {L}_{j}}{R}_{{U}_{i}}{\epsilon}_{i}-{P}_{\lambda}{m}_{0}^{\left(j\right)}-\frac{1}{n}\sum _{i\in {L}_{j}}\left({R}_{{U}_{i}}\Delta {m}^{\left(j\right)}\left({U}_{i}\right)-{\mathbb{E}}_{U}\left[{R}_{U}\Delta {m}^{\left(j\right)}\left(U\right)\right]\right).$$

(7.2)

We will show that the first term in the R.H.S. of (7.2) weakly converges to a
normal distribution, the second term contributes to the estimation bias, and
that the last term is an asymptotically ignorable remainder term. We denote
the last term as $Re{m}^{\left(j\right)}\u2254{\scriptstyle \frac{1}{n}}{\sum}_{i\in {L}_{j}}({R}_{{U}_{i}}\Delta {m}^{\left(j\right)}\left({U}_{i}\right)-{\mathbb{E}}_{U}\left[{R}_{U}\Delta {m}^{\left(j\right)}\left(U\right)\right])$. Recall that ${R}_{u}=({L}_{u},{N}_{u})$, and accordingly decompose $Re{m}^{\left(j\right)}$ into

$$\begin{array}{cc}\hfill Re{m}_{\beta}^{\left(j\right)}\u2254& \frac{1}{n}\sum _{i\in {L}_{j}}\left({L}_{{U}_{i}}\Delta {m}^{\left(j\right)}\left({U}_{i}\right)-{\mathbb{E}}_{U}\left[{L}_{U}\Delta {m}^{\left(j\right)}\left(U\right)\right]\right)\hfill \\ \hfill Re{m}_{f}^{\left(j\right)}\u2254& \frac{1}{n}\sum _{i\in {L}_{j}}\left({N}_{{U}_{i}}\Delta {m}^{\left(j\right)}\left({U}_{i}\right)-{\mathbb{E}}_{U}\left[{N}_{U}\Delta {m}^{\left(j\right)}\left(U\right)\right]\right).\hfill \end{array}$$

Similarly, (7.2) can be rewritten into the following two equations:

$${\widehat{\beta}}^{\left(j\right)}-{\beta}_{0}^{\left(j\right)}=\frac{1}{n}\sum _{i\in {L}_{j}}{L}_{{U}_{i}}{\epsilon}_{i}-{L}_{\lambda}{f}_{0}-Re{m}_{\beta}^{\left(j\right)},$$

(7.3)

and

$${\widehat{f}}^{\left(j\right)}-{f}_{0}=\frac{1}{n}\sum _{i\in {L}_{j}}{N}_{{U}_{i}}{\epsilon}_{i}-{N}_{\lambda}{f}_{0}-Re{m}_{f}^{\left(j\right)},$$

(7.4)

for all *j* = 1,...,*s*. Averaging (7.4) over *j* = 1,...,*s*, and by the definition of $\stackrel{\u2012}{f}$, we have

$$\stackrel{\u2012}{f}-{f}_{0}=\frac{1}{N}\sum _{i=1}^{N}{N}_{{U}_{i}}{\epsilon}_{i}-{N}_{\lambda}{f}_{0}-\frac{1}{s}\sum _{j=1}^{s}Re{m}_{f}^{\left(j\right)},$$

(7.5)

where we used $1\u2215N{\sum}_{i=1}^{N}{N}_{{U}_{i}}{\epsilon}_{i}=1\u2215s{\sum}_{j=1}^{s}1\u2215n{\sum}_{i\in {L}_{j}}{N}_{{U}_{i}}{\epsilon}_{i}$. By (7.5), it follows that

$$\mathbb{E}\left[{\Vert \stackrel{\u2012}{f}-{f}_{0}\Vert}_{\mathcal{C}}^{2}\right]\le 3\mathbb{E}\left[{\Vert \frac{1}{N}\sum _{i=1}^{N}{N}_{{U}_{i}}{\epsilon}_{i}\Vert}_{\mathcal{C}}^{2}\right]+3\Vert {N}_{\lambda}{f}_{0}\Vert {}_{\mathcal{C}}^{2}+3\mathbb{E}\left[{\Vert \frac{1}{s}\sum _{j=1}^{s}Re{m}_{f}^{\left(j\right)}\Vert}_{\mathcal{C}}^{2}\right].$$

(7.6)

By Lemma A.5 and the fact that the ${N}_{{U}_{i}}{\epsilon}_{i}$ are i.i.d., it follows that

$$\mathbb{E}\left[{\Vert \frac{1}{N}\sum _{i=1}^{N}{N}_{{U}_{i}}{\epsilon}_{i}\Vert}_{\mathcal{C}}^{2}\right]=\frac{1}{N}\mathbb{E}\left[{\Vert {N}_{U}\epsilon \Vert}_{\mathcal{C}}^{2}\right]\le {C}_{1}{\sigma}^{2}\frac{1}{Nh},$$

(7.7)

and

$${\Vert {N}_{\lambda}{f}_{0}\Vert}_{\mathcal{C}}^{2}\le 2{\Vert {f}_{0}\Vert}_{\mathcal{H}}^{2}\lambda +{C}_{2}{\lambda}^{2},$$

(7.8)

where *C*_{1} and
*C*_{2} are constants specified in Lemma A.5.

As for the third term in (7.6), we have by independence across sub-populations that

$$\mathbb{E}\left[{\Vert \frac{1}{s}\sum _{j=1}^{s}Re{m}_{f}^{\left(j\right)}\Vert}_{\mathcal{C}}^{2}\right]=\frac{1}{{s}^{2}}\sum _{j=1}^{s}\mathbb{E}\left[{\Vert Re{m}_{f}^{\left(j\right)}\Vert}_{\mathcal{C}}^{2}\right].$$

(7.9)

Therefore it suffices to bound $\mathbb{E}\left[{\Vert Re{m}_{f}^{\left(j\right)}\Vert}_{\mathcal{C}}^{2}\right]$. The following lemma controls this term:

Suppose that Assumptions 3.1, 3.2 and Condition (3.6) hold. Then, for all *j* = 1,...,*s*,

$$\mathbb{E}\left[{\Vert Re{m}^{\left(j\right)}\Vert}_{\mathcal{A}}^{2}\right]\le a(n,s,h,\lambda ,\omega ),$$

for sufficiently large *n*. Moreover, the same inequality holds for $\mathbb{E}\left[{\Vert Re{m}_{\beta}^{\left(j\right)}\Vert}_{2}^{2}\right]$ and $\mathbb{E}\left[{\Vert Re{m}_{f}^{\left(j\right)}\Vert}_{\mathcal{C}}^{2}\right]$.

Combining (7.6) - (7.9) and Lemma 7.1, and by the fact that ${\Vert \stackrel{\u2012}{f}-{f}_{0}\Vert}_{{L}_{2}\left({\mathbb{P}}_{Z}\right)}^{2}\le {\Vert \stackrel{\u2012}{f}-{f}_{0}\Vert}_{\mathcal{C}}^{2}$, we complete the proof of Theorem 3.1.

Recall that ${m}_{0}^{\left(j\right)\ast}=({\beta}_{0}^{\left(j\right)\ast},{f}_{0}^{\ast})=(id-{P}_{\lambda}){m}_{0}^{\left(j\right)}$, where ${m}_{0}^{\left(j\right)}=({\beta}_{0}^{\left(j\right)},{f}_{0})$. This implies that ${\beta}_{0}^{\left(j\right)\ast}={\beta}_{0}^{\left(j\right)}-{L}_{\lambda}{f}_{0}$ and ${f}_{0}^{\ast}={f}_{0}-{N}_{\lambda}{f}_{0}$. By (7.3) and (7.5), for arbitrary *x*, we have

$$\begin{array}{cc}\hfill ({\mathit{x}}^{T},1)\left(\begin{array}{c}\hfill \sqrt{n}({\widehat{\beta}}^{\left(j\right)}-{\beta}_{0}^{\left(j\right)\ast})\hfill \\ \hfill \sqrt{Nh}(\stackrel{\u2012}{f}\left({z}_{0}\right)-{f}_{0}^{\ast}\left({z}_{0}\right))\hfill \end{array}\right)=& \sqrt{n}{\mathit{x}}^{T}({\widehat{\beta}}^{\left(j\right)}-{\beta}_{0}^{\left(j\right)\ast})+\sqrt{Nh}(\stackrel{\u2012}{f}\left({z}_{0}\right)-{f}_{0}^{\ast}\left({z}_{0}\right))\hfill \\ \hfill =& \underset{\left(I\right)}{\underbrace{\frac{1}{\sqrt{n}}\sum _{i\in {L}_{j}}{\mathit{x}}^{T}{L}_{{U}_{i}}{\epsilon}_{i}+\frac{1}{\sqrt{N}}\sum _{i=1}^{N}{h}^{1\u22152}{N}_{{U}_{i}}\left({z}_{0}\right){\epsilon}_{i}}}-\underset{\left(II\right)}{\underbrace{\sqrt{n}{\mathit{x}}^{T}Re{m}_{\beta}^{\left(j\right)}+\sqrt{Nh}{s}^{-1}\sum _{j=1}^{s}Re{m}_{f}^{\left(j\right)}\left({z}_{0}\right)}}.\hfill \end{array}$$

In what follows, we will show that the main term (I) is asymptotically normal and that the remainder term (II) is of order *o*_{P}(1).

We present the result on the asymptotic normality of (I) in the following lemma, and defer its proof to the supplementary material.

Suppose that, as $N\to \infty$, $h{\Vert {\stackrel{~}{K}}_{{z}_{0}}\Vert}_{{L}_{2}\left({\mathbb{P}}_{Z}\right)}^{2}\to {\sigma}_{{z}_{0}}^{2}$, ${h}^{1\u22152}\left({W}_{\lambda}\mathit{A}\right)\left({z}_{0}\right)\to {\alpha}_{{z}_{0}}\in {\mathbb{R}}^{p}$ and ${h}^{1\u22152}\mathit{A}\left({z}_{0}\right)\to -{\gamma}_{{z}_{0}}\in {\mathbb{R}}^{p}$. Then:

- (i) If *s* → ∞, then $$\left(I\right)\u21ddN(0,{\sigma}^{2}({\mathit{x}}^{T}{\Omega}^{-1}\mathit{x}+{\Sigma}_{22})).$$ (7.10)
- (ii) If *s* is fixed, then $$\left(I\right)\u21ddN(0,{\sigma}^{2}({\mathit{x}}^{T}{\Omega}^{-1}\mathit{x}+{\Sigma}_{22}+2{s}^{-1\u22152}{\mathit{x}}^{T}{\Sigma}_{12})).$$ (7.11)

We now turn to bound the remainder term (II). We need the following lemma:

Suppose that Assumptions 3.1, 3.2 and Condition (3.6) hold. The following two sets of results control the remainder terms:

- (i) For all *j* = 1,...,*s*, $${\Vert Re{m}^{\left(j\right)}\Vert}_{\mathcal{A}}={o}_{P}\left({b}_{n,s}\right),$$ where ${b}_{n,s}=C{h}^{-1}{n}^{-1\u22152}{r}_{n,s}(\omega (\mathcal{F},1)+\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}N)$ and ${r}_{n,s}={\left(\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}N\right)}^{2}{\left(nh\right)}^{-1\u22152}+{\lambda}^{1\u22152}$. Also, ${\Vert Re{m}_{\beta}^{\left(j\right)}\Vert}_{2}={o}_{P}\left({b}_{n,s}\right)$ and ${\Vert Re{m}_{f}^{\left(j\right)}\Vert}_{\mathcal{C}}={o}_{P}\left({b}_{n,s}\right)$.
- (ii) Moreover, we have $$\begin{array}{cc}\hfill {\Vert \frac{1}{s}\sum _{j=1}^{s}Re{m}^{\left(j\right)}\Vert}_{\mathcal{A}}=& {o}_{P}\left({s}^{-1\u22152}{b}_{n,s}\phantom{\rule{thinmathspace}{0ex}}\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}N\right)\hfill \\ \phantom{\rule{0ex}{0ex}}\hfill {\Vert \frac{1}{s}\sum _{j=1}^{s}Re{m}_{\beta}^{\left(j\right)}\Vert}_{2}=& {o}_{P}\left({s}^{-1\u22152}{b}_{n,s}\phantom{\rule{thinmathspace}{0ex}}\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}N\right)\hfill \\ \phantom{\rule{0ex}{0ex}}\hfill {\Vert \frac{1}{s}\sum _{j=1}^{s}Re{m}_{f}^{\left(j\right)}\Vert}_{\mathcal{C}}=& {o}_{P}\left({s}^{-1\u22152}{b}_{n,s}\phantom{\rule{thinmathspace}{0ex}}\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}N\right).\hfill \end{array}$$

By Lemma 7.3, we have

$$\sqrt{n}\mid {\mathit{x}}^{T}Re{m}_{\beta}^{\left(j\right)}\mid \le \sqrt{n}{\Vert \mathit{x}\Vert}_{2}{\Vert Re{m}_{\beta}^{\left(j\right)}\Vert}_{2}={o}_{P}\left({n}^{1\u22152}{b}_{n,s}\right)={o}_{P}\left(\sqrt{N}{s}^{-1\u22152}{b}_{n,s}\right),$$

(7.12)

where we used the boundedness of
* x*. Also,

$$\begin{array}{cc}\hfill \sqrt{Nh}\mid {s}^{-1}\sum _{j=1}^{s}Re{m}_{f}^{\left(j\right)}\left({z}_{0}\right)\mid & \le \sqrt{Nh}{\Vert {\stackrel{~}{K}}_{{z}_{0}}\Vert}_{\mathcal{C}}{\Vert {s}^{-1}\sum _{j=1}^{s}Re{m}_{f}^{\left(j\right)}\Vert}_{\mathcal{C}}\hfill \\ \hfill & \lesssim \sqrt{N}{\Vert {s}^{-1}\sum _{j=1}^{s}Re{m}_{f}^{\left(j\right)}\Vert}_{\mathcal{C}}\hfill \\ \hfill & ={o}_{P}\left(\sqrt{N}{s}^{-1\u22152}{b}_{n,s}\phantom{\rule{thinmathspace}{0ex}}\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}N\right),\hfill \end{array}$$

(7.13)

where the second inequality follows from Lemma A.4. Therefore by (7.12) and (7.13), we have

$$\left(II\right)={o}_{P}\left(\sqrt{N}{s}^{-1\u22152}{b}_{n,s}\phantom{\rule{thinmathspace}{0ex}}\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}N\right).$$

(7.14)

Now by the definition of *b*_{n,s} and Condition (3.7), we have (II) = *o*_{P}(1). Combining (7.10) and (7.14), it follows that if *s* → ∞, then

$$({\mathit{x}}^{T},1)\left(\begin{array}{c}\hfill \sqrt{n}({\widehat{\beta}}^{\left(j\right)}-{\beta}_{0}^{\left(j\right)\ast})\hfill \\ \hfill \sqrt{Nh}(\stackrel{\u2012}{f}\left({z}_{0}\right)-{f}_{0}^{\ast}\left({z}_{0}\right))\hfill \end{array}\right)\u21ddN(0,{\sigma}^{2}({\mathit{x}}^{T}{\Omega}^{-1}\mathit{x}+{\Sigma}_{22})).$$

Combining (7.11) and (7.14), it follows that if *s* is fixed, then

$$({\mathit{x}}^{T},1)\left(\begin{array}{c}\hfill \sqrt{n}({\widehat{\beta}}^{\left(j\right)}-{\beta}_{0}^{\left(j\right)\ast})\hfill \\ \hfill \sqrt{Nh}(\stackrel{\u2012}{f}\left({z}_{0}\right)-{f}_{0}^{\ast}\left({z}_{0}\right))\hfill \end{array}\right)\u21ddN(0,{\sigma}^{2}({\mathit{x}}^{T}{\Omega}^{-1}\mathit{x}+{\Sigma}_{22}+2{s}^{-1\u22152}{\mathit{x}}^{T}{\Sigma}_{12})).$$

By the arbitrariness of *x*, we reach the conclusion of the theorem via the Cramér–Wold device.

(i) We first derive the bound on ${\Vert Re{m}^{\left(j\right)}\Vert}_{\mathcal{A}}$. Recall

$$Re{m}^{\left(j\right)}=\frac{1}{n}\sum _{i\in {L}_{j}}\Delta {m}^{\left(j\right)}\left({U}_{i}\right){R}_{{U}_{i}}-{\mathbb{E}}_{U}\left[\Delta {m}^{\left(j\right)}\left(U\right){R}_{U}\right].$$

Let ${Z}_{n}\left(m\right)={c}_{r}^{-1}{h}^{1\u22152}{n}^{-1\u22152}{\sum}_{i\in {L}_{j}}\{m\left({U}_{i}\right){R}_{{U}_{i}}-\mathbb{E}\left[m\left(U\right){R}_{U}\right]\}$, where *c*_{r} is the constant specified in Lemma A.4. Note that

$${\Vert g({U}_{i},{m}_{1})-g({U}_{i},{m}_{2})\Vert}_{\mathcal{A}}\le {c}_{r}^{-1}\sqrt{nh}\left\{{\Vert ({m}_{1}\left({U}_{i}\right)-{m}_{2}\left({U}_{i}\right)){R}_{{U}_{i}}\Vert}_{\mathcal{A}}\phantom{\rule{0ex}{0ex}}+{\Vert \mathbb{E}\left[({m}_{1}\left(U\right)-{m}_{2}\left(U\right)){R}_{U}\right]\Vert}_{\mathcal{A}}\right\}\phantom{\rule{0ex}{0ex}}\le 2\sqrt{n}{\Vert {m}_{1}-{m}_{2}\Vert}_{\mathrm{sup}},$$

where we used the fact that ${\Vert {R}_{u}\Vert}_{\mathcal{A}}\le {c}_{r}{h}^{-1\u22152}$ by Lemma A.4. Note that ${Z}_{n}\left(m\right)={\scriptstyle \frac{1}{n}}{\sum}_{i\in {L}_{j}}g({U}_{i},m)$. Therefore by Lemma G.1, we have for any
*t* > 0,

$$\begin{array}{cc}\hfill \mathbb{P}\left({\Vert {Z}_{n}\left({m}_{1}\right)-{Z}_{n}\left({m}_{2}\right)\Vert}_{\mathcal{A}}\ge t\right)& =\mathbb{P}\left({\Vert \frac{1}{n}\sum _{i=1}^{n}\{g({U}_{i},{m}_{1})-g({U}_{i},{m}_{2})\}\Vert}_{\mathcal{A}}\ge t\right)\hfill \\ \hfill & \le 2\phantom{\rule{thinmathspace}{0ex}}\mathrm{exp}\left(-\frac{{t}^{2}}{8{\Vert {m}_{1}-{m}_{2}\Vert}_{\mathrm{sup}}^{2}}\right)\hfill \end{array}$$

(7.15)

Then by Lemma F.1, we have

$$\mathbb{P}\left(\underset{m\in \mathcal{F}}{\mathrm{sup}}{\Vert {Z}_{n}\left(m\right)\Vert}_{\mathcal{A}}\ge C\omega (\mathcal{F},\mathrm{diam}\left(\mathcal{F}\right))+x\right)\le C\phantom{\rule{thinmathspace}{0ex}}\mathrm{exp}\left(\frac{-{x}^{2}}{C\phantom{\rule{thinmathspace}{0ex}}\mathrm{diam}{\left(\mathcal{F}\right)}^{2}}\right),$$

(7.16)

where $\mathrm{diam}\left(\mathcal{F}\right)={\mathrm{sup}}_{{m}_{1},{m}_{2}\in \mathcal{F}}{\Vert {m}_{1}-{m}_{2}\Vert}_{\mathrm{sup}}$.

Define ${d}_{n,s}\u2254{c}_{r}{h}^{-1\u22152}{r}_{n,s}$ and $\stackrel{~}{m}\u2254\Delta {m}^{\left(j\right)}\u2215\left(2{d}_{n,s}\right)$. On the event $\mathcal{E}\u2254\left\{{\Vert \Delta {m}^{\left(j\right)}\Vert}_{\mathcal{A}}\le {r}_{n,s}\right\}$, we have

$${\Vert \stackrel{~}{m}\Vert}_{\mathrm{sup}}\le {c}_{r}{h}^{-1\u22152}{\left(2{d}_{n,s}\right)}^{-1}{\Vert \Delta {m}^{\left(j\right)}\Vert}_{\mathcal{A}}\le 1\u22152,$$

where we used the fact that ${\Vert \stackrel{~}{m}\Vert}_{\mathrm{sup}}\le {c}_{r}{h}^{-1\u22152}{\Vert \stackrel{~}{m}\Vert}_{\mathcal{A}}$ by Lemma A.4. This implies $\mid {\mathit{x}}^{T}\stackrel{~}{\beta}+\stackrel{~}{f}\left(z\right)\mid \le 1\u22152$ for any $(\mathit{x},z)$. Moreover,

$${\Vert \stackrel{~}{f}\Vert}_{\mathcal{H}}\le {\lambda}^{-1\u22152}{\Vert \stackrel{~}{m}\Vert}_{\mathcal{A}}\le {\lambda}^{-1\u22152}\u2215\left(2{d}_{n,s}\right){\Vert \Delta {m}^{\left(j\right)}\Vert}_{\mathcal{A}}\le {c}_{r}^{-1}{h}^{1\u22152}{\lambda}^{-1\u22152}$$

by the definition of ${\Vert \cdot \Vert}_{\mathcal{A}}$. Hence, we have shown that $\mathcal{E}\subset \{\stackrel{~}{m}\in \mathcal{F}\}$. Combining this fact with (7.16), and noting that $\mathrm{diam}\left(\mathcal{F}\right)=1$, we have

$$\mathbb{P}\left(\{{\Vert {Z}_{n}\left(\stackrel{~}{m}\right)\Vert}_{\mathcal{A}}\ge C\omega (\mathcal{F},1)+x\}\cap \mathcal{E}\right)\le C\phantom{\rule{thinmathspace}{0ex}}\mathrm{exp}(-{x}^{2}\u2215C),$$

(7.17)

By the definition of $\stackrel{~}{m}$ and the relationship $Re{m}^{\left(j\right)}={c}_{r}^{-1}\sqrt{nh}{Z}_{n}\left(\Delta {m}^{\left(j\right)}\right)$, we calculate that ${Z}_{n}\left(\stackrel{~}{m}\right)=(1\u22152){h}^{1\u22152}{n}^{1\u22152}{d}_{n,s}^{-1}Re{m}^{\left(j\right)}=(1\u22152){c}_{r}^{-1}h{n}^{1\u22152}{r}_{n,s}^{-1}Re{m}^{\left(j\right)}$. Plugging this form of ${Z}_{n}\left(\stackrel{~}{m}\right)$ into (7.17) and letting $x=\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}N$, we obtain

$$\mathbb{P}\left(\{{\Vert Re{m}^{\left(j\right)}\Vert}_{\mathcal{A}}\ge {b}_{n,s}\}\cap \mathcal{E}\right)\le C\phantom{\rule{thinmathspace}{0ex}}\mathrm{exp}(-{\mathrm{log}}^{2}\phantom{\rule{thinmathspace}{0ex}}N\u2215C),$$

(7.18)

where we used the definition that ${b}_{n,s}=C{h}^{-1}{n}^{-1\u22152}{r}_{n,s}(\omega (\mathcal{F},1)+\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}N)$. Therefore we have

$$\begin{array}{cc}\hfill \mathbb{P}\left({\Vert Re{m}^{\left(j\right)}\Vert}_{\mathcal{A}}\ge {b}_{n,s}\right)& \le \mathbb{P}\left(\{{\Vert Re{m}^{\left(j\right)}\Vert}_{\mathcal{A}}\ge {b}_{n,s}\}\cap \mathcal{E}\right)+\mathbb{P}\left({\mathcal{E}}^{c}\right)\hfill \\ \hfill & \le C\phantom{\rule{thinmathspace}{0ex}}\mathrm{exp}(-{\mathrm{log}}^{2}\phantom{\rule{thinmathspace}{0ex}}N\u2215C)+\mathbb{P}\left({\mathcal{E}}^{c}\right).\hfill \end{array}$$

(7.19)

We have the following lemma that controls $\mathbb{P}\left({\mathcal{E}}^{c}\right)$.

If Assumptions 3.1, 3.2 and Condition (3.6) are satisfied, then there exists a constant *c* such that

$$\mathbb{P}\left({\mathcal{E}}^{c}\right)=\mathbb{P}\left({\Vert \Delta {m}^{\left(j\right)}\Vert}_{\mathcal{A}}\ge {r}_{n,s}\right)\lesssim n\phantom{\rule{thinmathspace}{0ex}}\mathrm{exp}(-c\phantom{\rule{thinmathspace}{0ex}}{\mathrm{log}}^{2}N)$$

for all *j* = 1,...,*s*.

By Lemma 7.4 and (7.19) we have

$$\mathbb{P}\left({\Vert Re{m}^{\left(j\right)}\Vert}_{\mathcal{A}}\ge {b}_{n,s}\right)\lesssim n\phantom{\rule{thinmathspace}{0ex}}\mathrm{exp}(-c\phantom{\rule{thinmathspace}{0ex}}{\mathrm{log}}^{2}\phantom{\rule{thinmathspace}{0ex}}N).$$

(7.20)

(ii) We will use an Azuma-type inequality in Hilbert space to
control the averaging remainder term ${s}^{-1}{\sum}_{j=1}^{s}Re{m}^{\left(j\right)}$, as all
*Rem*^{(j)} are independent
and have zero mean. Define the event ${\mathcal{A}}_{j}=\{{\Vert Re{m}^{\left(j\right)}\Vert}_{\mathcal{A}}\le {b}_{n,s}\}$. By Lemma G.1, we have

$$\mathbb{P}\left(\left\{{\cap}_{j}{\mathcal{A}}_{j}\right\}\cap \left\{{\Vert {s}^{-1}\sum _{j=1}^{s}Re{m}^{\left(j\right)}\Vert}_{\mathcal{A}}>{s}^{-1\u22152}{b}_{n,s}\phantom{\rule{thinmathspace}{0ex}}\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}N\right\}\right)\le 2\phantom{\rule{thinmathspace}{0ex}}\mathrm{exp}(-{\mathrm{log}}^{2}\phantom{\rule{thinmathspace}{0ex}}N\u22152).$$

(7.21)

Moreover, by (7.20),

$$\mathbb{P}\left({\mathcal{A}}_{j}^{c}\right)\lesssim n\phantom{\rule{thinmathspace}{0ex}}\mathrm{exp}(-c\phantom{\rule{thinmathspace}{0ex}}{\mathrm{log}}^{2}\phantom{\rule{thinmathspace}{0ex}}N).$$

(7.22)

Hence it follows that

$$\mathbb{P}\left({\Vert {s}^{-1}\sum _{j=1}^{s}Re{m}^{\left(j\right)}\Vert}_{\mathcal{A}}>{s}^{-1\u22152}{b}_{n,s}\phantom{\rule{thinmathspace}{0ex}}\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}N\right)\le \mathbb{P}\left(\left\{{\cap}_{j=1}^{s}{\mathcal{A}}_{j}\right\}\cap \left\{{\Vert {s}^{-1}\sum _{j=1}^{s}Re{m}^{\left(j\right)}\Vert}_{\mathcal{A}}>{s}^{-1\u22152}{b}_{n,s}\phantom{\rule{thinmathspace}{0ex}}\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}N\right\}\right)+\mathbb{P}\left({\cup}_{j}{\mathcal{A}}_{j}^{c}\right)\to 0$$

as *N* → ∞, where the convergence follows from (7.21), (7.22) and the union bound. This finishes the proof.

We can apply similar arguments as above to bound ${\Vert Re{m}_{f}^{\left(j\right)}\Vert}_{\mathcal{C}}$ and ${\Vert 1\u2215s{\sum}_{j=1}^{s}Re{m}_{f}^{\left(j\right)}\Vert}_{\mathcal{C}}$, by changing $\omega (\mathcal{F},1)$ to $\omega ({\mathcal{F}}_{2},1)$, which is dominated by $\omega (\mathcal{F},1)$. The bounds on ${\Vert Re{m}_{\beta}^{\left(j\right)}\Vert}_{2}$ and ${\Vert 1\u2215s{\sum}_{j=1}^{s}Re{m}_{\beta}^{\left(j\right)}\Vert}_{2}$ then follow from the triangle inequality.

In view of Theorem 3.4, we first prove

$$\left(\begin{array}{c}\hfill \sqrt{n}({\beta}_{0}^{\left(j\right)\ast}-{\beta}_{0}^{\left(j\right)})\hfill \\ \hfill \sqrt{Nh}({f}_{0}^{\ast}\left({z}_{0}\right)-{f}_{0}\left({z}_{0}\right)-{W}_{\lambda}{f}_{0}\left({z}_{0}\right))\hfill \end{array}\right)\to 0$$

(7.23)

for both (i) and (ii). By Proposition 2.3, we have

$$\left(\begin{array}{c}\hfill {\beta}_{0}^{\left(j\right)\ast}-{\beta}_{0}^{\left(j\right)}\hfill \\ \hfill {f}_{0}^{\ast}\left({z}_{0}\right)-{f}_{0}\left({z}_{0}\right)\hfill \end{array}\right)=\left(\begin{array}{c}\hfill {L}_{\lambda}{f}_{0}\hfill \\ \hfill {W}_{\lambda}{f}_{0}\left({z}_{0}\right)+\mathit{A}{\left({z}_{0}\right)}^{T}{L}_{\lambda}{f}_{0}\hfill \end{array}\right).$$

(7.24)

By Lemma A.5, it follows that under Assumption 3.3, ${\Vert {L}_{\lambda}{f}_{0}\Vert}_{2}\lesssim \lambda $. Now we turn to ${f}_{0}^{\ast}\left({z}_{0}\right)-{f}_{0}\left({z}_{0}\right)$. Observe that

$$\mathit{A}\left(z\right)={\langle \mathit{A},{\stackrel{~}{K}}_{z}\rangle}_{\mathcal{C}}={\langle \mathit{B},{\stackrel{~}{K}}_{z}\rangle}_{{L}_{2}\left({\mathbb{P}}_{Z}\right)}=\sum _{\ell =1}^{\infty}\frac{{\langle \mathit{B},{\varphi}_{\ell}\rangle}_{{L}_{2}\left({\mathbb{P}}_{Z}\right)}}{1+\lambda \u2215{\mu}_{\ell}}{\varphi}_{\ell}\left(z\right),$$

(7.25)

Applying Cauchy-Schwarz, we obtain

$$\begin{array}{cc}\hfill {A}_{k}{\left({z}_{0}\right)}^{2}& \le \left(\sum _{\ell =1}^{\infty}\frac{{\langle {\mathit{B}}_{k},{\varphi}_{\ell}\rangle}_{{L}_{2}\left({\mathbb{P}}_{Z}\right)}^{2}}{{\mu}_{\ell}}{\varphi}_{\ell}^{2}\left({z}_{0}\right)\right)\left(\sum _{\ell =1}^{\infty}\frac{{\mu}_{\ell}}{{(1+\lambda \u2215{\mu}_{\ell})}^{2}}\right)\hfill \\ \hfill & \le {c}_{\varphi}^{2}{\Vert {B}_{k}\Vert}_{\mathcal{H}}^{2}\mathrm{Tr}\left(K\right),\hfill \end{array}$$

where the last inequality follows from the uniform boundedness of ${\varphi}_{\ell}$. Hence ${A}_{k}\left({z}_{0}\right)=O\left(1\right)$ for each *k*, which implies

$$\mathit{A}{\left({z}_{0}\right)}^{T}{L}_{\lambda}{f}_{0}\le {\Vert \mathit{A}\left({z}_{0}\right)\Vert}_{2}{\Vert {L}_{\lambda}{f}_{0}\Vert}_{2}\lesssim \lambda .$$

Therefore, if we choose $\lambda =o({\left(Nh\right)}^{-1\u22152}\wedge {n}^{-1\u22152})$, then we get (7.23), which eliminates the estimation bias for ${\beta}_{0}^{\left(j\right)}$.

Now we consider the asymptotic variance for cases (i) and (ii). It suffices to show that ${\alpha}_{{z}_{0}}={\mathrm{lim}}_{N\to \infty}{h}^{1\u22152}{W}_{\lambda}\mathit{A}\left({z}_{0}\right)=0$. By Lemma A.2 and (7.25), we have

$$\begin{array}{cc}\hfill {W}_{\lambda}{A}_{k}\left({z}_{0}\right)=& \sum _{\ell =1}^{\infty}\frac{{\langle {B}_{k},{\varphi}_{\ell}\rangle}_{{L}_{2}\left({\mathbb{P}}_{Z}\right)}}{1+\lambda \u2215{\mu}_{\ell}}\frac{\lambda}{\lambda +{\mu}_{\ell}}{\varphi}_{\ell}\left({z}_{0}\right),\hfill \\ \hfill {\left({W}_{\lambda}{A}_{k}\left({z}_{0}\right)\right)}^{2}\le & \left(\sum _{\ell =1}^{\infty}\frac{{\langle {B}_{k},{\varphi}_{\ell}\rangle}_{{L}_{2}\left({\mathbb{P}}_{Z}\right)}^{2}}{{\mu}_{\ell}}{\varphi}_{\ell}^{2}\left({z}_{0}\right)\right)\left(\sum _{\ell =1}^{\infty}\frac{{\mu}_{\ell}}{{(1+\lambda \u2215{\mu}_{\ell})}^{2}}\right)\le {c}_{\varphi}^{2}{\Vert {B}_{k}\Vert}_{\mathcal{H}}^{2}\mathrm{Tr}\left(K\right).\hfill \end{array}$$

Hence, by the dominated convergence theorem, as *λ* → 0 we have ${W}_{\lambda}{A}_{k}\left({z}_{0}\right)\to 0$. As *h* = *O*(1), it follows that ${\alpha}_{{z}_{0}}={\mathrm{lim}}_{N\to \infty}{h}^{1\u22152}{W}_{\lambda}\mathit{A}\left({z}_{0}\right)=0$.

When *h* → 0, we have ${\gamma}_{{z}_{0}}=-{\mathrm{lim}}_{N\to \infty}{h}^{1\u22152}\mathit{A}\left({z}_{0}\right)=0$, since ${A}_{k}\left({z}_{0}\right)=O\left(1\right)$ as shown above.

Research sponsored by NSF IIS1408910, NSF IIS1332109, NIH R01MH102339, NIH R01GM083084, and NIH R01HG06841.

Research Sponsored by NSF CAREER Award DMS-1151692, DMS-1418042, Simons Foundation 305266. Guang Cheng was on sabbatical at Princeton while part of this work was carried out; he would like to thank the Princeton ORFE department for its hospitality.

*AMS 2000 subject classifications:* Primary 62G20,
62F25; secondary 62F10, 62F12

^{1}The commonality estimator ${\stackrel{\u2012}{f}}_{N,\lambda}$ can be adjusted to a
weighted sum ${\sum}_{j=1}^{s}({n}_{j}\u2215N){\widehat{f}}_{n,\lambda}^{\left(j\right)}$ if sub-sample sizes are different. In particular, the
divide-and-conquer method can be applied to those sub-populations with huge
sample sizes; see Section 5.

SUPPLEMENTARY MATERIAL

Supplementary material for: A Partially Linear Framework for Massive Heterogeneous Data (DOI: To Be Assigned; .pdf). We provide the detailed proofs in the supplement.

- Aitkin M, Rubin DB. Estimation and hypothesis testing in finite mixture models. Journal of the Royal Statistical Society. Series B (Methodological) 1985:67–75.
- Bach F. Sharp analysis of low-rank kernel matrix approximations. arXiv preprint arXiv:1208.2015; 2012.
- Berlinet A, Thomas-Agnan C. Reproducing kernel Hilbert spaces in probability and statistics. Vol. 3. Springer; 2004.
- Birman MŠ, Solomjak M. Piecewise-polynomial approximations of functions of the classes $W_p^{\alpha}$. Sbornik: Mathematics. 1967;2:295–317.
- Carl B, Triebel H. Inequalities between eigenvalues, entropy numbers, and related quantities of compact operators in Banach spaces. Mathematische Annalen. 1980;251:129–133.
- Chen X, Xie M. Tech. rep., Technical Report 2012-01. Dept. Statistics, Rutgers Univ; 2012. A split-and-conquer approach for analysis of extraordinarily large data.
- Cheng G, Shang Z. Joint asymptotics for semi-nonparametric models under penalization. arXiv preprint arXiv:1311.2628. 2013.
- Chernozhukov V, Chetverikov D, Kato K, et al. Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. The Annals of Statistics. 2013;41:2786–2819.
- Fan J, Zhang W. Statistical estimation in varying coefficient models. Annals of Statistics. 1999:1491–1518.
- Figueiredo MA, Jain AK. Unsupervised learning of finite mixture models. Pattern Analysis and Machine Intelligence, IEEE Transactions on. 2002;24:381–396.
- Gu C. Smoothing spline ANOVA models. Vol. 297. Springer; 2013.
- Guo W. Inference in smoothing spline analysis of variance. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2002;64:887–898.
- Hastie T, Tibshirani R. Varying-coefficient models. Journal of the Royal Statistical Society. Series B (Methodological) 1993:757–796.
- Huang J, Zhang T. The benefit of group sparsity. The Annals of Statistics. 2010;38:1978–2004.
- Kleiner A, Talwalkar A, Sarkar P, Jordan M. The big data bootstrap. arXiv preprint arXiv:1206.6415. 2012.
- Kosorok MR. Introduction to empirical processes and semiparametric inference. Springer; 2007.
- Krasikov I. New bounds on the Hermite polynomials. arXiv preprint math/0401310. 2004.
- Lafferty JD, Lebanon G. Diffusion kernels on statistical manifolds. Journal of Machine Learning Research. 2005;6:129–163.
- Mammen E, van de Geer S. Penalized quasi-likelihood estimation in partial linear models. The Annals of Statistics. 1997:1014–1035.
- McDonald R, Hall K, Mann G. Distributed training strategies for the structured perceptron. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics; 2010.
- McLachlan G, Peel D. Finite mixture models. John Wiley & Sons; 2004.
- Meinshausen N, Bühlmann P. Maximin effects in inhomogeneous large-scale data. arXiv preprint arXiv:1406.0596. 2014.
- Mendelson S. Geometric parameters of kernel machines. In: Computational Learning Theory. Springer; 2002.
- Nardi Y, Rinaldo A. On the asymptotic properties of the group lasso estimator for linear models. Electronic Journal of Statistics. 2008;2:605–633.
- Obozinski G, Wainwright MJ, Jordan MI. Union support recovery in high-dimensional multivariate regression.. Communication, Control, and Computing, 2008 46th Annual Allerton Conference on; IEEE; 2008.
- Pinelis I. Optimum bounds for the distributions of martingales in Banach spaces. The Annals of Probability. 1994:1679–1706.
- Raskutti G, Wainwright MJ, Yu B. Early stopping and non-parametric regression: an optimal data-dependent stopping rule. The Journal of Machine Learning Research. 2014;15:335–366.
- Shang Z, Cheng G. Local and global asymptotic inference in smoothing spline models. The Annals of Statistics. 2013;41:2608–2638.
- Shawe-Taylor J, Cristianini N. Kernel methods for pattern analysis. Cambridge university press; 2004.
- Sollich P, Williams CK. Understanding Gaussian process regression using the equivalent kernel. In: Deterministic and Statistical Methods in Machine Learning. Springer; 2005. pp. 211–228.
- Städler N, Bühlmann P, van de Geer S. ℓ_{1}-penalization for mixture regression models. Test. 2010;19:209–256.
- Steinwart I, Hush DR, Scovel C, et al. Optimal rates for regularized least squares regression. COLT; 2009.
- Stewart GW, Sun J-G. Matrix Perturbation Theory. Academic Press; 1990.
- Stone CJ. Additive regression and other nonparametric models. The Annals of Statistics. 1985:689–705.
- Tropp JA. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics. 2012;12:389–434.
- Tsybakov AB, Zaiats V. Introduction to nonparametric estimation. Vol. 11. Springer; 2009.
- Van der Vaart AW, Wellner JA. Weak Convergence and Empirical Processes. Springer; 1996.
- van Handel R. Probability in high dimension: Lecture notes. 2014
- Wahba G. Spline models for observational data. Vol. 59. Siam; 1990.
- Wang X, Dunson DB. Parallel MCMC via Weierstrass sampler. arXiv preprint arXiv:1312.4605. 2013.
- Wang Y. Smoothing splines: methods and applications. CRC Press; 2011.
- Wasserman L. Stein's method and the bootstrap in low and high dimensions: A tutorial. 2014.
- Williamson RC, Smola AJ, Scholkopf B. Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators. Information Theory, IEEE Transactions on. 2001;47:2516–2532.
- Yang Y, Barron A. Information-theoretic determination of minimax rates of convergence. Annals of Statistics. 1999:1564–1599.
- Zhang T. Learning bounds for kernel regression using effective data dimensionality. Neural Computation. 2005;17:2077–2098. [PubMed]
- Zhang Y, Duchi J, Wainwright M. Divide and conquer kernel ridge regression. Conference on Learning Theory. 2013
