PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of nihpaAbout Author manuscriptsSubmit a manuscriptHHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
 
Ann Stat. Author manuscript; available in PMC 2017 April 18.
Published in final edited form as:
Ann Stat. 2016 August; 44(4): 1400–1437.
Published online 2016 July 7. doi:  10.1214/15-AOS1410
PMCID: PMC5394596
NIHMSID: NIHMS751904

A PARTIALLY LINEAR FRAMEWORK FOR MASSIVE HETEROGENEOUS DATA

Tianqi Zhao, Guang Cheng, Associate Professor,§ and Han Liu

Abstract

We consider a partially linear framework for modelling massive heterogeneous data. The major goal is to extract common features across all sub-populations while exploring heterogeneity of each sub-population. In particular, we propose an aggregation type estimator for the commonality parameter that possesses the (non-asymptotic) minimax optimal bound and asymptotic distribution as if there were no heterogeneity. This oracular result holds when the number of sub-populations does not grow too fast. A plug-in estimator for the heterogeneity parameter is further constructed, and shown to possess the asymptotic distribution as if the commonality information were available. We also test the heterogeneity among a large number of sub-populations. All the above results require to regularize each sub-estimation as though it had the entire sample size. Our general theory applies to the divide-and-conquer approach that is often used to deal with massive homogeneous data. A technical by-product of this paper is the statistical inferences for the general kernel ridge regression. Thorough numerical results are also provided to back up our theory.

Keywords: partially linear model, heterogenous data, massive data, reproducing kernel Hilbert space, joint asymptotics, mean square error, bias propagation

1. Introduction

In this paper, we propose a partially linear regression framework for modelling massive heterogeneous data. Let {(Yi,Xi,Zi)}i=1N be samples from an underlying distribution that may change with N. We assume that there exist s independent sub-populations, and the data from the jth sub-population follows a partially linear model:

Y=XTβ0(j)+f0(Z)+ε,
(1.1)

where ε has zero mean and known variance σ2. In the above model, Y depends on X through a linear function that may vary across all sub-populations, and depends on Z through a nonlinear function that is common to all sub-populations. The possibly different values of β0(j) are viewed as the source of heterogeneity. We also point out that the sample sizes in some sub-populations could be extremely high in reality. Note that (1.1) is a typical “semi-nonparametric” model (Cheng and Shang, 2013) since we want to infer both commonality and heterogeneity components throughout the paper.

The model (1.1) is motivated by the following scenario: different labs conduct the same experiment on the relationship between a response variable Y (e.g., heart disease) and a set of predictors Z, X1, X2,...,Xp. It is known from biological knowledge that the dependence structure between Y and Z (e.g., blood pressure) should be homogeneous for all human. However, for the other covariates (e.g., certain genes), we allow their (linear) relations with Y to potentially vary in different labs. For example, the genetic functionality of different races might be heterogenous. The linear relation is assumed here for simplicity, and particularly suitable when the covariates are discrete.

Statistical modelling for massive data has attracted a flurry of recent research. For homogeneous data, the statistical studies of the divide-and-conquer method currently focus on either parametric inferences, e.g., Bag of Little Bootstraps (Kleiner et al., 2012), and parallel MCMC computing (Wang and Dunson, 2013), or nonparametric minimaxity (Zhang et al., 2013). The other relevant work includes high dimensional linear models with variable selection (Chen and Xie, 2012) and structured perceptron (McDonald et al., 2010). Heterogenous data are often handled by fitting mixture models (Aitkin and Rubin, 1985; McLachlan and Peel, 2004; Figueiredo and Jain, 2002), time varying coefficient models (Hastie and Tibshirani, 1993; Fan and Zhang, 1999) or multitask regression (Huang and Zhang, 2010; Nardi and Rinaldo, 2008; Obozinski et al., 2008). The recent high dimensional work includes Stäadler et al. (2010); Meinshausen and Bühlmann (2014). However, as far as we are aware, semi-nonparametric inference for massive homogeneous/heterogeneous data still remains untouched.

In this paper, our primary goal is to extract common features across all sub-populations while exploring heterogeneity of each sub-population. Specifically, we employ a simple aggregation procedure, which averages commonality estimators across all sub-populations, and then construct a plug-in estimator for each heterogeneity parameter based on the combined estimator for commonality. The secondary goal is to apply the divide-and-conquer method to the sub-population having a huge sample size that is unable to be processed in one single computer. The above purposes are achieved by estimating our statistical model (1.1) with the kernel ridge regression (KRR) method. The KRR framework is known to be very flexible and well supported by the general reproducing kernel Hilbert space (RKHS) theory (Mendelson, 2002; Steinwart et al., 2009; Zhang, 2005). In particular, the partial smoothing spline model (Wahba, 1990) can be viewed a special case. An important technical contribution of this paper is that a (point-wise) limit distribution of the KRR estimate is established by generalizing the smoothing spline inference results in Cheng and Shang (2013). This theoretical innovation makes our work go beyond the existing statistical study on the KRR estimation in large datastes, which mainly focus on their nonparametric minimaxity, e.g., Zhang et al. (2013); Bach (2012); Raskutti et al. (2014).

Our theoretical studies are mostly concerned with the so-called “oracle rule” for massive data. Specifically, we define the “oracle estimate” for commonality (heterogeneity) as the one computed when all the heterogeneity information are given (the commonality information is given in each-subpopulation), i.e., β0(j)'s are known (f0 is known). We claim that a commonality estimator satisfies the oracle rule if it possesses the same minimax optimality and asymptotic distribution as the “oracle estimate” defined above. A major contribution of this paper is to derive the largest possible diverging rate of s under which our combined estimator for commonality satisfies the oracle rule. In other words, our aggregation procedure can “filter out” the heterogeneity in data when s does not grow too fast with N. Interestingly, we have to set a lower bound on s for our heterogeneity estimate to possess the asymptotic distribution as if the commonality information were available, i.e., oracle rule. Our second contribution is to test the heterogeneity among a large number of sub-populations by employing the recent Gaussian approximation theory (Chernozhukov et al., 2013). The above results directly apply to the divide-and-conquer approach that deals with the sub-population with a huge sample size. In this case, the “oracle estimate” is defined as those computed based on the entire (homogeneous) data in those sub-populations. A rather different goal here is to explore the most computationally efficient way to split the whole sample while performing the best possible statistical inference. Specifically, we derive the largest possible number of splits under which the averaged estimators for both components enjoy the same statistical properties as the oracle estimators.

In both homogeneous and heterogeneous setting above, we note that the upper bounds established for s increase with the smoothness of f0. Hence, our aggregation procedure favors smoother regression functions in the sense that more sub-populations/splits are allowed in the massive data. On the other hand, we have to admit that our upper and lower bound results for s are only sufficient conditions although empirical results show that our bounds are quite sharp. Another interesting finding is that even the semi-nonparametric estimation is applied to only one fraction of the entire data, it is nonetheless essential to regularize each sub-estimation as if it had the entire sample.

In the end, we highlight two key technical challenges: (i) nontrivial interaction between the parametric and nonparametric components in the semi-nonparametric estimation. In particular, we observe a “bias propagation” phenomenon: the bias introduced by the penalization of the nonparametric component propagates to the parametric component, and the resulting parametric bias in turn propagates back to the nonparametric component. To analyze this complicated propagation mechanism, we extend the existing RKHS theory to an enlarged partially linear function space by defining a novel inner product under which the expectation of the Hessian of the objective function becomes identity. (ii) double asymptotics framework in terms of diverging s and N. In this challenging regime, we develop more refined concentration inequalities in characterizing the concentration property of an aggregated empirical processes. These delicate theoretical analysis show that an average of s asymptotic linear expansions is still a valid one as sN.

The rest of the paper is organized as follows: Section 2 briefly introduces the general RKHS theory and discusses its extension to an enlarged partially linear function space. Section 3 describes our aggregation procedure, and studies the “oracle” property of this procedure from both asymptotic and non-asymptotic perspectives. The efficiency boosting of heterogeneity estimators and heterogenous testing results are also presented in this section. Section 4 applies our general theory to various examples with different smoothness. Section 5 is devoted to the analysis of divide-and-conquer algorithms for homogeneous data. Section 6 presents some numerical experiments. All the technical details are deferred to Section 7 or Online Supplementary.

2. Preliminaries

In this section, we briefly introduce the general RKHS theory, and then extend it to a partially linear function space. Below is a generic definition of RKHS (Berlinet and Thomas-Agnan, 2004):

Definition 2.1

Denote by F(S,R) a vector space of functions from a general set S to R. We say that H is a reproducing kernel Hilbert space (RKHS) on S, provided that:

  • (i) H is a vector subspace of F(S,R);
  • (ii) H is endowed with an inner product, denoted as ,H, under which it becomes a Hilbert space;
  • (iii) for every yS, the linear evaluation functional defined by Ey(f) = f(y) is bounded.

If H is a RKHS, by Riesz representation, we have that for every yS, there exists a unique vector, KyH, such that for every fH,f(y)=f,KyH. The reproducing kernel for H is defined as K(x,y)=Ky(x).

Denote U(X,Z)X×ZRp×R, and PU as the distribution of U (PX and PZ are defined similarly). According to Definition 2.1, if S=Z and F(Z,R)=L2(PZ), then we can define a RKHS HL2(PZ) (endowed with a proper inner product ,H), in which the true function f0 is believed to lie. The corresponding kernel for H is denoted by K such that for any zZ: f(z)=f,KzH. By Mercer theorem, this kernel function has the following unique eigen-decomposition:

K(z1,z2)==1μϕ(z1)ϕ(z2),

where μ1μ2 ≥ ... > 0 are eigenvalues and {ϕ}=1 are an orthonormal basis in L2(PZ). Let {θ}=1 be the Fourier coefficient of f0 under the basis {ϕ}=1 (given the L2(PZ)-inner product ,L2(PZ)). Mercer theorem together with the reproducing property implies that ϕi,ϕjH=δijμi, where δij is the Kronecker's delta. The smoothness of the functions in RKHS can be characterized by the decaying rate of {μ}=1. Below, we present three different decaying rates together with the corresponding kernel functions.

Finite rank kernel

the kernel has finite rank r if μ=0 for all >r. For example, the linear kernel K(z1,z2)=z1,z2Rd has rank d, and generates a d-dimensional linear function space. The eigenfunctions are given by ϕ(z)=z for =1,,d. The polynomial kernel K(z1, z2) = (1 + z1 z2)d has rank d + 1, and generates a space of polynomial functions with degree at most d. The eigenfunctions are given by ϕ=z1 for =1,,d+1.

Exponentially decaying kernel

the kernel has eigenvalues that satisfy μc1exp(c2p) for some c1 c2 > 0. An example is the Gaussian kernel K(z1, z2) = exp(–|z1z2|2). The eigenfunctions are given by Sollich and Williams (2005)

ϕ(x)=(54)14(22(1)!)12e(51)x24H1((52)12x),
(2.1)

for =1,2,, where H() is the -th Hermite polynomial.

Polynomially decaying kernel

the kernel has eigenvalues that satisfy μ2ν for some ν ≥ 1/2. Examples include those underlying for Sobolev space and Besov space (Birman and Solomjak, 1967). In particular, the eigenfunctions of a ν-th order periodic Sobolev space are trigonometric functions as specified in Section 4.3. The corresponding Sobolev kernels are given in Gu (2013).

In this paper, we consider the following penalized estimation:

(β^,f^)=argmin(β,f)A{1ni=1n(YiXiTβf(Zi))2+λfH2},
(2.2)

where λ > 0 is a regularization parameter and A is defined as the parameter space Rp×H. For simplicity, we do not distinguish m=(β,f)A from its associated function mM{xTβ+f(z):(β,f)Aand(x,z)X×Z} throughout the paper. We call (β^,f^) as partially linear kernel ridge regression (KRR) estimate in comparison with the nonparametric KRR estimate in Shawe-Taylor and Cristianini (2004). In particular, when H is a ν-th order Sobolev space endowed with f,f~HZf(ν)(z)f~(ν)(z)dz, (β^,f^) becomes the commonly used partial smoothing spline estimate.

We next illustrate that A can be viewed as a partially linear extension of H in the sense that it shares some nice reproducing properties as this RKHS H under the following inner product:

m,m~A=XTβ+f(Z),XTβ~+f~(Z)L2(PU)+λf,f~H,
(2.3)

where m=(β,f)A and m~=(β~,f~)A. Similar as the kernel function Kz, we can construct a linear operator Ru()A such that Ru,mA=m(u) for any uX×Z. Also, construct another linear operator Pλ:AA such that Pλm,m~A=λf,f~H for any m and m~. See Proposition 2.3 for the construction of Ru and Pλ.

We next present a proposition illustrating the rational behind the definition of ,A. Denote as the outer product on A. Hence, EU[RURU]+Pλ is an operator from A to A.

Proposition 2.2

EU[RURU]+Pλ=id, where id is an identity operator on A.

Proof

For any m(β,f)A and m~=(β~,f~)A, we have

(EU[RURU]+Pλ)m,m~A=EU[RURU]m,m~A+Pλm,m~A=EU[m(U)m~(U)]+λf,f~H=m,m~A.

Since the choice of (m,m~) is arbitrary, we conclude our proof.

As will be seen in the subsequent analysis, e.g., in Theorem 3.3, the operator E[RURU]+Pλ is essentially the expectation of the Hessian of the objective function (w.r.t. Fréchet derivative) minimized in (2.2). Proposition 2.2 shows that the inversion of this Hessian matrix is trivial when the inner product is designed as in (2.3). Due to that, the theoretical analysis of m^=(β^,f^) based on the first order optimality condition becomes much more transparent.

To facilitate the construction of Ru and Pλ, we need to endow a new inner product with H:

f,f~C=f,f~L2(PZ)+λf,f~H,
(2.4)

for any f,f~H. Under (2.4), H is still a RKHS as the evaluation functional is bounded by Lemma A.1. We denote the new kernel function as K~(,), and define a positive definite self-adjoint operator Wλ:HH:

Wλf,f~C=λf,f~Hfor anyf,f~H,
(2.5)

whose existence is proven in Lemma A.2. Since fL2(PZ)2=i=1θi2 and fH2=i=1θi2μi, we now have fC2=i=1θi2(1+λμi) by (2.4). We next define two crucial quantities needed in the construction: BkE[XkZ] and its Riesz representer AkH satisfying Ak,fC=Bk,fL2(PZ) for all fH. Here, we implicity assume Bk is square integrable. The existence of Ak follows from the boundedness of the linear functional BkfBk,fL2(PZ) (by Riesz's representer theorem) as follows:

Bkf=Bk,fL2(PZ)BkL2(PZ)fL2(PZ)BkL2(PZ)fC.

We are now ready to construct Ru and Pλ based on K~z, Wλ, B and A introduced above, where B = (B1,..., Bp)T and A = (A1,...,Ap)T. Define Ω=E[(XB)(XB)T] and Σλ=E[B(Z)(B(Z)A(Z))T].

Proposition 2.3

For any u = (x, z), Ru can be expressed as Ru:u(Lu,Nu)A, where

Lu=(Ω+Σλ)1(xA(z))andNu=K~zATLu,

Moreover, for any m=(β,f)A, Pλm can be expressed as Pλm=(Lλf,Nλf)A, where

Lλf=(Ω+Σλ)1B,WλfL2(PZ)andNλf=WλfATLλf.

Notation

Denote || · ||2 and || · || as the Euclidean L2 and infinity norm in Rp, respectively. For any function f:ZR, let fsup=supzZf(z). We use || · || to denote the spectral norm of matrices. For positive sequences an and bn, we write anbn(anbn) if there exists some universal constant constant c > 0 (c′ > 0) independent of n such that ancbn (ancbn) for all nN. We denote anbn if both anbn and anbn. We define h as the inverse of =11(1+λμ), which is known to be the “effective dimension” of a kernel K (w.r.t.L2(PZ)) (Zhang, 2005). For any function space F, define

ω(F,δ)=0δlogN(F,sup,ϵ)dϵ,

where N(F,sup,ϵ) is an [set membership]-covering number of F w.r.t. supreme norm. Define the following sets of functions: F1={f(x)=xTβ:fsup1for allxX,βRp}, F2={fH:fsup1,fHh12λ12}, F{f=f1+f2:f1F1,f2F2,fsup12}.

3. Heterogeneous Data: Aggregation of Commonality

In this section, we start from describing our aggregation procedure and model assumptions in Section 3.1. The main theoretical results are presented in Sections 3.2–3.4 showing that our combined estimate for commonality enjoys the “oracle property”. To be more specific, we show that it possesses the same (non-asymptotic) minimax optimal bound (in terms of mean-squared error) and asymptotic distribution as the “oracle estimate” f^or computed when all the heterogeneity information are available:

f^or=argminfH{1Nj=1siLj(Yi(β0(j))TXif(Zi))2+λfH2}.
(3.1)

The above nice properties hold when the number of sub-populations does not grow too fast and the smoothing parameter is chosen according to the entire sample size N. Based on this combined estimator, we further construct a plug-in estimator for each heterogeneity parameter β0(j), which possesses the asymptotic distribution as if the commonality were known, in Section 3.5. Interestingly, this oracular result holds when the number of sub-population is not too small. In the end, Section 3.6 tests the possible heterogeneity among a large number of sub-populations.

3.1. Method and Assumptions

The heterogeneous data setup and averaging procedure are described below:

1. Obverse data (Xi, Zi, Yi) with the known label Li [set membership] {1, 2,...,s indicating the sub-population it belongs to, for i = 1,...,N. The size of samples from each sub-population is assumed to be the same, denoted by n, for simplicity. Hence, N = n × s.

2. On the j-th sub-population, obtain the following penalized estimator:

(β^n,λ(j),f^n,λ(j)=argmin(β,f)A{1niLj(XiTβ+f(Zi)Yi)2+λfH2}.
(3.2)

3. Obtain the final nonparametric estimate1 for commonality by averaging:

fN,λ=1sj=1sf^n,λ(j).
(3.3)

We point out that β^n,λ(j) is not our final estimate for heterogeneity. In fact, it can be further improved based on fN,λ; see Section 3.5.

For simplicity, we will drop the subscripts (n, λ) and (N, λ) in those notation defined in (3.2) and (3.3) throughout the rest of this paper. The main assumptions of this section are stated below.

Assumption 3.1 (Regularity Condition)

(i) εi's are i.i.d. sub-Gaussian random variables independent of the designs; (ii) BkL2(PZ) for all k, and ΩE[(XB(Z))(XB(Z))T] is positive definite; (iii) Xi's are uniformly bounded by a constant cx.

Conditions in Assumption 3.1 are fairly standard in the literature. For example, the positive definiteness of Ω is needed for obtaining semiparametric efficient estimation; see Mammen and van de Geer (1997). Note that we do not require the independence between X and Z throughout the paper.

Assumption 3.2 (Kernel Condition)

We assume that there exist 0 < cϕ < ∞ and 0 < cK < ∞ such that supϕsupcϕ and supz K(z, z) ≤ cK.

Assumption 3.2 is commonly assumed in kernel ridge regression literature (Zhang et al., 2013; Lafferty and Lebanon, 2005; Guo, 2002). In the case of finite rank kernel, e.g., linear and polynomial kernels, the eigenfunctions are uniformly bounded as long as Ƶ has finite support. As for the exponentially decaying kernels such as Gaussian kernel, we prove in Section 4.2 that the eigenfunctions given in (2.1) are uniformly bounded by 1.336. Lastly, for the polynomially decaying kernels, Proposition 2.2 in Shang and Cheng (2013) showed that the eigenfunctions induced from a ν-th order Sobolev space (under a proper inner product ,H) are uniformly bounded under mild smoothness conditions for the density of Z.

Assumption 3.3

For each k=1,,p,Bk()H. This is equivalent to

=1μ1Bk,ϕL2(PZ)2<.

Assumption 3.3 requires the conditional expectation of Xk given Z = z is as smooth as f0(z). As can be seen in Section 3.4, this condition is imposed to control the bias of the parametric component, which is caused by penalization on the the nonparametric component. We call this interaction as the “bias propagation phenomenon”, and study it in Section 3.4.

3.2. Non-Asymptotic Bound for Mean-Squared Error

The primary goal of this section is to evaluate the estimation quality of the combined estimate from a non-asymptotic point of view. Specifically, we derive a finite sample upper bound for the mean-squared error MSE(f)E[ff0L2(PZ)2]. When s does not grow too fast, we show that MSE(f) is of the order O((Nh)1+λ), from which the aggregation effect on f can be clearly seen. If λ is chosen in the order of N, the mean-squared error attains the (un-improvable) optimal minimax rate. As a by-product, we establish a non-asymptotic upper bound for the mean-squared error of β^(j), i.e., MSE(β^(j))E[β^(j)β0(j)22]. The results in this section together with Theorem 3.4 in Section 3.4 determine an upper bound of s under which f enjoys the same statistical properties (minimax optimality and asymptotic distribution) as the oracle estimate for.

Define τmin(Ω) as the minimum eigenvalue of Ω and Tr(K)=1μ as the trace of K. Moreover, let C1=2τmin2(Ω)(cx2p+cϕ2Tr(K)k=1pBkH2), C1=2cϕ2(1+C1k=1pBkL2(PZ)2), C2=τmin2(Ω)f0H2k=1pBkH2 and C2=2C2k=1pBkL2(PZ)2.

Theorem 3.1

Suppose that Assumptions 3.1–3.3 hold. If s=o(Nh2(ω(F,1)+logN)2), then we have

MSE(f)C1σ2(Nh)1+2f0H2λ+C2λ2+s1a(n,s,h,λ,ω),
(3.4)

where a(n,s,h,λ,ω)=Ch1n1rn,s2(ω(F,1)2+1)+Ch2λ1nexp(clog2N), rn,s = (nh)−1/2log2 N + λ and C is some generic constant.

Typically, we require an upper bound for s so that the third term in the R.H.S. of (3.4) can be dominated by the first two terms, which correspond to variance and bias, respectively. Hence, we choose λ(Nh)1 to attain the optimal bias-variance trade-off. The resulting rate coincides with the minimax optimal rate of the oracle estimate in different RKHS; see Section 4. This can be viewed as a non-asymptotic version of the “oracle property” of f. In comparison with the nonparametric KRR result in Zhang et al. (2013), we realize that adding one parametric component does not affect the finite sample upper bound (3.4).

As a by-product, we obtain a non-asymptotic upper bound for MSE(β^(j)). This result is new, and also of independent interest.

Theorem 3.2

Suppose that Assumptions 3.1 – 3.3 hold. Then we have

MSE(β^(j))C1σ2n1+C2λ2+a(n,s,h,λ,ω),
(3.5)

where a(n, s, h, λ, ω) is defined in Theorem 3.1.

Again, the first term and second term in the R.H.S. of (3.5) correspond to the variance and bias, respectively. In particular, the second term comes from the bias propagation effect to be discussed in Section 3.4. By choosing λ= o(n−1/2), we can obtain the optimal rate of MSE(β^(j)), i.e., O(n−1/2), but may lose the minimax optimality of MSE(f) in most cases.

3.3. Joint Asymptotic Distribution

In this section, we derive a preliminary result on the joint limit distribution of (β^(j),f(z0)) at any z0Z. A key issue with this result is that their centering is not at the true value. However, we still choose to present it here since we will observe an interesting phenomenon when removing the bias in Section 3.4.

Theorem 3.3 (Joint Asymptotics I)

Suppose that Assumptions 3.1 and 3.2 hold, and that as N,hK~z0L2(PZ)2σz02, h12(WλA)(z0)αz0Rp, and h12A(z0)γz0Rp. Suppose the following conditions are satisfied

s=o(Nh2(ω(F,1)+logN)2),
(3.6)
s(Nh)1log4N+λ=o(h2(ω(F,1)+logN)2log2(N)).
(3.7)

Denote (β0(j),f0) as (idPλ)m0(j), where m0(j)=(β0(j),f0). We have for any z0 [set membership] Ƶ and j = 1,...,s,

  • (i)
    if s → ∞ then
    (n(β^(j)β0(j)Nh(f(z0)f0(z0)))N(0,σ2(Ω100Σ22)),
    (3.8)
    where Σ22=σz02+2γz0TΩ1αz0+γz0TΩ1γz0;
  • (ii)
    if s is fixed, then
    (n(β^(j)β0(j))Nh(f(z0)f0(z0)))N(0,σ2(Ω1s12Σ21s12Σ12Σ22)),
    (3.9)
    where Σ12=Σ12T=Ω1(αz0+γz0).

Part (i) of Theorem 3.3 says that nβ^(j) and Nhf(z0) are asymptotically independent as s → ∞. This is not surprising since only samples in one sub-population (with size n) contribute to the estimation of the heterogeneity component while the entire sample (with size N) to commonality. As n/N = s−1 → 0, the former data becomes asymptotically independent of (or asymptotically ignorable to) the latter data. So are these two estimators. The estimation bias Pλm0(j) can be removed by placing a smoothness condition on Bk, i.e., Assumption 3.3. Interestingly, given this additional condition, even when s is fixed, these two estimators can still achieve the asymptotic independence if h → 0. Please see more details in next section.

3.4. Bias Propagation

In this section, we first analyze the source of estimation bias observed in the joint asymptotics Theorem 3.3. In fact, these analysis leads to a bias propagation phenomenon, which intuitively explains how Assumption 3.3 removes the estimation bias. More importantly, we show that f shares exactly the same asymptotic distribution as for, i.e., oracle rule, when s does not grow too fast and λ is chosen in the order of N.

Our study on propagation mechanism is motivated by the following simple observation. Denote XRn×p and ZRn as the designs based on the samples from the jth sub-population and let ε(j)=[ϵi]iLjRn. The first order optimality condition (w.r.t. β) gives

β^(j)β0(j)=(XTX)1XTε(j)(XTX)1XT(f^(j)(Z)f0(Z)),
(3.10)

where f0(Z) is a n-dimensional vector with entries f0(Zi) for i [set membership] Lj and f^(j)(Z) is defined similarly. Hence, the estimation bias of β^(j) inherits from that of f(j). A more complete picture on the propagation mechanism can be seen by decomposing the total bias Pλm0(j) into two parts:

parametric bias:Lλf0=(Ω+Σλ)1B,Wλf0L2(PZ),
(3.11)
nonparametric bias:Nλf0=Wλf0ATLλf0
(3.12)

according to Proposition 2.3. The first term in (3.12) explains the bias introduced by penalization; see (2.5). This bias propagates to the parametric component through B, as illustrated in (3.11). The parametric bias Lλf0 propagates back to the nonparametric component through the second term of (3.12). Therefore, by strengthening BkL2(PZ) to BkH, i.e., Assumption 3.3, it can be shown that the order of Lλf0 in (3.11) reduces to that of λ. And then we can remove Lλf0 asymptotically by choosing a sufficiently small λ. In this case, the nonparametric bias becomes Wλf0.

We summarize the above discussions in the following theorem:

Theorem 3.4. (Joint Asymptotics II)

Suppose Assumption 3.3 and the conditions in Theorem 3.3 hold. If we choose λ=o((Nh)12n12), then

  • (i)
    if s → ∞ then
    (n(β^(j)β0(j))Nh(f(z0)f0(z0)Wλf0(z0)))N(0,σ2(Ω100Σ22)),
    (3.13)
    where Σ22=σz02+γz0TΩ1γz0;
  • (ii)
    if s is fixed, then
    (n(β^(j)β0(j))Nh(f(z0)f0(z0)Wλf0(z0)))N(0,σ2(Ω1s12Σ21s12Σ12Σ22)),
    (3.14)
    where Σ12=Σ12T=Ω1γz0 and Σ22 is the same as in (i).

Moreover, if h → 0, then Σ12=Σ21=0 and Σ22=σz02 in (i) and (ii).

The nonparametric estimation bias Wλf0(z0) can be further removed by performing undersmoothing, a standard procedure in nonparametric inference, e.g., Shang and Cheng (2013). We will illustrate this point in Section 4.

By examining the proof for case (ii) of Theorem 3.4 (and taking s = 1), we know that the oracle estimate for defined in (3.1) attains the same asymptotic distribution as that of f in (3.13) when s grows at a proper rate. Therefore, we claim that our combined estimate f satisfies the desirable oracle property.

In Section 4, we apply Theorem 3.4 to several examples, and find that even though the minimization (3.2) is based only on one fraction of the entire sample, it is nonetheless essential to regularize each sub-estimation as if it had the entire sample. In other words, λ should be chosen in the order of N. Similar phenomenon also arises in analyzing minimax optimality of each sub-estimation; see Section 3.2.

Cheng and Shang (2013) have recently uncovered a joint asymptotics phenomenon in partial smoothing spline models: parametric estimate and (point-wise) nonparametric smoothing spline estimate become asymptotically independent after the parametric bias is removed. This corresponds to a special case of Part (ii) of Theorem 3.4 for polynomially decaying kernels with s = 1 and h → ∞. Therefore, case (ii) in Theorem 3.4 generalizes this new phenomenon to the partially linear kernel ridge regression models. When h0, e.g., hr1 for finite rank kernel, the semi-nonparametric estimation in consideration essentially reduces to a parametric one. Hence, it is not surprising that the asymptotic dependence remains.

Remark 3.1

Theorem 3.4 implies that n(β^(j)β0(j))N(0,σ2Ω1) when λ = o(n−1/2). When the error [set membership] follows a Gaussian distribution, it is well known that β^(j) achieves the semiparametric efficiency bound (Kosorok, 2007). Hence, the semiparametric efficient estimate can be the obtained by applying the kernel ridge method. However, we can further improve its estimation efficiency to a parametric level by taking advantage of f (built on the whole samples). This is one important feature of massive data: strength-borrowing.

Efficiency Boosting: from semiparametric level to parametric level

The previous sections show that the combined estimate f achieves the “oracle property” in both asymptotic and non-asymptotic senses when s does not grow too fast and λ is chosen according to the entire sample size. In this section, we employ f to boost the estimation efficiency of β^(j) from semiparametric level to parametric level. This leads to our final estimate for heterogeneity, i.e., βˇ(j) defined in (3.15). More importantly, βˇ(j) possesses the limit distribution as if the commonality in each sub-population were known, and hence satisfies the “oracle rule”. This interesting efficiency boosting phenomenon will be empirically verified in Section 6.

Specifically, we define the following improved estimator for β0:

βˇ(j)=argminβRp1niLj(YiXiTβf(Zi))2.
(3.15)

Theorem 3.5 below shows that βˇ(j) achieves the parametric efficiency bound as if the nonparametric component f were known. This is not surprising given that the nonparametric estimate f now possesses a faster convergence rate after aggregation. What is truly interesting is that we need to set a lower bound for s, i.e., (3.16), and thus the homogeneous data setting is trivially excluded. This lower bound requirement slows down the convergence rate of βˇ(j), i.e., √n, such that f can be treated as if it were known.

Theorem 3.5

Suppose Assumption 3.1 and 3.2 hold. If s satisfies

s1=o(h2log4N),
(3.16)
s=o(Nh2(ω(F,1)+logN)2),
(3.17)
s(Nh)1log4N+λ=o(h2(ω(F,1)+logN)2log2N),
(3.18)

and we choose λ = o((Nh)−1), then we have

n(βˇ(j)β0(j))N(0,σ2Σ1),

where Σ=E[XXT].

Recall that X and Z are not assumed to be independent. Hence, the parametric efficiency bound Σ−1 is not larger than the semiparametric efficiency bound Ω−1.

3.6. Testing for Heterogeneity

The heterogeneity across different sub-populations is a crucial feature of massive data. However, there is still some chance that some sub-populations may share the same underlying distribution. In this section, we consider testing for the heterogeneity among sub-populations. We start from a simple pairwise testing, and then extend it to a more challenging simultaneous testing that can be applied to a large number of sub-populations.

Consider a general class of pairwise heterogeneity testing:

H0:Q(β0(j)β0(k))=0forjk,
(3.19)

where Q=(Q1T,,QqT)T is a q × q matrix with q ≥ p. The general formulation (3.19) can test either the whole vector or one fraction of β0(j) is equal to that of β0(k). A test statistic can be constructed based on either β^ or its improved version βˇ. Let CαRq be a confidence region satisfying P(bCα)=1α for any b ~ N(0, Iq). Specifically, we have the following α-level Wald tests:

Ψ1=I{Q(β^(j)β^(k))2nσ(QΩ1QT)12Cα},Ψ2=I{Q(βˇ(j)βˇ(k))2nσ(QΣ1QT)12Cα}.

The consistency of the above tests are guaranteed by Theorem 3.6 below. In addition, we note that the power of the latter test is larger than the former; see the analysis below Theorem 3.6. The price we need to pay for this larger power is to require a lower bound on s.

Theorem 3.6

Suppose that the conditions in Theorem 3.4 are satisfied. Under the null hypothesis specified in (3.19), we have

nQ(β^(j)β^(k))N(0,2σ2QΩ1QT).

Moreover, under the conditions in Theorem 3.5, we have

nQ(βˇ(j)βˇ(k))N(0,2σ2QΣ1QT),

where Σ=E[XXT].

The larger power of Ψ2 is due to the smaller asymptotic variance of βˇ(j), and can be deduced from the following power function. For simplicity, we consider H0:β01(j)β01(k)=0, i.e., Q = (1, 0, 0 ..., 0). In this case, we have Ψ1=I{β^1(j)β^1(k)>2σ[Ω1]1112zα2n}, and Ψ2=I{βˇ1(j)βˇ1(k)>2σ[Σ1]1112zα2n}. The (asymptotic) power function under the alternative that β01(j)β01(k)=β for some non-zero β* is

Power(β)=1P(W[βnσ±zα2]),

where W ~ N(0, 1) and σ* is 2σ[Ω1]1112 for Ψ1 and 2σ[Σ1]1112 for Ψ2. Hence, a smaller σ* gives rise to a larger power, and Ψ2 is more powerful than Ψ1. Please see Section 6 for empirical support for this power comparison.

We next consider a simultaneous testing that is applied to a large number of sub-populations:

H0:β(j)=β~(j)for alljG,
(3.20)

where G{1,2,,s}, versus the alternative:

H1:β(j)β~(j)for somejG.
(3.21)

The above β^(j)'s are pre-specified for each jG. If all β~(j)'s are the same, then it becomes a type of heterogeneity test for the group of sub-populations indexed by G. Here we allow G to be as large as s, and thus it can increase with n. Let Σ^(j) be the sample covariance matrix of X for the j-th sub-population, i.e., n1iLjXiXiT. Define the test statistic

TGmaxjG,1kpn(βˇk(j)β~k(j)).

We approximate the distribution of the above test statistic using multiplier bootstrap. Define the following quantity:

WGmaxjG,1kp1niLj(Σ^(j))k1Xiei,

where ei's are i.i.d. N(0, σ2) independent of the data and (Σ^(j))k1 is the k-th row of (Σ^(j))1. Let cG(α)=inf{tR:P(WGtX)1α}. We employ the recent Gaussian approximation and multiplier bootstrap theory (Chernozhukov et al., 2013) to obtain the following theorem.

Theorem 3.7

Suppose Assumptions 3.1 and 3.2 hold. In addition, suppose (3.17) and (3.18) in Theorem 3.5 hold. For any G{1,2,,s} with G=d, if (i) sh2log(pd)log4N, (ii) (log(pdn))7nC1nc1 for some constants c1, C1 > 0, and (iii) p2log(pd)n=o(1), then under H0 and choosing λ = o((Nh)−1), we have

supα(0,1)P(TG>cG(α))α=o(1).

Remark 3.2

We can perform heterogeneity testing even without specifying β~(j)'s. This can be done by simply reformulating the null hypothesis as follows (for simplicity we set G=[s]): H0 : α(j) = 0 for j [set membership] [s – 1], where α(j) = β(j)β(j+1) for j = 1,...,s – 1. The test statistic is TG=max1js1max1kpαk(j). The bootstrap quantity is defined as

WGmax1js1,1kp1niLj(Σ^(j))k1Xiei1niLj+1(Σ^(j+1))k1Xiei.

The proof is similar to that of Theorem 3.7 and is omitted.

4. Examples

In this section, we consider three specific classes of RKHS with different smoothness, characterized by the decaying rate of the eigenvalues: finite rank, exponential decay and polynomial decay. In particular, we give explicit upper bounds for s under which the combined estimate enjoys the oracle property, and also explicit lower bounds for obtaining efficiency boosting studied in Section 3.5. Interestingly, we find that the upper bound for s increases for RKHS with faster decaying eigenvalues. Hence, our aggregation procedure favors smoother regression functions in the sense that more sub-populations are allowed to be included in the observations. The choice of λ is also explicitly characterized in terms of the entire sample size and the decaying rate of eigenvalues. In all three examples, the undersmoothing is implicitly assumed for removing the nonparametric estimation bias. Our bounds on s and λ here are not the most general ones. Rather, we present the bounds that have less complicated forms but are still sufficient to deliver theoretical insights.

4.1. Example I: Finite Rank Kernel

The RKHS with finite rank kernels includes linear functions, polynomial functions, and, more generally, functional classes with finite dictionaries. In this case, the effective dimension is simply proportional to the rank r. Hence, hr1. Combining this fact with Theorem 3.4, we get the following corollary for finite rank kernels:

Corollary 4.1

Suppose Assumption 3.1 – 3.3 hold and s → ∞. For any z0Z, if λ = o(N−1/2), log(λ−1) = o(N2log−12N) and s=o(Nlogλ1log6N), then

(n(β^(j)β0(j))N(f(z0)f0(z0)))N(0,σ2(Ω100Σ22)),

where Σ22==1rϕ(z0)2+γz0TΩ1γz0 and γz0==1rB,ϕL2(PZ)ϕ(z0).

From the above Corollary, we can easily tell that the upper bound for s can be as large as o(N log−7 N) by choosing a sufficiently large λ. Hence, s can be chosen nearly as large as N. As for the lower bound of s for boosting the efficiency, we have sr2log4N by plugging hr1 into (3.16). This lower bound is clearly smaller than the upper bound. Hence, the efficiency boosting is feasible.

Corollary 4.2 below specifies conditions and s and λ under which f achieves the nonparametric minimaxity.

Corollary 4.2

Suppose that Assumption 3.1 - 3.3 hold. When λ = r/N and s = o(N log−5 N), we have

E[ff0L2(PZ)2]CrN,

for some constant C.

4.2. Example II: Exponential Decay Kernel

We next consider the RKHS for which the kernel has exponentially decaying eigenvalues, i.e., μ=exp(αp) for some α > 0. In this case, we have h(logλ1)1p by explicit calculations.

Corollary 4.3

Suppose Assumption 3.1 – 3.3 hold, and for any z0Z,f0H satisfies =1ϕ(z0)f0,ϕH<. If λ=o(N12log1(2p)Nn12), log(λ1)=o(Np(p+4)log6p(p+4)N) and s=o(Nlog6Nlog(p+4)pλ1), then

(n(β^(j)β0(j))Nh(f(z0)f0(z0))N(0,σ2(Ω100σz02)),

where σz02=limNh=1ϕ2(z0)(1+λμ)2.

Corollary 4.3 implies the shrinking rate of the confidence interval for f0(z0) as (Nh)−1/2. This motivates us to choose λ (equivalently h) as large as possible. Plugging such a λ into the upper bound of s yields s=o(Nlog(7p+4)pN). For example, when p = 1(p = 2), the upper bound is s=o(Nlog11N)(s=o(Nlog9N)). Note that this upper bound for s only differs from that for the finite rank kernel up to some logrithmic term. This is mainly because RKHS with exponentially decaying eigenvalues has an effective dimension (log N)1/p (for the above λ). Again, by (3.16) we get the lower bound of s(logλ1)2plog2 When λN12log1(2p)Nn12, it is approximately slog(4p+2)pN.

As a concrete example, we consider the Gaussian kernel K(z1, z2) = exp (–|z1z2|2/2). The eigenfunctions are given in (2.1), and the eigenvalues are exponentially decaying, as μ=η2+1, where η = √5–1)/2. According to Krasikov (2004), we can get that

cϕ=supNϕsup2e158(54)1232π2161.336.

Thus, Assumption 3.2 is satisfied. We next give an upper bound of σz02 in Corollary 4.3 as follows:

σz02limNσ2cϕ2h=0(1+ληexp(2(logη)))2=cϕ22σ2log(1η)1.7178σ2,

where equality follows from Lemma C.1 in Appendix C with the case t = 2. Hence, a (conservative) 100(1–α)% confidence interval for f0(z0) is given by f(z0) ± 1.3106σzα/2/√Nh.

Corollary 4.4

Suppose that Assumption 3.1 – 3.3 hold. By choosing λ = log1/p N/N and s = o(N log–(5p+3)/p N), we have

E[ff0L2(PZ)2]C(logN)1pN.

We know that the above rate is minimax optimal according to Zhang et al. (2013). Note that the upper bound for s required here is similar as that for obtaining the joint limiting distribution in Corollary 4.3.

4.3. Example III: Polynomial Decay Kernel

We now consider the RKHS for which the kernel has polynomially decaying eigenvalues, i.e., μ=c2ν for some ν > 1/2. Hence, we can explicitly calculate that h = λ1/(2ν). The resulting penalized estimate is called as “partial smoothing spline” in the statistics literature; see Gu (2013); Wang (2011).

Corollary 4.5

Suppose Assumption 3.1 – 3.3 hold, and =1ϕ(z0)f0,ϕH< for any z0Z and f0H. For any ν>1+321.866,ifλNd for some 2ν4ν+1<d<4ν210ν1,λ=o(n12) and s=o(λ10ν14ν2Nlog6N), then

(n(β^(j)β0(j))Nh(f(z0)f0(z0)))N(0,σ2(Ω100σz02)).

where σz02=limNh=1ϕ2(z0)(1+λμ)2.

Similarly, we choose λN2ν4ν+1n12 to get the fastest shrinking rate of the confidence interval. Plugging the above λ into the upper bound for s, we get

s=o(N8ν28ν+12ν(4ν+1)log6NN(logN)48ν28ν2+10ν+1).

When N is large, the above bound reduces to s=o(N8ν28ν+12ν(4ν+1)log6N). We notice that the upper bound for s increases as ν increases, indicating that the aggregation procedure favors smoother functions. As an example, for the case that ν = 2, we have the upper bound for s = o(N17/36 log−6 N) ≈ o(N0.47 log−6 N). Again, we obtain the lower bound sλ1νlog4N by plugging hλ12ν into (3.16). When λN2ν4ν+2, we get sN14ν+1log2N. For ν = 2, this is approximately sN0.22log4N.

As a concrete example, we consider the periodic Sobolev space H0ν[0,1] with the following eigenfunctions:

ϕ(x)={1,=0,2cos(πx),=2kfork=1,2,,2sin((+1)πx),=2k1fork=1,2,,}
(4.1)

and eigenvalues

μ={,=0,(π)2ν,=2kfork=1,2,,((+1)π)2ν,=2k=1fork=1,2,,}
(4.2)

Hence, Assumption 3.2 trivially holds. Under the above eigensystem, the following lemma gives an explicit expression of σz02.

Lemma 4.1

Under the eigen-system defined by (4.1) and (4.2), we can explicitly calculate:

σz02=limNh=1ϕ2(z0)(1+λμ)2=01(1+x2ν)2dx=π2νsin(π(2ν)).

Therefore, by Corollary 4.5, we have that when λN2ν4ν+1 and s=o(N8ν28ν+12ν(4ν+1)log6N),

(n(β^(j)β0(j))Nh(f(z0)f0(z0)))N(0,σ2(Ω100σz02)).
(4.3)

where σz02 is given in Lemma 4.1. When ν=2,λN49 and the upper bound for s = o(N17/36 log−6 N).

Corollary 4.6

Suppose that Assumption 3.1 - 3.3 hold. If we choose λ=N2ν2ν+1, and s=o(N4ν24ν+14ν2+2νlog4N), the combined estimator achieves optimal rate of convergence, i.e.,

E[ff0L2(PZ)2]CN2ν2ν+1.
(4.4)

The above rate is known to be minimax optimal for the class of functions in consideration (Stone, 1985).

5. Application to Homogeneous Data: Divide-and-Conquer Approach

In this section, we apply the divide-and-conquer approach, which is commonly used to deal with massive homogeneous data, to some sub-populations that have huge sample sizes. A general goal of this section is to explore the most computationally efficient way to split the sample in those sub-populations while preserving the best possible statistical inference. Specifically, we want to derive the largest possible number of splits under which the averaged estimators for both components enjoy the same statistical performances as the “oracle” estimator that is computed based on the entire sample. Without loss of generality, we assume the entire sample to be homogeneous by setting all β0(j)'s to be equal throughout this section.

The divide-and-conquer method randomly splits the massive data into s mutually exclusive subsamples. For simplicity, we assume all the subsamples share the same sample size, denoted as n. Hence, N = n × s. With a bit abuse of notation, we define the divide-and-conquer estimators as β^(j) and f^(j) when they are based on the j-th subsample. Thus, the averaged estimator is defined as

β=(1s)j=1sβ^(j)andf()=(1s)j=1sf^(j)().

Comparing to the oracle estimator, the aggregation procedure reduces the computational complexity in terms of the entire sample size N to the sub-sample size N/s. In the case of kernel ridge regression, the complexity is O(N3), while our aggregation procedure (run in one single machine) reduces it to O(N3/s2). Propositions 5.1 and 5.2 below state conditions under which the divide-and-conquer estimators maintain the same statistical properties as oracle estimate, i.e., so-called oracle property.

Our first contribution is a non-asymptotic upper bound for MSE(f).

Proposition 5.1

Suppose that the conditions in Theorem 3.1 hold. We have that the divide and conquer estimator f satisfies

MSE(f)C1σ2(Nh)1+2f0H2λ+C2λ2+s1a(n,s,h,λ,ω),
(5.1)

where a(n, s, h, λ, ω), C1 and C2 are constants defined in Theorem 3.1.

Our second contribution is on the joint asymptotic distribution under the same conditions for (s, λ) required in the heterogeneous data setting.

Proposition 5.2

Suppose that the conditions in Theorem 3.4 hold. If we choose λ = o(N−1/2), then

(N(ββ0)Nh(f(z0)f0(z0)Wλf0(z0)))N(0,(σ2Ω2Σ12Σ21Σ22)),

where Σ12=Σ21T=σ2Ω1γz0 and Σ22=σ2(σz02+γz0TΩ1γz0). Moreover, if h → 0, then γz0 = 0. In this case, Σ12=Σ21T=0 and Σ22=σ2σz02.

The conclusion of Proposition 5.2 holds no matter s is fixed or diverges (once the condition for s in Theorem 3.4 are satisfied).

In view of Propositions 5.1 and 5.2, we note that the above upper bound and joint asymptotic distribution are exactly the same as those for the oracle estimate, i.e., s = 1.

6. Numercial Experiment

In this section, we empirically examine the impact of the number of sub-populations on the statistical inference built on (β^(j),f). As will be seen, the simulation results strongly support our general theory.

Specifically, we consider the partial smoothing spline models in Section 4.3. In the simulation setup, we let ε ~ N(0,1), p = 1 and ν = 2 (cubic spline). Moreover Z ~ Uniform (−1, 1) and X = (W + Z)/2, where W ~ Unfiform(−1,1), such that X and Z are dependent. It is easy to show that Ω = E[(X – E[X|Z])2)] = 1/12 and Σ = E[X2] = 1/6. To design that heterogeneous data setting, we let β0(j)=j for j = 1,2,...,s on the j-th subpopulation. The nonparametric function f0(z), which is common across all subpopulations, is assumed to be 0.6b30,17(z) + 0.4b3,11, where bα1,α2 is the density function for Beta(α1, α2).

We start from the 95% predictive interval (at (x0, z0)) implied by the joint asymptotic distribution (4.3):

[Y^(j)±1.96σx0TΩ1x0n+σz02(Nh)+1],

where Y^(j)x0Tβ^(j)+f(z0) is the predicted response. The unknown error variance σ is estimated by (σ^(j))2=n1iLj(YiXiTβ^(j)f^(j)(Zi))2(nTr(A(λ))), where A(λ) denotes the smoothing matrix, followed by an aggregation σ2=1sj=1s(σ^(j))2. In the simulations, we fix x0 = 0.5 and choose z0 = 0.25, 0.5, 0.75 and 0.95. The coverage probability is calculated based on 200 repetitions. As for N and s, we set N = 256, 528, 1024, 2048, 4096, and choose s = 20, 21,...,2t–3 when N = 2t. The simulation results are summarized in Figure 1. We notice an interesting phase transition from Figure 1: when ss* where s* ≈ N0.45, the coverage probability is approximately 95%; when ss*, the coverage probability drastically decreaes. This empirical observation is strongly supported by our theory developed in Section 4.3 where s* ≈ N0.42 log−6N for ν = 2.

Fig 1
Coverage probability of 95% predictive interval with different choices of s and N

We next compute the mean-squared errors of f under different choices of N and s in Figure 2. It is demonstrated that the increasing trends of MSE as s increases are very similar for different N. More importantly, all the MSE curves suddenly blow up when sN0.4. This is also close to our theoretical result that the transition point is around N0.45 log−6 N.

Fig 2
Mean-square errors of f under different choices of N and s

We next empirically verify the efficiency boosting theory developed in Section 3.5. Based on β^(j) and βˇ(j), we construct the following two types of 95% confidence intervals for β0(j).

CI1=[β^(j)±1.96Ω12n1.2σ],CI2=[βˇ(j)±1.96Σ12n1.2σ].

Obviously, CI2 is shorter than CI1. However, Theorem 3.5 shows that CI2 is valid only when s satisfies both a upper bound and a lower bound. This theoretical condition is empirically verified in Figure 3 which exhibits the validity range of CI2 in terms of s. In Figure 4, we further compare CI2 and CI1 in terms of their coverage probabilities and lengths. This figure shows that when s is in a proper range, the coverage probabilities of CI1 and CI2 are similar, while CI2 is significantly shorter.

Fig 3
Coverage probability of 95% confidence interval based on βˇ
Fig 4
Covergae probabilities and average lengths of 95% confidence intervals constructed based on β^ and βˇ. In the above figures, dashed lines represent CI1, which is constructed based on βˇ, and solid lines represent ...

Lastly, we consider the heterogeneity testing. In Figure 5, we compare tests Ψ1 and Ψ2 under different choices of N and s ≥ 2. Specifically, Figure 5 (i) compares the nominal levels, while Figure 5 (ii) - (iv) compare the powers under various alternative hypotheses H1:β0(j)β0(k)=Δ, where Δ = 0.5, 1, 1.5. It is clearly seen that both tests are consistent, and their powers increase as Δ or N increases. In addition, we observe that Ψ2 has uniformly larger powers than Ψ1.

Fig 5
(i) Nominal level of heterogeneity tests Ψ1 and Ψ2; (ii) - (iv) Power of heterogeneity tests Ψ1 and Ψ2 when Δ= 0.5, 1.0, 1.5. In the above figures, dashed lines represent Ψ1, which is constructed based on ...

7. Proof of Main Results

In this section, we present main proofs of Theorem 3.1, 3.3 and 3.4 in the main text.

7.1. Proof of Theorem 3.1

Proof

We start from analyzing the minimization problem (3.2) on each sub-population. Recall m = (β,f) and U = (X, Z). The objective function can be rewritten as

1niLj(YiXiTβf(Zi))2+λfH2=1niLj(Yim(Ui))2+Pλm,mA=1niLj(YiRUi,mA)2+Pλm,mA

The first order optimality condition (w.r.t. Fréchet derivative) gives

1niLjRUi(m^(j)(Ui)Yi)+Pλm^(j)=0,

where m^(j)=(β^(j),f^(j)). This implies that

1niLjRUiεi+1niLjRUi(m^(j)(Ui)m0(j)(Ui))+Pλm^(j)=0,

where m0(j)=(β0(j),f0). Define Δm(j)m^(j)m0(j). Adding EU[RUΔm(j)(U)] on both sides of the above equation, we have

EU[RUΔm(j)(U)]+PλΔm(j)=1niLjRUiεiPλm0(j)1niLj(RUiΔm(j)(Ui)EU[RUΔm(j)(U)]).
(7.1)

The L.H.S. of (7.1) can be rewritten as

EU[RUΔm(j)(U)]+PλΔm(j)=EU[RURU,Δm(j)A]+PλΔm(j)=(EU[RURU]+Pλ)Δm(j)=Δm(j),

where the last equality follows from proposition 2.2. Then (7.1) becomes

m^(j)m0(j)=1niLjRUiεiPλm0(j)1niLj(RUiΔm(j)(Ui)EU[RUΔm(j)(U)]).
(7.2)

We will show that the first term in the R.H.S. of (7.2) weakly converges to a normal distribution, the second term contributes to the estimation bias, and that the last term is an asymptotically ignorable remainder term. We denote the last term as Rem(j)1niLj(RUiΔm(j)(Ui)EU[RUΔm(j)(U)]). Recall that Ru = (Lu, Nu) and Pλm0(j)=(Lλf0,Nλf0). Thus the above remainder term decomposes into two components:

Remβ(j)1niLj(LUiΔm(j)(Ui)EU[LUΔm(j)(U)])Remf(j):=1niLj(NUiΔm(j)(Ui)EU[NUΔm(j)(U)]).

Similarly, (7.2) can be rewritten into the following two equations:

β^(j)β0(j)=1niLjLUiεiLλf0Remβ(j),
(7.3)

and

f^(j)f0=1niLjNUiεiNλf0Remf(j),
(7.4)

for all j = 1,...,s. Taking average of (7.4) for all j over s, and by definition of f, we have

ff0=1Ni=1NNUiεiNλf01sj=1sRemf(j),
(7.5)

where we used 1Ni=1NNUiεi=1sj=1s1niLjNUiεi. By (7.5), it follows that

E[ff0C2]3E[1Ni=1NNUiεiC2]+3Nλf0C2+3E[1sj=1sRemf(j)C2].
(7.6)

By Lemma A.5 and the fact that each NUiεi is i.i.d., it follows that

E[1Ni=1NNUiεiC2]=1NE[NUεC2]C1σ21Nh,
(7.7)

and

Nλf0C22f0H2λ+C2λ2,
(7.8)

where C1 and C2 are constants specified in Lemma A.5.

As for the third term in (7.6), we have by independence across sub-populations that

E[1sj=1sRemf(j)C2]=1s2j=1sE[Remf(j)C2].
(7.9)

Therefore it suffices to bound E[Remf(j)c2]. We have the following lemma that controls this term:

Lemma 7.1

Suppose Assumptions 3.1, 3.2 and Condition (3.6) hold. We have for all j = 1,...,s

E[Rem(j)A2]a(n,s,h,λ,ω),

for sufficiently large n. Moreover, the inequality also holds for E[Remβ(j)22] and E[Remf(j)C2]

Combining (7.6) - (7.9) and Lemma 7.1, and by the fact that ff0L2(PZ)2ff0C2, we complete the proof of Theorem 3.1.

7.2. Proof of Theorem 3.3

Proof

Recall that m0(j)=(β0(j),f0)=(idPλ)m0(j) where m0(j)=(β0(j),f0). This implies that β0(j)=β0(j)Lλf0 and (7.5), for arbitrary x and z0,

(xT,1)(n(β^(j)β0(j))Nh(f(z0)f0(z0)))=nxT(β^(j)β0(j))+Nh(fN,λ(z0)f0(z0))=1niLjxTLUiεi+1Ni=1Nh12NUi(z0)εi(I)+nxTRemβ(j)+Nhs1j=1sRemf(j)(z0)(II).

In what follows, we will show that the main term (I) is asymptotically normal and the remainder term (II) is of order oP(1). Given that x is arbitrary, we apply Wold device to conclude the proof of joint asymptotic normality.

Asymptotic normality of (I)

We present that result for showing asymptotic normality of (I) in the following lemma and defer its proof to supplemental material.

Lemma 7.2.

Suppose that as N,hK~z0L2(PZ)2σz02, h12(WλA)(z0)az0Rp and h12A(z0)γz0Rp. We have

  • (i)
    if s → ∞, then
    (I)N(0,σ2(xTΩ1x+Σ22)).
    (7.10)
  • (ii) if s is fixed, then
    (I)N(0,σ2(xTΩ1x+Σ22+2s12xTΣ12)).
    (7.11)

Control of the remainder term (II)

We now turn to bound the remainder term (II). We need the following lemma:

Lemma 7.3

Suppose Assumption 3.1, 3.2 and Condition (3.6) hold. We have the following two sets of results that control the remainder terms:

  • (i)
    For all j = 1,...,s
    Rem(j)A=oP(bn,s).
    where bn,s=Ch1n12rn,s(ω(F,1)+logN) and rn,s=(logN)2(nh)12+λ12. Also, Remβ(j)2=oP(bn,s) and Remf(j)C=oP(bn,s).
  • (ii)
    Moreover, we have
    1sj=1sRem(j)A=oP(sw12bn,slogN)1sj=1sRemβ(j)2=oP(s12bn,slogN)1sj=1sRemf(j)C=oP(s12bn,slogN).

By Lemma 7.3, we have

nxTRemβ(j)nx2Remβ(j)2=oP(n12bn,s)=op(Ns12bn.s),
(7.12)

where we used the boundedness of x. Also,

Nhs1j=1sRemf(j)(z0)NhK~z0Cs1j=1sRemf(j)CNs1j=1sRemf(j)C=oP(Ns12bn,slogN),
(7.13)

where the second inequality follows from Lemma A.4. Therefore by (7.12) and (7.13), we have

(II)=oP(Ns12bn,slogN).
(7.14)

Now by definition of bn,s and condition (3.7), we have (II) = oP (1). Combining (7.10) and (7.14), it follows that if s → ∞, then

(xT,1)(n(β^(j)β0(j))Nh(f(z0)f0(z0)))N(0,σ2(xTΩ1x+Σ22)).

Combining (7.11) and (7.14), it follows that if s is fixed, then

(xT,1)(n(β^(j)β0(j))Nh(f(z0)f0(z0)))N(0,σ2(xTΩ1x+Σ22+2s12xTΣ12)).

By the arbitrariness of x, we reach the conclusion of the theorem using Wold device.

7.3. Proof of Lemma 7.3: Controlling the Remainder Term

Proof

(i) We first derive the bound of Rem(j)A. Recall

Rem(j)=1niLjΔm(j)(Ui)RUiEU[Δm(j)(U)RU].

Let Zn(m)cr1h12n12iLj{m(Ui)RUiE[m(U)RU]}, where cr is the constant specified in Lemma A.4. Note that Zn(m) is implicitly related to j but we omit the superscript of (j). We have Rem(j)=cr1nhZn(Δm(j)). We apply Lemma F.1 to obtain an exponential inequality for supmFZn(m)A. The first step is to show that Zn(m) is a sub-Gaussian process by Lemma G.1. Let g(Ui,m)=cr1nh(m(Ui)RUiE[m(U)RU]). Now for any m1 and m2,

g(Ui,m1)g(Ui,m2)A=cr1nh{(m1(Ui)m2(Ui))RUiA+𝔼[(m1(U)m2(U))RU]A}2nm1m2sup,

where we used the fact that RuAcrh12 by Lemma A.4. Note that Zn(m)=1niLjg(Ui,m). Therefore by Lemma G.1, we have for any t > 0,

P(Zn(m1)Zn(m2)At)=P(1ni=1n{g(Ui,m1)g(Ui,m2)}At)2exp(t28m1m2sup2)
(7.15)

Then by Lemma F.1, we have

P(supmFZn(m)ACω(F,diam(F))+x)Cexp(x2Cdiam(F)2),
(7.16)

where diam(F)=supm1,m2Fm1m2sup.

Define dn,s = crrn,sh−1/2 and m~=dn,s1Δm(j)2. Again we do not specify its relationship with j. Define the event E={Δm(j)Arn,s}. On the event E, we have

m~supcrh12(2dn,s)1Δm(j)A12,

where we used the fact that m~supcrh12m~A by Lemma A.4. This implies xTβ~+f~(z)12 for any (x, z). Letting x = 0, one gets f~sup12, which further implies xTβ~1 for all x by triangular inequality. Moreover, on the even E we have

f~Hλ12m~Aλ12(2dn,s)Δm(j)Acr1h12λ12

by the definition of A. Hence, we have shown that E{m~F}. Combining this fact with (7.16), and noting that diam(F)=1, we have

P({Zn(m~)ACω(F,1)+x}E)Cexp(x2C),
(7.17)

By the definition of m, and the relationship Rem(j)=cr1nhZn(Δm(j)), we calculate that Zn(m~)=(12)h12n12dn,s1Rem(j)=(12)cr1hn12rn,s1Rem(j). Plugging the above form of Zn(m) into (7.17) and letting x = log N in (7.17), we have

P({Rem(j)Abn,s}E)Cexp(log2NC),
(7.18)

where we used the definition that bn,s=Ch1n12rn,s(ω(F,1)+logN). Therefore we have

P(Rem(j)Abn,s)P({Rem(j)Abn,s}E)+P(Ec)Cexp(log2NC)+P(Ec).
(7.19)

We have the following lemma that controls P(Ec).

Lemma 7.4

If Assumption 3.1, 3.2 and Condition (3.6) are satisfied, then there exist a constant c such that

P(Ec)=P(Δm(j)Arn,s)nexp(clog2N).

for all j = 1,...,s.

By Lemma 7.4 and (7.19) we have

P(Rem(j)Abn,s)nexp(clog2N).
(7.20)

(ii) We will use an Azuma-type inequality in Hilbert space to control the averaging remainder term s1j=1sRem(j), as all Rem(j) are independent and have zero mean. Define the event Aj={Rem(j)Abn,s}. By Lemma G.1, we have

P({jAj}{s1j=1sRem(j)A>s12bn,slogN})2exp(log2N2).
(7.21)

Moreover, by (7.20),

P(Ajc)nexp(clog2N).
(7.22)

Hence it follows that

P(s1j=1sRem(j)A>s12bn,slogN)P({j=1sAj}{s1j=1sRem(j)A>s1bn,slogN})+P(jAjc)

as N → ∞, where the last inequality follows from (7.21), (7.22) and union bound. This finishes the proof.

We can apply similar arguments as above to bound Remf(j)C and 1sj=1sRemf(j)C, by changing ω(F,1) to ω(F2,1), which is dominated by ω(F,1). The bounds of Remβ(j)2 and 1sj=1sRemβ(j)2 then follow from triangular inequality.

7.4. Proof of Theorem 3.4

Proof

In view of Theorem 3.4, we first prove

(n(β0(j)β0(j))Nh(f0(z0)f0(z0)Wλf0(z0)))0
(7.23)

for both (i) and (ii). By Proposition 2.3, we have

(β0(j)β0(j)f0(z0)f0(z0))=(Lλf0Wλf0(z0)+A(z0)TLλf0).
(7.24)

By Lemma A.5, it follows that under Assumption 3.3, Lλf02λ. Now we turn to f0(z0)f0(z0). Observe that

A(z)=A,K~zC=B,K~zL2(PZ)==1B,ϕL2(PZ)1+λμϕ(z),
(7.25)

Applying Cauchy-Schwarz, we obtain

Ak(z0)2(=1Bk,ϕL2(PZ)2μϕ2(z0))(=1μ(1+λμ)2)cϕ2BkH2Tr(K),

where the last inequality follows from the uniform boundedness of ϕ. Hence we have that Ak(z0) is uniformly bounded, which implies

A(z0)TLλf0A(z0)2Lλf02λ.

Therefore, if we choose λ=o((Nh)12n12), then we get (7.23), which eliminates the estimation bias for β0(j).

Now we consider the asymptotic variance for cases (i) and (ii). It suffices to show that αz0=limNh12WλA(z0). By Lemma A.2 and (7.25), we have

WλAk(z0)==1Bk,ϕL2(PZ)1+λμλλ+μϕ(z0)(=1Bk,ϕL2(PZ)2μϕ2(z0))(=1μ(1+λμ)2)cϕ2BkH2Tr(K).

Hence by dominated conference theorem, as λ → 0 we have WλAk(z0)0. As h = O(1), it follows that αz0=limNh12WλA(z0)=0.

When h → 0, we have γz0=limNh12A(z0)=0, as Ak(z0) is uniformly bounded. Hence Σ12=Σ21=0 and Σ22=σ2σz02.

Supplementary Material

01

Acknowledgments

Research Sponsored by NSF IIS1408910, NSF IIS1332109, NIH R01MH102339, NIH R01GM083084, and NIH R01HG06841

Research Sponsored by NSF CAREER Award DMS-1151692, DMS-1418042, Simons Foundation 305266. Guang Cheng was on sabbatical at Princeton while part of this work was carried out; he would like to thank the Princeton ORFE department for its hospitality.

Footnotes

AMS 2000 subject classifications: Primary 62G20, 62F25; secondary 62F10, 62F12

1The commonality estimator fN,λ can be adjusted as a weighted sum j=1s(njN)f^n,λ(j) if sub-sample sizes are different. In particular, the divide-and-conquer method can be applied to those sub-populations with huge sample sizes; see Section 5.

SUPPLEMENTARY MATERIAL

Supplementary material for: A Partially Linear Framework for Massive Heterogenous Data (DOI: To Be Assigned; .pdf). We provide the detailed proof in the supplement.

References

  • Aitkin M, Rubin DB. Estimation and hypothesis testing in finite mixture models. Journal of the Royal Statistical Society. Series B (Methodological) 1985:67–75.
  • Bach F. Sharp analysis of low-rank kernel matrix approximations. arXiv preprint arXiv. 2012:1208.2015.
  • Berlinet A, Thomas-Agnan C. Reproducing kernel Hilbert spaces in probability and statistics. Vol. 3. Springer; 2004.
  • Birman MŠ, Solomjak M. Piecewise-polynomial approximations of functions of the classes img align= absmiddle alt= w_p^ α tex_sm_2343_img1/img. Sbornik: Mathematics. 1967;2:295–317.
  • Carl B, Triebel H. Inequalities between eigenvalues, entropy numbers, and related quantities of compact operators in banach spaces. Mathematische Annalen. 1980;251:129–133.
  • Chen X, Xie M. Tech. rep., Technical Report 2012-01. Dept. Statistics, Rutgers Univ; 2012. A split-and-conquer approach for analysis of extraordinarily large data.
  • Cheng G, Shang Z. Joint asymptotics for semi-nonparametric models under penalization. arXiv. 2013:1311.2628.
  • Chernozhukov V, Chetverikov D, Kato K, et al. Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. The Annals of Statistics. 2013;41:2786–2819.
  • Fan J, Zhang W. Statistical estimation in varying coefficient models. Annals of Statistics. 1999:1491–1518.
  • Figueiredo MA, Jain AK. Unsupervised learning of finite mixture models. Pattern Analysis and Machine Intelligence, IEEE Transactions on. 2002;24:381–396.
  • Gu C. Smoothing spline ANOVA models. Vol. 297. Springer; 2013.
  • Guo W. Inference in smoothing spline analysis of variance. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2002;64:887–898.
  • Hastie T, Tibshirani R. Varying-coefficient models. Journal of the Royal Statistical Society. Series B (Methodological) 1993:757–796.
  • Huang J, Zhang T. The benefit of group sparsity. The Annals of Statistics. 2010;38:1978–2004.
  • Kleiner A, Talwalkar A, Sarkar P, Jordan M. The big data bootstrap. arXiv preprint arXiv. 2012:1206.6415.
  • Kosorok MR. Introduction to empirical processes and semiparametric inference. Springer; 2007.
  • Krasikov I. New bounds on the hermite polynomials. arXiv preprint math/0401310. 2004
  • Lafferty JD, Lebanon G. Diffusion kernels on statistical manifolds. 2005
  • Mammen E, van de Geer S. Penalized quasi-likelihood estimation in partial linear models. The Annals of Statistics. 1997:1014–1035.
  • McDonald R, Hall K, Mann G. Distributed training strategies for the structured perceptron. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics. 2010
  • McLachlan G, Peel D. Finite mixture models. John Wiley & Sons; 2004.
  • Meinshausen N, Bühlmann P. Maximin effects in inhomogeneous large-scale data. arXiv preprint arXiv. 2014:1406.0596.
  • Mendelson S. In Computational Learning Theory. Springer; 2002. Geometric parameters of kernel machines.
  • Nardi Y, Rinaldo A. On the asymptotic properties of the group lasso estimator for linear models. Electronic Journal of Statistics. 2008;2:605–633.
  • Obozinski G, Wainwright MJ, Jordan MI. Union support recovery in high-dimensional multivariate regression.. Communication, Control, and Computing, 2008 46th Annual Allerton Conference on; IEEE; 2008.
  • Pinelis I. Optimum bounds for the distributions of martingales in banach spaces. The Annals of Probability. 1994:1679–1706.
  • Raskutti G, Wainwright MJ, Yu B. Early stopping and non-parametric regression: an optimal data-dependent stopping rule. The Journal of Machine Learning Research. 2014;15:335–366.
  • Shang Z, Cheng G. Local and global asymptotic inference in smoothing spline models. The Annals of Statistics. 2013;41:2608–2638.
  • Shawe-Taylor J, Cristianini N. Kernel methods for pattern analysis. Cambridge university press; 2004.
  • Sollich P, Williams CK. Deterministic and Statistical Methods in Machine Learning. Springer; 2005. Understanding gaussian process regression using the equivalent kernel. pp. 211–228.
  • Städler N, Bühlmann P, van de Geer S. #21131-penalization for mixture regression models. Test. 2010;19:209–256.
  • Steinwart I, Hush DR, Scovel C, et al. Optimal rates for regularized least squares regression. COLT; 2009.
  • Stewart GW, Sun J.-g. Matrix perturbation theory. 1990
  • Stone CJ. Additive regression and other nonparametric models. The annals of Statistics. 1985:689–705.
  • Tropp JA. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics. 2012;12:389–434.
  • Tsybakov AB, Zaiats V. Introduction to nonparametric estimation. Vol. 11. Springer; 2009.
  • Van Der Vaart AW, Wellner JA. Weak Convergence. Springer; 1996.
  • van Handel R. Probability in high dimension: Lecture notes. 2014
  • Wahba G. Spline models for observational data. Vol. 59. Siam; 1990.
  • Wang X, Dunson DB. Parallel mcmc via weierstrass sampler. arXiv preprint arXiv. 2013:1312.4605.
  • Wang Y. Smoothing splines: methods and applications. CRC Press; 2011.
  • Wasserman L. Steins method and the bootstrap in low and high dimensions: A tutorial. 2014
  • Williamson RC, Smola AJ, Scholkopf B. Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators. Information Theory, IEEE Transactions on. 2001;47:2516–2532.
  • Yang Y, Barron A. Information-theoretic determination of minimax rates of convergence. Annals of Statistics. 1999:1564–1599.
  • Zhang T. Learning bounds for kernel regression using effective data dimensionality. Neural Computation. 2005;17:2077–2098. [PubMed]
  • Zhang Y, Duchi J, Wainwright M. Divide and conquer kernel ridge regression. Conference on Learning Theory. 2013