J Mach Learn Res. Author manuscript; available in PMC 2017 March 16.
Published in final edited form as:
J Mach Learn Res. 2015 August; 16: 1579–1606.
PMCID: PMC5354374
NIHMSID: NIHMS752268

Calibrated Multivariate Regression with Application to Neural Semantic Basis Discovery*

Han Liu, Department of Operations Research and Financial Engineering, Princeton University, NJ 08544, USA, hanliu@princeton.edu
Lie Wang, Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA 02139, USA, liewang@math.mit.edu

Abstract

We propose a calibrated multivariate regression method named CMR for fitting high dimensional multivariate regression models. Compared with existing methods, CMR calibrates the regularization for each regression task with respect to its noise level so that it simultaneously attains improved finite-sample performance and tuning insensitivity. Theoretically, we provide sufficient conditions under which CMR achieves the optimal rate of convergence in parameter estimation. Computationally, we propose an efficient smoothed proximal gradient algorithm with a worst-case numerical rate of convergence O(1/ϵ), where ϵ is a pre-specified accuracy of the objective function value. We conduct thorough numerical simulations to illustrate that CMR consistently outperforms other high dimensional multivariate regression methods. We also apply CMR to solve a brain activity prediction problem and find that it is competitive with a handcrafted model created by human experts. The R package camel implementing the proposed method is available on the Comprehensive R Archive Network http://cran.r-project.org/web/packages/camel/.

Keywords: calibration, multivariate regression, high dimension, sparsity, low rank, brain activity prediction

1. Introduction

This paper studies the multivariate regression problem. Let X ∈ R^{n×d} be the design matrix and Y ∈ R^{n×m} be the response matrix. We consider the linear model

$$Y = XB^0 + Z,$$
(1)

where B^0 ∈ R^{d×m} is an unknown regression coefficient matrix and Z ∈ R^{n×m} is a noise matrix (Anderson, 1958; Breiman and Friedman, 2002). For a matrix A = [A_{jk}] ∈ R^{d×m}, we denote its jth row and kth column by A_{j*} = (A_{j1}, …, A_{jm}) ∈ R^m and A_{*k} = (A_{1k}, …, A_{dk})^T ∈ R^d, respectively. We assume that all rows Z_{i*} are independently sampled from an m-dimensional distribution with mean 0 and covariance matrix Σ ∈ R^{m×m}.

We can represent (1) as an ensemble of univariate linear regression models:

$$Y_{*k} = XB^0_{*k} + Z_{*k}, \qquad k = 1, \ldots, m,$$

which results in a multi-task learning problem (Baxter, 2000; Caruana, 1997; Caruana et al., 1996; Thrun, 1996; Ando and Zhang, 2005; Johnson and Zhang, 2008; Zhang et al., 2006; Zhang, 2006). Multi-task learning exploits shared common structure across tasks to obtain improved estimation performance. In the past decade, significant progress has been made on designing various modeling assumptions for multivariate regression.

One popular approach is to assume that the regression coefficients across different tasks are coupled by some shared common factors, so that B^0 has a low rank structure, i.e., rank(B^0) ≪ min(d, m). Under this assumption, a consistent estimator of B^0 can be obtained by adopting either a non-convex rank constraint (Anderson, 1958; Izenman, 1975; Reinsel and Velu, 1998; Anderson, 1999; Izenman, 2008) or a convex relaxation using nuclear norm regularization (Yuan et al., 2007; Amit et al., 2007; Argyriou et al., 2008; Negahban and Wainwright, 2011; Rohde and Tsybakov, 2011; Bunea et al., 2011, 2012; Bunea and Barbu, 2009; Mukherjee et al., 2012; Giraud, 2011; Argyriou et al., 2010; Foygel and Srebro, 2011; Johnson and Zhang, 2008; Salakhutdinov and Srebro, 2010; Evgeniou et al., 2006; Heskes, 2000; Teh et al., 2005; Yu et al., 2005). Such a low rank multivariate regression method is often applied to scenarios where m is large.

Another approach is to assume that all the regression tasks share a common sparsity pattern, i.e., many rows B^0_{j*} are zero vectors. Such a joint sparsity assumption for multivariate regression is a natural extension of sparsity in univariate linear regression. Similar to the L1-regularization used in the Lasso (Tibshirani, 1996; Chen et al., 1998), group regularization can be used to obtain a consistent estimator of B^0 (Yuan and Lin, 2005; Turlach et al., 2005; Meier et al., 2008; Lounici et al., 2011; Kolar et al., 2011). Such a sparse multivariate regression method is often applied to scenarios where the dimension d is large.

In this paper, we consider an uncorrelated structure for the noise matrix Z, i.e.,

$$\Sigma = \mathrm{diag}\big(\sigma_1^2, \sigma_2^2, \ldots, \sigma_{m-1}^2, \sigma_m^2\big).$$
(2)

Such an assumption allows us to efficiently solve the resulting estimation problem with a convex program and to prove that the obtained estimator achieves the minimax optimal rates of convergence in parameter estimation.1 For example, much existing work proposes to solve the convex program

$$\hat B = \mathop{\mathrm{argmin}}_{B}\ \frac{1}{n}\|Y - XB\|_F^2 + \lambda R(B),$$
(3)

where λ > 0 is a tuning parameter, R(B) is a regularization function of B, and ‖A‖_F = (∑_{j,k} A_{jk}²)^{1/2} is the Frobenius norm of a matrix A. Popular choices of R(B) include

$$\text{Nuclear norm:}\quad \|B\|_* = \sum_{j=1}^{r}\psi_j(B),$$
(4)

$$L_{1,p}\ \text{norm:}\quad \|B\|_{1,p} = \sum_{j=1}^{d}\Big(\sum_{k=1}^{m}|B_{jk}|^p\Big)^{\frac1p} \quad \text{for } 2 \le p < \infty,$$
(5)

$$L_{1,\infty}\ \text{norm:}\quad \|B\|_{1,\infty} = \sum_{j=1}^{d}\max_{1\le k\le m}|B_{jk}|,$$
(6)

where r in (4) is the rank of B and ψ_j(B) represents the jth largest singular value of B. The optimization problem (3) can be efficiently solved by block coordinate descent algorithms (Liu et al., 2009a,b; Liu and Ye, 2010; Zhao et al., 2014a,c), fast proximal gradient algorithms (Toh and Yun, 2010; Beck and Teboulle, 2009a,b), and the alternating direction method of multipliers (Boyd et al., 2011; Liu et al., 2014b). Scalable software packages such as MALSAR have been developed (Zhou et al., 2012).

Under the noise structure in (2), the estimation problem is amenable to statistical analysis. Let σ_max = max_k σ_k and let ‖X‖_2 = ψ_1(X) denote the largest singular value of X. Under suitable conditions on the noise and design matrices, if we choose

$$\text{Low rank:}\quad \lambda = \frac{2c\,\|X\|_2}{n}\,\sigma_{\max}\big(\sqrt{d}+\sqrt{m}\big),$$
(7)

$$\text{Joint sparsity:}\quad \lambda = 2c\,\sigma_{\max}\big(\sqrt{\log d} + m^{1-\frac1p}\big),$$
(8)

for some c > 1, then the estimator B^ in (3) achieves the optimal rates of convergence2 (Lounici et al., 2011; Rohde and Tsybakov, 2011). More specifically, there exists some universal constant C such that, with high probability,

$$\text{Low rank:}\quad \frac{1}{\sqrt{m}}\|\hat B - B^0\|_F \le \frac{C\,\|X\|_2}{\sqrt{n}}\,\sigma_{\max}\left(\sqrt{\frac{r}{n}}+\sqrt{\frac{rd}{nm}}\right),$$

$$\text{Joint sparsity:}\quad \frac{1}{\sqrt{m}}\|\hat B - B^0\|_F \le C\,\sigma_{\max}\left(\sqrt{\frac{s\log d}{nm}}+\frac{\sqrt{s}\,m^{\frac12-\frac1p}}{\sqrt{n}}\right),$$

where r is the rank of B0 for the low rank setting and s is the number of rows with non-zero entries in B0 for the setting of joint sparsity.

The estimator in (3) has two drawbacks: (i) All the tasks are regularized by the same tuning parameter λ, even though different tasks may have different σ_k's. Thus more estimation bias is introduced to the tasks with smaller σ_k's, since they have to compensate for the tasks with larger σ_k's. In other words, these tasks are not calibrated (Zhao and Liu, 2014). (ii) The tuning parameter selection, as shown in (7) and (8), involves the unknown parameter σ_max. This requires the regularization parameter to be carefully tuned over a wide range of potential values in order to achieve good finite-sample performance.

To overcome the above two drawbacks, we propose a new method named calibrated multivariate regression (CMR) based on the convex program

$$\hat B = \mathop{\mathrm{argmin}}_{B}\ \|Y - XB\|_{2,1} + \lambda R(B)$$
(9)

where ‖A‖_{2,1} = ∑_k (∑_j A_{jk}²)^{1/2} is the L_{2,1} norm of a matrix A = [A_{jk}] ∈ R^{d×m}. This is a multivariate extension of the square-root Lasso estimator (Belloni et al., 2011; Sun and Zhang, 2012). Similar to the square-root Lasso, the tuning parameter selection of CMR does not involve σ_max. Thus the resulting procedure adapts to different σ_k's and achieves improved finite-sample performance compared with the ordinary multivariate regression estimator (OMR) defined in (3). Since both the loss and regularization functions in (9) are nonsmooth, CMR is computationally more challenging than OMR. To solve CMR efficiently, we develop a smoothed proximal gradient algorithm with a worst-case iteration complexity of O(1/ϵ), where ϵ is a pre-specified accuracy of the objective value (Nesterov, 2005; Chen et al., 2012; Zhao and Liu, 2012; Zhao et al., 2014b). Theoretically, we show that under suitable conditions, CMR achieves the optimal rates of convergence in parameter estimation. Numerical experiments on both synthetic and real data show that CMR universally outperforms existing multivariate regression methods. For a brain activity prediction task, prediction based on the features selected by CMR significantly outperforms that based on the features selected by OMR, and is even competitive with that based on the handcrafted features selected by human experts.

This paper is organized as follows: In §2, we describe the CMR method. In §3, we investigate the statistical properties of CMR. In §4, we derive a smoothed proximal gradient algorithm for solving the CMR optimization problem. In §5, we conduct numerical experiments to illustrate the usefulness of the proposed method. In §6, we discuss the relationships between our results and other related work. Notation: Given a vector v = (v_1, …, v_d)^T ∈ R^d, we define the vector norms ‖v‖_p = (∑_{j=1}^d |v_j|^p)^{1/p} for 1 ≤ p < ∞ and ‖v‖_∞ = max_{1≤j≤d} |v_j|. Given two matrices A = [A_{jk}] and C = [C_{jk}] ∈ R^{d×m}, we define the inner product of A and C as ⟨A, C⟩ = ∑_{j=1}^d ∑_{k=1}^m A_{jk}C_{jk} = tr(A^T C), where tr(A) is the trace of a matrix A. We use A_{*k} = (A_{1k}, …, A_{dk})^T and A_{j*} = (A_{j1}, …, A_{jm}) to denote the kth column and jth row of A. Let S be some subspace of R^{d×m}; we use A_S to denote the projection of A onto S, i.e., A_S = argmin_{C∈S} ‖C − A‖_F². Given a subspace U ⊆ R^d, we define its orthogonal complement as U^⊥ = {u ∈ R^d | u^T v = 0 for all v ∈ U}. Moreover, we define the Frobenius, spectral, and nuclear norms of A as ‖A‖_F = √⟨A, A⟩, ‖A‖_2 = ψ_1(A), and ‖A‖_* = ∑_{j=1}^r ψ_j(A), where r is the rank of A and ψ_j(A) is the jth largest singular value of A. In addition, we define the matrix block norms as ‖A‖_{2,1} = ∑_{k=1}^m ‖A_{*k}‖_2, ‖A‖_{2,∞} = max_{1≤k≤m} ‖A_{*k}‖_2, ‖A‖_{1,p} = ∑_{j=1}^d ‖A_{j*}‖_p, and ‖A‖_{∞,q} = max_{1≤j≤d} ‖A_{j*}‖_q, where 1 ≤ p ≤ ∞ and 1 ≤ q ≤ ∞. It is easy to verify that ‖A‖_{2,1} and ‖A‖_* are the dual norms of ‖A‖_{2,∞} and ‖A‖_2, respectively. With the convention 1/∞ = 0, if 1/p + 1/q = 1, then ‖A‖_{∞,q} and ‖A‖_{1,p} are also dual norms of each other.
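The block norms above are straightforward to compute. The following minimal numpy sketch (an illustration, not part of the paper's software) evaluates ‖A‖_{2,1}, ‖A‖_{2,∞}, ‖A‖_{1,p}, ‖A‖_{∞,q}, and ‖A‖_* for a small matrix, with p = 3 and its conjugate q = 3/2 chosen arbitrarily.

```python
import numpy as np

# Small numerical illustration of the matrix block norms defined above;
# columns are A_{*k} and rows are A_{j*}, following the paper's notation.
A = np.arange(12, dtype=float).reshape(4, 3)            # a 4 x 3 matrix
p, q = 3.0, 1.5                                         # conjugate pair: 1/p + 1/q = 1

norm_2_1   = np.linalg.norm(A, axis=0).sum()            # ||A||_{2,1}: sum of column 2-norms
norm_2_inf = np.linalg.norm(A, axis=0).max()            # ||A||_{2,inf}: max column 2-norm
norm_1_p   = np.linalg.norm(A, ord=p, axis=1).sum()     # ||A||_{1,p}: sum of row p-norms
norm_inf_q = np.linalg.norm(A, ord=q, axis=1).max()     # ||A||_{inf,q}: max row q-norm
nuclear    = np.linalg.norm(A, 'nuc')                   # ||A||_*: sum of singular values
print(norm_2_1, norm_2_inf, norm_1_p, norm_inf_q, nuclear)
```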

2. Method

We solve the multivariate regression problem in (1) by the convex program

$$\hat B = \mathop{\mathrm{argmin}}_{B}\ \|Y - XB\|_{2,1} + \lambda R(B),$$
(10)

where R(B) is a regularization function and can take the forms in (4), (5), and (6).

To understand the intuition behind (10), we show that the L_{2,1} loss can be viewed as a special case of the weighted squared loss function. More specifically, we consider the optimization problem

$$\hat B = \mathop{\mathrm{argmin}}_{B}\ \sum_{k=1}^{m}\frac{1}{\sigma_k\sqrt{n}}\|Y_{*k} - XB_{*k}\|_2^2 + \lambda R(B),$$
(11)

where 1/(σ_k√n) is the weight that calibrates the kth regression task. B̂ in (11) is an “oracle” estimator (not practically calculable), since it assumes that all σ_k's are given. Without any prior knowledge of the σ_k's, we can use the following replacement for the σ_k's:

$$\tilde\sigma_k = \frac{1}{\sqrt{n}}\|Y_{*k} - XB_{*k}\|_2, \qquad k = 1, \ldots, m.$$
(12)

We then recover (10) by replacing σ_k in (11) with σ̃_k defined in (12). In other words, CMR calibrates the different tasks by solving a regularized weighted least squares problem with the weights defined in (12).
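The equivalence between the L_{2,1} loss and the calibrated weighted squared loss can be checked numerically. The sketch below (ours, not the authors' camel package) plugs the weights σ̃_k from (12) into the objective of (11) and verifies that the two losses coincide.

```python
import numpy as np

# Verify that sum_k ||Y_k - X B_k||_2^2 / (sigma_tilde_k * sqrt(n)) equals
# ||Y - XB||_{2,1} when sigma_tilde_k = ||Y_k - X B_k||_2 / sqrt(n), as in (12).
rng = np.random.default_rng(0)
n, d, m = 50, 10, 4
X = rng.standard_normal((n, d))
B = rng.standard_normal((d, m))
Y = X @ B + rng.standard_normal((n, m))

R = Y - X @ B                                    # residual matrix, n x m
l21_loss = np.linalg.norm(R, axis=0).sum()       # ||Y - XB||_{2,1}

sigma_tilde = np.linalg.norm(R, axis=0) / np.sqrt(n)      # per-task scales (12)
weighted_sq = np.sum(np.linalg.norm(R, axis=0) ** 2
                     / (sigma_tilde * np.sqrt(n)))        # loss term of (11)

print(np.isclose(l21_loss, weighted_sq))         # True: the two losses coincide
```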

3. Statistical Properties

For notational simplicity, we define a rescaled noise matrix W = [W_{ik}] ∈ R^{n×m} with W_{ik} = Z_{ik}/σ_k, where E[Z_{ik}²] = σ_k² is defined in (2). Thus W is a random matrix with all entries having mean 0 and variance 1. We define G^0 as the gradient of ‖Y − XB‖_{2,1} at B = B^0. We see that G^0 does not depend on the unknown quantities σ_k, since

$$G^0_{*k} = \frac{X^T Z_{*k}}{\|Z_{*k}\|_2} = \frac{X^T W_{*k}\,\sigma_k}{\|W_{*k}\,\sigma_k\|_2} = \frac{X^T W_{*k}}{\|W_{*k}\|_2}.$$

Thus it serves as an important pivotal quantity in our analysis. Moreover, our analysis exploits the decomposability of R(B), which is satisfied by the nuclear and L_{1,p} norms (Negahban et al., 2012).

Definition 1

Let S and N be two subspaces of R^{d×m} that are orthogonal to each other and satisfy S ⊆ N^⊥. A regularization function R(·) is decomposable with respect to the pair (S, N) if we have

$$R(A + C) = R(A) + R(C) \quad \text{for } A \in S \text{ and } C \in N.$$

The decomposability of R(B) is important in analyzing the statistical properties of the estimator in (10). The next lemma shows that if we choose S to be some subspace of Rd×m containing the true parameter B0, given a decomposable regularizer and a suitably chosen λ, the optimum to (10) lies in a restricted set.

Lemma 2

Let B^0 ∈ S and let B̂ be an arbitrary3 optimum to (10). We denote the estimation error as Δ̂ = B̂ − B^0 and the dual norm of R(·) as R*(·). If λ ≥ cR*(G^0) for some c > 1, we have

$$\hat\Delta \in \mathcal{M}_c = \left\{\Delta \in \mathbb{R}^{d\times m} \;\middle|\; R(\Delta_N) \le \frac{c+1}{c-1}\,R(\Delta_{N^\perp})\right\}.$$
(13)

The proof of Lemma 2 is provided in Appendix A. To prove the main result, we assume that the design matrix X satisfies a generalized restricted eigenvalue condition as below.

Assumption 1

Let B^0 ∈ S. Then there exist a positive constant κ and a constant c > 1 such that

$$\kappa = \min_{\Delta \in \mathcal{M}_c \setminus \{0\}} \frac{\|X\Delta\|_F}{\sqrt{n}\,\|\Delta\|_F}.$$

Assumption 1 is the generalization of the restricted eigenvalue conditions for analyzing univariate sparse linear models (Negahban et al., 2012; Bickel et al., 2009). Many design matrices satisfy this assumption with high probability (Lounici et al., 2011; Negahban and Wainwright, 2011; Rohde and Tsybakov, 2011; Raskutti et al., 2010).

3.1 Main Result

We first present a deterministic result for a general norm-based regularization function R(·), which satisfies the decomposability in Definition 1.

Theorem 3

Suppose that the design matrix X satisfies Assumption 1. Let B^ be an arbitrary optimum to (10), and G0 be the gradient of ||YXB||2,1 at B = B0. We denote

$$\Theta(N^\perp, R) = \max_{A \in \mathbb{R}^{d\times m}\setminus\{0\}} \frac{R(A_{N^\perp})}{\|A_{N^\perp}\|_F}.$$

Let λ satisfy

$$2\lambda\,\Theta(N^\perp, R) \le \delta(c-1)\sqrt{n}\,\kappa \ \text{ for some } \delta < 1, \qquad \text{and} \qquad \lambda \ge c\,R^*(G^0).$$

Then we have

$$\frac{1}{\sqrt{nm}}\|X\hat B - XB^0\|_F \le \frac{4\lambda\,\Theta(N^\perp,R)\,\sigma_{\max}}{\sqrt{m}\,n\,\kappa(c-1)(1-\delta)}\,\|W\|_{2,\infty},$$

$$\frac{1}{\sqrt{m}}\|\hat B - B^0\|_F \le \frac{4\lambda\,\Theta(N^\perp,R)\,\sigma_{\max}}{\sqrt{m}\,n\,\kappa^2(c-1)(1-\delta)}\,\|W\|_{2,\infty},$$

where σ_max = max_{1≤k≤m} σ_k. Moreover, if we estimate the σ_k's by

$$\hat\sigma_k = \frac{1}{\sqrt{n}}\|Y_{*k} - X\hat B_{*k}\|_2 \quad \text{for all } k = 1, \ldots, m,$$
(14)

then we have

$$\frac{1}{m}\left|\sum_{k=1}^m\hat\sigma_k - \sum_{k=1}^m\sigma_k\right| \le \max\left\{1, \frac{2}{c-1}\right\}\frac{4\lambda^2\,\Theta^2(N^\perp,R)\,\sigma_{\max}}{\sqrt{n}\,m\,n\,\kappa(c-1)(1-\delta)}\,\|W\|_{2,\infty}.$$

The proof of Theorem 3 is provided in Appendix B. Note that Theorem 3 is a deterministic bound of the CMR estimator for a fixed λ. Since W is a random matrix, we need to bound ||W||2,∞ and show that λ ≥ cR*(G0) holds with high probability. For simplicity, we assume that each entry of W follows a Gaussian distribution as follows.

Assumption 2

All Wiks are independently generated from N(0,1).

We then refine error bounds of the CMR estimator under Assumption 2 for calibrated sparse multivariate regression and calibrated low rank multivariate regression respectively.

3.2 Calibrated Low Rank Multivariate Regression

We assume that the rank of B^0 is r ≪ min{d, m}, and that B^0 has the singular value decomposition B^0 = ∑_{j=1}^r ψ_j(B^0) u_j v_j^T, where ψ_j(B^0) is the jth largest singular value and u_j and v_j are the corresponding left and right singular vectors. We define

$$U = \mathrm{span}(\{u_1, \ldots, u_r\}) \subseteq \mathbb{R}^d \quad \text{and} \quad V = \mathrm{span}(\{v_1, \ldots, v_r\}) \subseteq \mathbb{R}^m.$$

We then define S and N as follows,

$$S = \big\{C \in \mathbb{R}^{d\times m} \mid C_{*k} \in U,\ C_{j*}^T \in V \ \text{for all } j, k\big\},$$
(15)

$$N = \big\{C \in \mathbb{R}^{d\times m} \mid C_{*k} \in U^{\perp},\ C_{j*}^T \in V^{\perp} \ \text{for all } j, k\big\}.$$
(16)

We can easily verify that B^0 ∈ S and that the nuclear norm is decomposable with respect to the pair (S, N), i.e.,

$$\|A + C\|_* = \|A\|_* + \|C\|_* \quad \text{for } A \in S \text{ and } C \in N.$$

The next corollary provides the concrete rates of convergence for the calibrated low rank multivariate regression estimator.

Corollary 4

We assume that the design matrix X satisfies Assumption 1 with S and N chosen as in (15) and (16), and each column of X is normalized so that

$$\frac{\|X_{*j}\|_2}{\sqrt{n}} = 1 \quad \text{for all } j = 1, \ldots, d.$$
(17)

We also assume that the rescaled noise matrix W satisfies Assumption 2. By Theorem 3, for some universal constants c_0 ∈ (0, 1), c_1 > 0, and large enough n, we take

$$\lambda = \frac{2c\,\|X\|_2\,(\sqrt{d}+\sqrt{m})}{\sqrt{n(1-c_0)}},$$
(18)

then for some δ < 1, we have

$$\frac{1}{\sqrt{nm}}\|X\hat B - XB^0\|_F \le \frac{8\sqrt{2}\,c\,\|X\|_2\,\sigma_{\max}}{\sqrt{n}\,\kappa(c-1)(1-\delta)}\sqrt{\frac{1+c_0}{1-c_0}}\left(\sqrt{\frac{r}{n}} + \sqrt{\frac{rd}{nm}}\right),$$
$$\frac{1}{\sqrt{m}}\|\hat B - B^0\|_F \le \frac{8\sqrt{2}\,c\,\|X\|_2\,\sigma_{\max}}{\sqrt{n}\,\kappa^2(c-1)(1-\delta)}\sqrt{\frac{1+c_0}{1-c_0}}\left(\sqrt{\frac{r}{n}} + \sqrt{\frac{rd}{nm}}\right),$$
$$\frac{1}{m}\left|\sum_{k=1}^m\hat\sigma_k - \sum_{k=1}^m\sigma_k\right| \le \max\left\{1, \frac{2}{c-1}\right\}\frac{64\,c^2\,\|X\|_2^2\,\sigma_{\max}}{n\,\kappa(c-1)(1-\delta)}\sqrt{\frac{1+c_0}{1-c_0}}\left(\frac{rd}{nm} + \frac{r}{n}\right)$$

with probability at least 1 − 2exp(−c_1 d − c_1 m) − 2exp(−nc_0²/8 + log m).

The proof of Corollary 4 is provided in Appendix C. The rate of convergence obtained in Corollary 4 matches the minimax lower bound4 presented in Rohde and Tsybakov (2011). See more details in Theorems 5 and 6 of Rohde and Tsybakov (2011).

3.3 Calibrated Sparse Multivariate Regression

We now assume that the multivariate regression model in (1) is jointly sparse. More specifically, we assume that B0 has s rows with nonzero entries and define

$$S = \big\{C \in \mathbb{R}^{d\times m} \mid C_{j*} = 0 \ \text{for all } j \text{ such that } B^0_{j*} = 0\big\},$$
(19)

$$N = \big\{C \in \mathbb{R}^{d\times m} \mid C_{j*} = 0 \ \text{for all } j \text{ such that } B^0_{j*} \neq 0\big\}.$$
(20)

We can easily verify that B^0 ∈ S and that the L_{1,p} norm is decomposable with respect to the pair (S, N), i.e.,

$$\|A + C\|_{1,p} = \|A\|_{1,p} + \|C\|_{1,p} \quad \text{for } A \in S \text{ and } C \in N.$$

The next corollary provides the concrete rates of convergence for the calibrated sparse multivariate regression estimator.

Corollary 5

We assume that the design matrix X satisfies Assumption 1 with S and N chosen as in (19) and (20), and each column of X is normalized so that

$$\frac{m^{\frac12 - \frac1p}\,\|X_{*j}\|_2}{\sqrt{n}} = 1 \quad \text{for all } j = 1, \ldots, d.$$
(21)

We also assume that the rescaled noise matrix W satisfies Assumption 2. By Theorem 3, for some universal constant c_0 ∈ (0, 1) and large enough n, let

$$\lambda = \frac{2c\,\big(m^{1-\frac1p} + \sqrt{\log d}\big)}{\sqrt{1-c_0}},$$
(22)

then for some δ < 1, we have

$$\frac{1}{\sqrt{nm}}\|X\hat B - XB^0\|_F \le \frac{8c\,\sigma_{\max}}{\kappa(c-1)(1-\delta)}\sqrt{\frac{1+c_0}{1-c_0}}\left(\frac{\sqrt{s}\,m^{\frac12-\frac1p}}{\sqrt{n}} + \sqrt{\frac{s\log d}{nm}}\right),$$
$$\frac{1}{\sqrt{m}}\|\hat B - B^0\|_F \le \frac{8c\,\sigma_{\max}}{\kappa^2(c-1)(1-\delta)}\sqrt{\frac{1+c_0}{1-c_0}}\left(\frac{\sqrt{s}\,m^{\frac12-\frac1p}}{\sqrt{n}} + \sqrt{\frac{s\log d}{nm}}\right),$$
$$\frac{1}{m}\left|\sum_{k=1}^m\hat\sigma_k - \sum_{k=1}^m\sigma_k\right| \le \max\left\{1, \frac{2}{c-1}\right\}\frac{32\,c^2\,\sigma_{\max}}{\kappa(c-1)(1-\delta)}\sqrt{\frac{1+c_0}{1-c_0}}\left(\frac{s\,m^{1-\frac2p}}{n} + \frac{s\log d}{nm}\right)$$

with probability at least 1 − 2exp(−2 log d) − 2exp(−nc_0²/8 + log m).

The proof of Corollary 5 is provided in Appendix D. Note that when we choose p = 2, the column normalization condition (21) becomes

$$\frac{\|X_{*j}\|_2}{\sqrt{n}} = 1 \quad \text{for all } j = 1, \ldots, d,$$

which is the same as (17). Then Corollary 5 implies that with high probability, we have

$$\frac{1}{\sqrt{m}}\|\hat B - B^0\|_F \le \frac{8c\,\sigma_{\max}}{\kappa^2(c-1)(1-\delta)}\sqrt{\frac{1+c_0}{1-c_0}}\left(\sqrt{\frac{s}{n}} + \sqrt{\frac{s\log d}{nm}}\right).$$
(23)

The rate of convergence obtained in (23) matches the minimax lower bound presented in Lounici et al. (2011). See more details in Theorem 6.1 of Lounici et al. (2011).

Remark 6

From Corollaries 4 and 5, we see that CMR achieves the same rates of convergence as the noncalibrated counterpart in parameter estimation. Moreover, the selected regularization parameter λ in (18) and (22) does not involve σk ’s. Therefore CMR makes the regularization parameter selection insensitive to σmax.

4. Computational Algorithm

Though the L_{2,1} norm is nonsmooth, it is nondifferentiable only when a task achieves an exactly zero residual, which is unlikely to happen in practice. This motivates us to apply the smoothing approach proposed by Nesterov (2005) to obtain a smooth approximation, so that we can avoid directly evaluating the subgradient of the L_{2,1} loss function and gain the computational efficiency of smooth loss functions.

4.1 Smooth Approximation

We consider Fenchel's dual representation of the L_{2,1} loss:

$$\|Y - XB\|_{2,1} = \max_{\|U\|_{2,\infty} \le 1} \langle U, Y - XB\rangle.$$

Let µ > 0 be a smoothing parameter. The smooth approximation of the L2,1 loss can be obtained by solving the optimization problem

$$\|Y - XB\|_{\mu} = \max_{\|U\|_{2,\infty} \le 1} \langle U, Y - XB\rangle - \frac{\mu}{2}\|U\|_F^2.$$
(24)

Note that the maximum in (24) is attained at U = Û^B with

$$\hat U^B_{*k} = \frac{Y_{*k} - XB_{*k}}{\max\{\|Y_{*k} - XB_{*k}\|_2,\ \mu\}}.$$

Nesterov (2005) has shown that ‖Y − XB‖_µ has good computational structure: (1) It is convex and differentiable with respect to B; (2) Its gradient takes the simple form

$$G_{\mu}(B) = \frac{\partial \|Y - XB\|_{\mu}}{\partial B} = \frac{\partial\big(\langle \hat U^B, Y - XB\rangle - \frac{\mu}{2}\|\hat U^B\|_F^2\big)}{\partial B} = -X^T \hat U^B;$$

(3) Let γ = ‖X^T X‖_2; then G_µ(B) is Lipschitz continuous in B with Lipschitz constant γ/µ, i.e., for any B′, B′′ ∈ R^{d×m},

$$\|G_{\mu}(B') - G_{\mu}(B'')\|_F \le \frac{\gamma}{\mu}\|B' - B''\|_F.$$

Therefore we consider a smoothed replacement of the optimization problem in (10):

$$\tilde B = \mathop{\mathrm{argmin}}_{B}\ \|Y - XB\|_{\mu} + \lambda R(B).$$
(25)
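For concreteness, the following Python sketch evaluates the smoothed loss ‖Y − XB‖_µ of (24) and its gradient G_µ(B) = −X^T Û^B. It is an illustration of the formulas above under our reading of them, not the authors' implementation.

```python
import numpy as np

# Smoothed L_{2,1} loss of (24) and its gradient; U below is the maximizer U^B.
def smoothed_l21_loss_and_grad(X, Y, B, mu=1e-4):
    R = Y - X @ B                                  # residual matrix, n x m
    col_norms = np.linalg.norm(R, axis=0)          # ||Y_k - X B_k||_2 per task
    U = R / np.maximum(col_norms, mu)              # clipped residual directions
    loss = np.sum((U * R).sum(axis=0) - 0.5 * mu * (U ** 2).sum(axis=0))
    grad = -X.T @ U                                # Lipschitz constant ||X^T X||_2 / mu
    return loss, grad
```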

4.2 Smoothed Proximal Gradient Algorithm

We then present a brief derivation of the smoothed proximal gradient algorithm for solving (25). We first define three sequences of auxiliary variables {A^{(t)}}, {V^{(t)}}, and {H^{(t)}} with A^{(0)} = H^{(0)} = V^{(0)} = B^{(0)}, a sequence of weights {θ_t = 2/(t + 1)}, and a nonincreasing sequence of step sizes {η_t}_{t≥0}.

At the tth iteration, we take V^{(t)} = (1 − θ_t)B^{(t−1)} + θ_t A^{(t−1)}. Let H̃^{(t)} = V^{(t)} − η_t G_µ(V^{(t)}). When R(H) = ‖H‖_*, we take

$$H^{(t)} = \sum_{j=1}^{\min\{d,m\}} \max\big\{\psi_j(\tilde H^{(t)}) - \eta_t\lambda,\ 0\big\}\, u_j v_j^T,$$

where uj and vj are the left and right singular vectors of H~(t) corresponding to the jth largest singular value ψj(H~(t)). When R(H) = ||H||1,2, we take

$$H^{(t)}_{j*} = \tilde H^{(t)}_{j*}\,\max\left\{1 - \frac{\eta_t\lambda}{\|\tilde H^{(t)}_{j*}\|_2},\ 0\right\}.$$

See more details about other choices of p in the L1,p norm in Liu et al. (2009a); Liu and Ye (2010). To ensure that the objective function value is nonincreasing, we choose

$$B^{(t)} = \mathop{\mathrm{argmin}}_{B \in \{H^{(t)},\ B^{(t-1)}\}} \|Y - XB\|_{\mu} + \lambda R(B).$$

For simplicity, we can set {η_t} as a constant sequence, e.g., η_t = µ/γ for t = 1, 2, …. In practice, we can use a backtracking line search to adjust η_t and boost the performance. Finally, we take A^{(t)} = B^{(t−1)} + (1/θ_t)(H^{(t)} − B^{(t−1)}). Given a stopping precision ε, the algorithm stops when max{‖B^{(t)} − B^{(t−1)}‖_F, ‖H^{(t)} − H^{(t−1)}‖_F} ≤ ε.
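The sketch below assembles the updates above into a complete loop for the joint-sparsity case R(B) = ‖B‖_{1,2}, with the constant step size η_t = µ/γ and the monotone objective check. It is an illustrative rendering of the algorithm as described, not the authors' reference code, and the function names are ours.

```python
import numpy as np

def prox_l12(H, tau):
    """Row-wise soft-thresholding: proximal map of tau * ||.||_{1,2}."""
    row_norms = np.linalg.norm(H, axis=1, keepdims=True)
    return H * np.maximum(1.0 - tau / np.maximum(row_norms, 1e-12), 0.0)

def smoothed_l21(X, Y, B, mu):
    R = Y - X @ B
    U = R / np.maximum(np.linalg.norm(R, axis=0), mu)
    loss = np.sum((U * R).sum(axis=0) - 0.5 * mu * (U ** 2).sum(axis=0))
    return loss, -X.T @ U

def cmr_spg(X, Y, lam, mu=1e-4, eps=1e-4, max_iter=5000):
    d, m = X.shape[1], Y.shape[1]
    gamma = np.linalg.norm(X.T @ X, 2)       # Lipschitz constant of G_mu is gamma/mu
    eta = mu / gamma                         # constant step size eta_t = mu / gamma
    B = H = A = np.zeros((d, m))
    obj = lambda M: smoothed_l21(X, Y, M, mu)[0] + lam * np.linalg.norm(M, axis=1).sum()
    for t in range(1, max_iter + 1):
        theta = 2.0 / (t + 1)
        V = (1 - theta) * B + theta * A
        _, G = smoothed_l21(X, Y, V, mu)
        H_new = prox_l12(V - eta * G, eta * lam)
        B_new = min((H_new, B), key=obj)     # monotone step: keep the better of {H, B}
        A = B + (H_new - B) / theta
        done = max(np.linalg.norm(B_new - B), np.linalg.norm(H_new - H)) <= eps
        B, H = B_new, H_new
        if done:
            break
    return B
```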

Remark 7

The smoothed proximal gradient algorithm has a worst-case iteration complexity of O(1/ϵ), where ϵ is a pre-specified accuracy of the objective value5. See more details in Nesterov (2005); Beck and Teboulle (2009a).

5. Numerical Experiments

To compare the finite-sample performance between the calibrated multivariate regression (CMR) and ordinary multivariate regression (OMR), we conduct numerical experiments on both simulated and real data sets.

5.1 Simulated Data

We generate training data sets of 400 samples for the low rank setting and 200 samples for the joint sparsity setting. In detail, for the low rank setting, we use the following data generation scheme:

  • (1) Generate each row of the design matrix X_{i*}, i = 1, …, 400, independently from a 200-dimensional normal distribution N(0, Σ), where Σ_{jj} = 1 and Σ_{jℓ} = 0.5 for all ℓ ≠ j.
  • (2) Generate the regression coefficient matrix B^0 = LR^T, where L ∈ R^{200×3}, R ∈ R^{101×3}, and all entries of L and R are independently generated from N(0, 0.05).
  • (3) Generate the random noise matrix Z = WD, where W ∈ R^{400×101} with all entries of W independently generated from N(0, 1), and D is either of the following matrices:
    $$D = \sigma_{\max}\,\mathrm{diag}\big(2^{-\frac{0}{100}},\ 2^{-\frac{3}{100}},\ \ldots,\ 2^{-\frac{297}{100}},\ 2^{-\frac{300}{100}}\big) \in \mathbb{R}^{101\times 101},$$
    (26)
    $$D = \sigma_{\max}\,\mathrm{diag}(1, 1, \ldots, 1, 1) \in \mathbb{R}^{101\times 101}.$$
    (27)

For the joint sparsity setting, we use the following data generation scheme:

  • (1) Generate each row of the design matrix X_{i*}, i = 1, …, 200, independently from an 800-dimensional normal distribution N(0, Σ), where Σ_{jj} = 1 and Σ_{jℓ} = 0.5 for all ℓ ≠ j.
  • (2) For k = 1, …, 13, set the regression coefficient matrix B^0 ∈ R^{800×13} as B^0_{1k} = 3, B^0_{2k} = 2, B^0_{4k} = 1.5, and B^0_{jk} = 0 for all j ≠ 1, 2, 4.
  • (3) Generate the random noise matrix Z = WD, where W ∈ R^{200×13} with all entries of W independently generated from N(0, 1), and D is either of the following matrices:
    $$D = \sigma_{\max}\,\mathrm{diag}\big(2^{-\frac{0}{4}},\ 2^{-\frac{1}{4}},\ \ldots,\ 2^{-\frac{11}{4}},\ 2^{-\frac{12}{4}}\big) \in \mathbb{R}^{13\times 13},$$
    (28)
    $$D = \sigma_{\max}\,\mathrm{diag}(1, 1, \ldots, 1, 1) \in \mathbb{R}^{13\times 13}.$$
    (29)

In addition, we generate validation sets (400 samples for the low rank setting and 200 samples for the joint sparsity setting) for the regularization parameter selection, and testing sets (10,000 samples for both settings) to evaluate the prediction accuracy.
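As a concrete reference, the sketch below generates one joint-sparsity training set following the scheme above with the heteroscedastic scale matrix. It reads the exponents in (28) as decreasing powers of 2 (our assumption), and the helper name is ours.

```python
import numpy as np

def generate_joint_sparsity_data(n=200, d=800, m=13, sigma_max=2.0, seed=0):
    rng = np.random.default_rng(seed)
    Sigma = np.full((d, d), 0.5) + 0.5 * np.eye(d)           # Sigma_jj = 1, Sigma_jl = 0.5
    X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    B0 = np.zeros((d, m))
    B0[0, :], B0[1, :], B0[3, :] = 3.0, 2.0, 1.5             # rows 1, 2, 4 are nonzero
    D = sigma_max * np.diag(2.0 ** (-np.arange(m) / 4.0))    # task-specific noise scales (28)
    Z = rng.standard_normal((n, m)) @ D
    return X, B0, X @ B0 + Z
```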

Remark 8

The scale matrices in (26) and (28) correspond to the scenario where the regression tasks have different variances. The scale matrices in (27) and (29) correspond to the scenario where all regression tasks have equal variance.

In numerical experiments, we set σmax = 1, 2, and 4 to illustrate the tuning insensitivity of CMR. The regularization parameter λ of both CMR and OMR is chosen over a grid

$$\Lambda = \big\{2^{\frac{40}{4}}\lambda_0,\ 2^{\frac{39}{4}}\lambda_0,\ \ldots,\ 2^{-\frac{17}{4}}\lambda_0,\ 2^{-\frac{18}{4}}\lambda_0\big\}.$$

We choose

$$\lambda_0 = \frac{\|X\|_2\,(\sqrt{d}+\sqrt{m})}{\sqrt{n}} \quad \text{and} \quad \lambda_0 = \sqrt{\log d} + \sqrt{m}$$

for the low rank and joint sparsity settings. The optimal regularization parameter λ^ is determined by the prediction error as

$$\hat\lambda = \mathop{\mathrm{argmin}}_{\lambda\in\Lambda}\ \|\tilde Y - \tilde X \hat B_{\lambda}\|_F^2,$$

where B̂_λ denotes the estimate obtained with the regularization parameter λ, and X̃ and Ỹ denote the design and response matrices of the validation set.

Since the noise level σk ’s may vary across different regression tasks, we adopt the following three criteria to evaluate the empirical performance:

$$\mathrm{P.E.} = \frac{1}{10000}\|Y - X\hat B\|_F^2, \qquad \mathrm{A.P.E.} = \frac{1}{10000\,m}\|(Y - X\hat B)D^{-1}\|_F^2, \qquad \mathrm{E.E.} = \frac{1}{m}\|\hat B - B^0\|_F^2,$$

where X and Y denote the design and response matrices of the testing set.
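The three criteria can be computed directly from the test data. The short helper below (names are ours) makes the definitions explicit; the divisor 10,000 is the test-set size.

```python
import numpy as np

def evaluate(B_hat, B0, X_test, Y_test, D):
    n_test, m = Y_test.shape                        # n_test = 10,000 in the simulations
    R = Y_test - X_test @ B_hat
    pe = np.linalg.norm(R) ** 2 / n_test                              # P.E.
    ape = np.linalg.norm(R @ np.linalg.inv(D)) ** 2 / (n_test * m)    # A.P.E.
    ee = np.linalg.norm(B_hat - B0) ** 2 / m                          # E.E.
    return pe, ape, ee
```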

All simulations are implemented in MATLAB on a PC with an Intel Core i5 3.3GHz CPU and 16GB memory. We set p = 2 for the joint sparsity setting, but it is straightforward to extend to arbitrary p > 2. OMR is solved by the monotone fast proximal gradient algorithm, where we set the stopping precision ε = 10^{−4}. CMR is solved by the proposed smoothed proximal gradient algorithm, where we set the stopping precision ε = 10^{−4} and the smoothing parameter µ = 10^{−4}.

We compare the statistical performance of CMR and OMR. Tables 1–4 summarize the results averaged over 200 simulations for both settings. In addition, since we know the true values of the σ_k's, we also present the results of the oracle estimator B̂ defined in (11). The oracle estimator is only for comparison purposes; it is not a practical estimator.

Table 1
Quantitative comparison of the statistical performance between CMR and OMR for the low rank setting with D defined in (26). The results are averaged over 200 simulations with the standard errors in parentheses. CMR universally outperforms OMR, and achieves ...
Table 4
Quantitative comparison of the statistical performance between CMR and OMR for the joint sparsity setting with D defined in (29). The results are averaged over 200 simulations with the standard errors in parentheses. CMR and OMR achieve similar statistical ...

Tables 1 and 3 present the empirical results when we adopt the scale matrices D defined in (26) and (28) to generate the random noise. Though our theoretical analysis in §3 only shows that CMR attains the same rates of convergence as OMR, our empirical results show that CMR universally outperforms OMR and achieves almost the same performance as the oracle estimator. These results corroborate the effectiveness of the calibration for each task.

Table 3
Quantitative comparison of the statistical performance between CMR and OMR for the joint sparsity setting with D defined in (28). The results are averaged over 200 simulations with the standard errors in parentheses. CMR universally outperforms OMR, and ...

Tables 2 and 4 present the empirical results when we adopt the scale matrices D defined in (27) and (29), with all σ_k's being equal. We can see that CMR attains similar performance to OMR. This indicates that CMR is a safe replacement of OMR for multivariate regression.

Table 2
Quantitative comparison of the statistical performance between CMR and OMR for the low rank setting with D defined in (27). The results are averaged over 200 simulations with the standard errors in parentheses. CMR and OMR achieve similar statistical ...

In addition, we also examine the optimal regularization parameters for CMR and OMR over all replicates. We visualize the distribution of all 200 selected λ̂'s using a kernel density estimator. In particular, we adopt the Gaussian kernel and select the kernel bandwidth by 10-fold cross validation. Figure 1 illustrates the estimated density functions. The horizontal axis corresponds to the rescaled regularization parameter as follows:

$$\text{Low rank:}\quad \log\left(\frac{\hat\lambda}{(\sqrt{d}+\sqrt{m})\,\|X\|_2/\sqrt{n}}\right),$$

$$\text{Joint sparsity:}\quad \log\left(\frac{\hat\lambda}{\sqrt{\log d}+\sqrt{m}}\right).$$

We see that the optimal regularization parameters of OMR significantly vary with different σmax. In contrast, the optimal regularization parameters of CMR are more concentrated. This is consistent with our claimed tuning insensitivity.

Figure 1
The distributions of the selected regularization parameters using the kernel density estimator. The numbers in the parentheses are σmax’s. The optimal regularization parameters of OMR are more spread with different σmax than those ...

5.2 Real Data

We apply CMR to a brain activity prediction problem, which aims to build a parsimonious model to predict a person's neural activity when seeing a stimulus word. As illustrated in Figure 2, for a given stimulus word, we first encode it into an intermediate semantic feature vector using corpus statistics. We then model the brain's neural activity pattern using CMR. Creating such a predictive model not only enables us to explore new analytical tools for fMRI data, but also helps us gain a deeper understanding of how the human brain represents knowledge (Mitchell et al., 2008). As will be shown in this section, prediction based on the features selected by CMR significantly outperforms that based on the features selected by OMR, and is even better than that based on the handcrafted features selected by human experts.

Figure 2
An illustration of the fMRI brain activity prediction problem (Mitchell et al., 2008). (a) To collect the data, a human participant sees a sequence of English words and their images. The corresponding fMRI images are recorded to represent the brain activity ...

5.2.1 Data

The data are obtained from Mitchell et al. (2008) and contain an fMRI image data set and a text data set. The fMRI data are collected from an experiment with 9 participants; 60 nouns are selected as stimulus words from 12 different categories (see Table 5). When a participant sees a stimulus word, the fMRI device records an image6. Each image contains 20,601 voxels that represent the neural activities of the participant's brain. Therefore the total number of images is 9 × 60 = 540. Since many of the 20,601 voxels are noisy, Mitchell et al. (2008) exploit a “stability score” approach to extract the 500 most stable voxels. See more details in Mitchell et al. (2008).

Table 5
The 60 stimulus words used in Mitchell et al. (2008) from 12 categories (5 per category).

The text data set is collected from the Google Trillion Word corpus7. It contains the co-occurrence frequencies of the 60 stimulus words with the 5,000 most frequent English words in the corpus, with 100 stop words removed. In Mitchell et al. (2008), 25 sensory-action verbs (see Table 6) are handcrafted by human experts based on the domain knowledge of cognitive neuroscience. These 25 words are closely related to the 60 stimulus words in their semantic meanings. For example, “eat” is related to vegetables such as “lettuce” or “tomato”, and “wear” is related to clothing such as “shirt” and “dress”.

Table 6
The 25 verbs used in Mitchell et al. (2008). They are handcrafted based on the domain knowledge of cognitive science, and are independent of the data set.

When building multivariate linear models, Mitchell et al. (2008) use the co-occurrence frequencies of each stimulus word with the 25 sensory-action verbs as covariates and the corresponding fMRI image as the response. They estimate a 25-dimensional multivariate linear model by ridge regression. They show that the obtained predictive model significantly outperforms random guessing. Thus, they treat these 25 words as a semantic basis.

In our experiment below, we apply CMR to automatically select a semantic basis from the 5,000 most frequent English words. Compared with the protocol used in Mitchell et al. (2008), our approach is completely data-driven and outperforms the handcrafted basis in brain activity prediction accuracy for 5 out of 9 participants.

5.2.2 Experimental Protocol in Mitchell et al. (2008)

The evaluation procedure of Mitchell et al. (2008) is based on leave-two-out cross validation over all (60 choose 2) = 1,770 possible partitions. In each partition, we select 58 stimulus words out of 60 as the training set. Since each stimulus word is represented by 5,000 features, and each feature is the co-occurrence frequency of a potential basis word with the stimulus word, we obtain a 58 × 5,000 design matrix. Similarly, we can format the fMRI images corresponding to the 58 training stimulus words into a 58 × 500 response matrix. In the training stage, we apply CMR and OMR to select 25 basis words by adjusting the regularization parameters. We then use the remaining two stimulus words as a validation set and apply the estimated models to predict the neural activity for these two stimulus words. We evaluate the prediction performance based on the combined cosine similarity measure defined as follows.

Definition 9 (Combined Similarity Measure, Mitchell et al. (2008))

Let u ∈ R^m and v ∈ R^m denote the observed fMRI images of the two stimulus words in the validation set, and û ∈ R^m and v̂ ∈ R^m denote the corresponding predicted fMRI images. We say that the predicted images û and v̂ correctly label the two validation stimulus words if

$$\cos(u, \hat u) + \cos(v, \hat v) > \cos(u, \hat v) + \cos(v, \hat u),$$
(30)

where cos(u, v) = (uTv)/(||u||2||v||2).
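Definition 9 translates directly into code. The small helper below (ours, for illustration) returns whether a pair of predictions correctly labels the two held-out stimulus words in the sense of (30).

```python
import numpy as np

def cosine(u, v):
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def correctly_labeled(u, v, u_hat, v_hat):
    """True if the predicted images match the held-out pair in the sense of (30)."""
    return cosine(u, u_hat) + cosine(v, v_hat) > cosine(u, v_hat) + cosine(v, u_hat)
```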

We then summarize the overall prediction accuracy for each participant by the percentage of correct labelings over all 1,770 partitions. Table 7 presents the prediction accuracies for the 9 participants. We see that CMR universally outperforms OMR across all 9 participants, by 4.42% on average. Since the accuracy required for statistical significance at the 5% level is 0.61, CMR achieves statistically significant accuracy for 8 out of 9 participants.

Table 7
Prediction accuracies evaluated using the experimental protocol in Mitchell et al. (2008). CMR universally outperforms OMR across all participants.

5.2.3 An Improved Experimental Protocol

There are two drawbacks of the previous protocol: (1) The selected basis words vary considerably across different cross-validation partitions and participants. Such high variability makes the obtained results difficult to interpret; (2) The automatic semantic basis selection of CMR and OMR is sensitive to data outliers, which are common in fMRI studies. In this section, we improve this protocol to address these two problems in a more data-driven manner.

Our main idea is to simultaneously exploit the training data of multiple participants and use a stability criterion to select more stable semantic basis words (Meinshausen and Bühlmann, 2010). In detail, for each participant to be evaluated, we choose three representatives out of the remaining eight participants, namely those who achieve the three best leave-two-out cross validation prediction accuracies in Table 7. Taking Participant 2 and CMR as an example, the three selected representatives are Participants 1, 3, and 9, with the three highest accuracies of 0.783, 0.772, and 0.763. In this way, we can reduce the effects of possible data outliers. We then combine the fMRI images of the three representatives and formulate a multivariate regression problem with a 1,500-dimensional response. We conduct the leave-two-out cross validation as in the previous protocol using the combined data set, and count the frequency with which each potential basis word is selected over the 1,770 partitions. We then choose the 25 most frequent words as the semantic basis (see the sketch below). Finally, we apply the same procedure as in the previous protocol to the current candidate participant and evaluate the prediction accuracy using the combined cosine score.
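The stability step described above amounts to counting, over the 1,770 leave-two-out fits on the combined data, how often each candidate word is selected, and keeping the 25 most frequent ones. A minimal sketch with a hypothetical input format:

```python
from collections import Counter

def pick_semantic_basis(selected_words_per_partition, basis_size=25):
    """selected_words_per_partition: one list of selected basis words per partition."""
    counts = Counter(w for words in selected_words_per_partition for w in words)
    return [word for word, _ in counts.most_common(basis_size)]
```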

Table 8 summarizes the prediction performance based on this improved protocol. We also report the results obtained with the 25 handcrafted basis words. Compared with the results in Table 7, we see that the performance of CMR is greatly improved. For Participants 1, 2, 3, 5, and 8, the prediction performance of CMR significantly outperforms the handcrafted basis. Moreover, since the candidate participant is not involved in the semantic basis word selection, our results imply that the selected semantic basis has good generalization capability across participants.

Table 8
Prediction accuracies evaluated using the improved protocol. CMR significantly outperforms the handcrafted basis words for 5 out of 9 participants.

Table 9 lists the 35 basis words obtained by CMR using the improved protocol. The words in bold font are those shared by all 9 participants. We see that our list contains nouns, adjectives, and verbs. These words are closely related to the 60 stimulus words. For example, lodge, hotel, and floor are closely related to “building” and “building parts”; green and fruit clearly refer to words in “vegetables”; built and using are related to “tools” and “man made objects”.

Table 9
The 35 basis words selected by CMR using the improved protocol. The words in bold font are shared by the predictive models for all 9 participants.

6. Discussion and Conclusion

Two other related methods are the square-root low rank multivariate regression (Klopp, 2011) and the square-root sparse multivariate regression (Bunea et al., 2013). They solve the convex program

$$\hat B = \mathop{\mathrm{argmin}}_{B}\ \|Y - XB\|_F + \lambda R(B).$$
(31)

The Frobenius loss in (31) makes the regularization parameter selection independent of σmax, but it does not calibrate different regression tasks. We can rewrite (31) as

$$(\hat B, \hat\sigma) = \mathop{\mathrm{argmin}}_{B,\sigma}\ \frac{1}{\sqrt{nm}\,\sigma}\|Y - XB\|_F^2 + \lambda R(B) \quad \text{subject to} \quad \sigma = \frac{1}{\sqrt{nm}}\|Y - XB\|_F.$$
(32)

Since σ in (32) is not specific to any individual task, it cannot calibrate the regularization. Thus it is fundamentally different from CMR.

The calibration technique proposed in this paper is quite general, and can be extended to more sophisticated scenarios, e.g. the regularization function is weakly decomposable or geometrically decomposable (Geer, 2014; Lee et al., 2013), or the regression coefficient matrix can be decomposed into multiple structured matrices (Agarwal et al., 2012; Chen et al., 2011; Gong et al., 2012; Jalali et al., 2010; Obozinski et al., 2010). Accordingly, the extensions of our proposed theory are also straightforward. We only need to replace their squared Frobenius loss-based analysis with the L2,1 loss based analysis in this paper.

Appendix A. Proof of Lemma 2

Note that the following two relations are frequently used in our analysis,

$$Y - XB^0 = XB^0 + Z - XB^0 = Z \quad \text{and} \quad Y - X\hat B = XB^0 + Z - X\hat B = Z - X\hat\Delta.$$

Proof Since B^0 ∈ S, we have B^0_{S^⊥} = 0 and thus B^0 = B^0_S. Then we have

$$R(\hat B) = R(B^0 + \hat\Delta) = R(B^0_S + \hat\Delta_N + \hat\Delta_{N^\perp}) \ge R(B^0_S + \hat\Delta_N) - R(\hat\Delta_{N^\perp}).$$
(33)

Since R(·) is decomposable with respect to (S, N), (33) further implies

$$R(\hat B) \ge R(B^0_S) + R(\hat\Delta_N) - R(\hat\Delta_{N^\perp}).$$
(34)

Since B^0 ∈ S, we have R(B^0) = R(B^0_S). Then by rearranging (34), we obtain

$$R(B^0) - R(\hat B) \le R(\hat\Delta_{N^\perp}) - R(\hat\Delta_N).$$
(35)

Since B^ is the optimum to (10), by (34), we further have

$$\|X\hat\Delta - Z\|_{2,1} - \|Z\|_{2,1} \le \lambda\big(R(B^0) - R(B^0 + \hat\Delta)\big) \le \lambda\big(R(\hat\Delta_{N^\perp}) - R(\hat\Delta_N)\big).$$
(36)

Due to the convexity of || · ||2,1, we know

$$\|X\hat\Delta - Z\|_{2,1} - \|Z\|_{2,1} \ge \langle G^0, \hat\Delta\rangle \ge -\big|\langle G^0, \hat\Delta\rangle\big|.$$
(37)

By the Cauchy-Schwarz inequality, we obtain

$$\big|\langle G^0, \hat\Delta\rangle\big| \le R^*(G^0)\,R(\hat\Delta) \le \frac{\lambda}{c}\big(R(\hat\Delta_N) + R(\hat\Delta_{N^\perp})\big),$$
(38)

where the last inequality comes from the assumption λ ≥ cR*(G^0) and the triangle inequality R(Δ̂) ≤ R(Δ̂_N) + R(Δ̂_{N^⊥}). By combining (36), (37), and (38), we obtain

$$-\frac{\lambda}{c}\big(R(\hat\Delta_N) + R(\hat\Delta_{N^\perp})\big) \le \lambda\big(R(\hat\Delta_{N^\perp}) - R(\hat\Delta_N)\big).$$
(39)

By rearranging (39), we obtain (c − 1)R(Δ̂_N) ≤ (c + 1)R(Δ̂_{N^⊥}), which completes the proof.

Appendix B. Proof of Theorem 3

Proof

We have

$$\|X\hat\Delta - Z\|_{2,1} - \|Z\|_{2,1} = \sum_{k=1}^m\big(\|X\hat\Delta_{*k} - Z_{*k}\|_2 - \|Z_{*k}\|_2\big) = \sum_{k=1}^m \frac{\|X\hat\Delta_{*k}\|_2^2 - 2(X\hat\Delta_{*k})^T Z_{*k}}{\|X\hat\Delta_{*k} - Z_{*k}\|_2 + \|Z_{*k}\|_2} \ge \sum_{k=1}^m \frac{\|X\hat\Delta_{*k}\|_2^2}{\|X\hat\Delta_{*k}\|_2 + 2\|Z_{*k}\|_2} - 2\sum_{k=1}^m \frac{\hat\Delta_{*k}^T X^T Z_{*k}}{\|Z_{*k}\|_2}.$$
(40)

Since G^0_{*k} = X^T Z_{*k}/‖Z_{*k}‖_2, we have

$$\sum_{k=1}^m \frac{\hat\Delta_{*k}^T X^T Z_{*k}}{\|Z_{*k}\|_2} = \sum_{k=1}^m \hat\Delta_{*k}^T G^0_{*k} \le \sum_{k=1}^m\sum_{j=1}^d \big|\hat\Delta_{jk}\big|\,\big|G^0_{jk}\big| \le R^*(G^0)\,R(\hat\Delta),$$
(41)

where the last inequality follows from the Cauchy-Schwarz inequality. Recall that in the proof of Lemma 2, we already have (36) as follows,

$$\|X\hat\Delta - Z\|_{2,1} - \|Z\|_{2,1} \le \lambda\big(R(\hat\Delta_{N^\perp}) - R(\hat\Delta_N)\big).$$
(42)

Therefore by combining (42) with (40) and (41), we obtain

$$\sum_{k=1}^m \frac{\|X\hat\Delta_{*k}\|_2^2}{\|X\hat\Delta_{*k}\|_2 + 2\|Z_{*k}\|_2} \le \lambda\big(R(\hat\Delta_{N^\perp}) - R(\hat\Delta_N)\big) + 2R^*(G^0)R(\hat\Delta) \le \lambda\Big(1+\frac{2}{c}\Big)R(\hat\Delta_{N^\perp}) + \lambda\Big(\frac{2}{c}-1\Big)R(\hat\Delta_N) \le \frac{2\lambda}{c-1}R(\hat\Delta_{N^\perp}),$$
(43)

where the second inequality comes from the assumption λ ≥ cR*(G^0) and the triangle inequality R(Δ̂) ≤ R(Δ̂_N) + R(Δ̂_{N^⊥}), and the last inequality comes from (13) in Lemma 2. Meanwhile, by the triangle inequality, we also have

$$\sum_{k=1}^m \frac{\|X\hat\Delta_{*k}\|_2^2}{\|X\hat\Delta_{*k}\|_2 + 2\|Z_{*k}\|_2} \ge \sum_{k=1}^m \frac{\|X\hat\Delta_{*k}\|_2^2}{\|X\hat\Delta\|_{2,\infty} + 2\|Z\|_{2,\infty}} \ge \frac{\|X\hat\Delta\|_F^2}{\|X\hat\Delta\|_F + 2\|Z\|_{2,\infty}},$$
(44)

where the last inequality comes from the fact that ‖XΔ̂‖_{2,∞} ≤ ‖XΔ̂‖_F. Combining (43) and (44), we obtain

$$\frac{\|X\hat\Delta\|_F^2}{\|X\hat\Delta\|_F + 2\|Z\|_{2,\infty}} \le \frac{2\lambda}{c-1}R(\hat\Delta_{N^\perp}) \le \frac{2\lambda\,\Theta(N^\perp, R)\,\|\hat\Delta\|_F}{c-1},$$
(45)

where the last inequality comes from the definition of Θ(N^⊥, R). By Assumption 1, we can rewrite (45) as

$$\|X\hat\Delta\|_F^2 \le \frac{2\lambda\,\Theta(N^\perp, R)}{(c-1)\sqrt{n}\,\kappa}\|X\hat\Delta\|_F^2 + \frac{4\lambda\,\Theta(N^\perp, R)}{\sqrt{n}\,\kappa(c-1)}\|Z\|_{2,\infty}\,\|X\hat\Delta\|_F.$$

Given 2λΘ(N^⊥, R) ≤ δ(c − 1)√n κ for some δ < 1, we have

$$\|X\hat\Delta\|_F \le \frac{4\lambda\,\Theta(N^\perp, R)}{\sqrt{n}\,\kappa(c-1)(1-\delta)}\|Z\|_{2,\infty} \le \frac{4\lambda\,\Theta(N^\perp, R)\,\sigma_{\max}}{\sqrt{n}\,\kappa(c-1)(1-\delta)}\|W\|_{2,\infty}.$$
(46)

By Assumption 1 again, we obtain

$$\|\hat\Delta\|_F \le \frac{4\lambda\,\Theta(N^\perp, R)\,\sigma_{\max}}{n\,\kappa^2(c-1)(1-\delta)}\|W\|_{2,\infty}.$$
(47)

We proceed with the standard deviation estimation. By (36), we have

$$\|Y - X\hat B\|_{2,1} - \|Y - XB^0\|_{2,1} \le \lambda R(\hat\Delta_{N^\perp}) - \lambda R(\hat\Delta_N) \le \lambda R(\hat\Delta_{N^\perp}).$$
(48)

Combining (48) with a simple variant of Assumption 1

$$\kappa \le \frac{\|X\hat\Delta\|_F}{\sqrt{n}\,\|\hat\Delta\|_F} \le \frac{\|X\hat\Delta\|_F}{\sqrt{n}\,\|\hat\Delta_{N^\perp}\|_F} \le \frac{\Theta(N^\perp, R)\,\|X\hat\Delta\|_F}{\sqrt{n}\,R(\hat\Delta_{N^\perp})},$$
(49)

we have

$$\sqrt{n}\left(\sum_{k=1}^m\hat\sigma_k - \sum_{k=1}^m\sigma_k\right) \le \frac{\lambda\,\Theta(N^\perp, R)\,\|X\hat\Delta\|_F}{\sqrt{n}\,\kappa} \le \frac{4\lambda^2\,\Theta^2(N^\perp, R)\,\sigma_{\max}}{n\,\kappa(c-1)(1-\delta)}\|W\|_{2,\infty},$$
(50)

where the last inequality comes from (46). By (37), (38), and Lemma 2, we have

$$\|Y - X\hat B\|_{2,1} - \|Y - XB^0\|_{2,1} \ge -\frac{\lambda}{c}\big(R(\hat\Delta_N) + R(\hat\Delta_{N^\perp})\big) \ge -\frac{2\lambda}{c-1}R(\hat\Delta_{N^\perp}).$$
(51)

By (49) again, we have

$$\sqrt{n}\left(\sum_{k=1}^m\sigma_k - \sum_{k=1}^m\hat\sigma_k\right) \le \frac{8\lambda^2\,\Theta^2(N^\perp, R)\,\sigma_{\max}}{n\,\kappa(c-1)^2(1-\delta)}\|W\|_{2,\infty}.$$
(52)

Thus combining (50) and (52), we have

$$\frac{1}{m}\left|\sum_{k=1}^m\hat\sigma_k - \sum_{k=1}^m\sigma_k\right| \le \max\left\{1, \frac{2}{c-1}\right\}\frac{4\lambda^2\,\Theta^2(N^\perp, R)\,\sigma_{\max}}{\sqrt{n}\,m\,n\,\kappa(c-1)(1-\delta)}\|W\|_{2,\infty}.$$
(53)

Appendix C. Proof of Corollary 4

We need to introduce the following lemmas for our proof.

Lemma 10

Suppose that all entries of a random vector v = (v_1, …, v_n)^T ∈ R^n are independently generated from the standard Gaussian distribution with mean 0 and variance 1. For any c_0 ∈ (0, 1), we have

$$\mathbb{P}\big(\big|\|v\|_2^2 - n\big| \ge c_0 n\big) \le 2\exp\Big(-\frac{nc_0^2}{8}\Big).$$

The proof of Lemma 10 is provided in Johnstone (2001), therefore omitted.

Lemma 11

Suppose that all entries of W are independently generated from the standard Gaussian distribution with mean 0 and variance 1. Then there exists some universal constant c_1 such that

$$\mathbb{P}\left(\frac{\|X^T W\|_2}{\sqrt{n}} \le \frac{2\|X\|_2}{\sqrt{n}}\big(\sqrt{m} + \sqrt{d}\big)\right) \ge 1 - 2\exp\big(-c_1(d+m)\big).$$
(54)

The proof of Lemma 11 is provided in Appendix E. Now we proceed to derive the refined error bound for the calibrated low rank regression estimator.

Proof

Since all entries of W are independently generated from N(0, 1), by Lemma 10, for any c_0 ∈ (0, 1), we have

$$\mathbb{P}\Big(\sqrt{(1-c_0)n} \le \|W_{*k}\|_2 \le \sqrt{(1+c_0)n}\Big) \ge 1 - 2\exp\Big(-\frac{nc_0^2}{8}\Big).$$

By taking the union bound over all k = 1, …, m, we have

$$\mathbb{P}\Big(\sqrt{(1-c_0)n} \le \min_{1\le k\le m}\|W_{*k}\|_2 \le \max_{1\le k\le m}\|W_{*k}\|_2 \le \sqrt{(1+c_0)n}\Big) \ge 1 - 2m\exp\Big(-\frac{nc_0^2}{8}\Big).$$
(55)

Now, conditioning on the event √((1 − c_0)n) ≤ min_{1≤k≤m} ‖W_{*k}‖_2, we have

$$R^*(G^0) = \|G^0\|_2 = \max_{\|v\|_2\le 1}\sqrt{\sum_{k=1}^m \frac{(v^T X^T W_{*k})^2}{\|W_{*k}\|_2^2}} \le \max_{\|v\|_2\le 1}\sqrt{\sum_{k=1}^m \frac{(v^T X^T W_{*k})^2}{(1-c_0)n}} = \frac{\|X^T W\|_2}{\sqrt{(1-c_0)n}}.$$
(56)

By Lemma 11, there exists some universal positive constant c1 such that we have

$$\mathbb{P}\left(\frac{\|X^T W\|_2}{\sqrt{(1-c_0)n}} \le \frac{2\|X\|_2(\sqrt{d}+\sqrt{m})}{\sqrt{n(1-c_0)}}\right) \ge 1 - 2\exp\big(-c_1(d+m)\big).$$
(57)

Given any matrix A in N^⊥, A has rank at most 2r (see more details in Appendix B of Negahban and Wainwright (2011)). Then we have

$$\|A\|_* = \sum_{j=1}^{2r}\psi_j(A) \le \sqrt{2r}\,\sqrt{\sum_{j=1}^{2r}\psi_j^2(A)} = \sqrt{2r}\,\|A\|_F.$$

Therefore we have Θ(N^⊥, ‖·‖_*) = √(2r). Theorem 3 requires

$$2\lambda\,\Theta(N^\perp, R) \le \delta\kappa(c-1)\sqrt{n} \quad \text{for some } \delta < 1.$$
(58)

Thus if we take

$$\lambda = \frac{2c\,\|X\|_2(\sqrt{m}+\sqrt{d})}{\sqrt{n(1-c_0)}},$$

then we need n to be large enough

$$n \ge \frac{4\sqrt{2}\,c\,\|X\|_2\big(\sqrt{rm}+\sqrt{rd}\big)}{\delta\kappa(c-1)\sqrt{1-c_0}},$$

such that (58) can be secured. Then by combining (55), (56), (57), (47), and (53), we complete the proof.

Appendix D. Proof of Corollary 5

We need to introduce the following lemma for our proof.

Lemma 12

Suppose that all entries of W are independently generated from the standard Gaussian distribution with mean 0 and variance 1. Then we have

$$\mathbb{P}\left(\max_{1\le j\le d}\frac{1}{\sqrt{n}}\|W^T X_{*j}\|_q \le 2\big(m^{1-\frac1p} + \sqrt{\log d}\big)\right) \ge 1 - \frac{2}{d},$$

where 1/p + 1/q = 1.

The proof of Lemma 12 is provided in Appendix F. Now we proceed to derive the refined error bound for the joint sparsity setting.

Proof

Recall that we already have (55),

$$\mathbb{P}\Big(\sqrt{(1-c_0)n} \le \min_{1\le k\le m}\|W_{*k}\|_2 \le \max_{1\le k\le m}\|W_{*k}\|_2 \le \sqrt{(1+c_0)n}\Big) \ge 1 - 2m\exp\Big(-\frac{nc_0^2}{8}\Big).$$
(59)

Now, conditioning on the event √((1 − c_0)n) ≤ min_{1≤k≤m} ‖W_{*k}‖_2, we have

$$R^*(G^0) = \|G^0\|_{\infty,q} = \max_{1\le j\le d}\left(\sum_{k=1}^m \frac{|W_{*k}^T X_{*j}|^q}{\|W_{*k}\|_2^q}\right)^{\frac1q} \le \frac{\max_{1\le j\le d}\|W^T X_{*j}\|_q}{\min_{1\le k\le m}\|W_{*k}\|_2} \le \frac{\|X^T W\|_{\infty,q}}{\sqrt{(1-c_0)n}}.$$
(60)

By Lemma 12, we have

$$\mathbb{P}\left(\frac{\|X^T W\|_{\infty,q}}{\sqrt{(1-c_0)n}} \le \frac{2m^{1-\frac1p}}{\sqrt{1-c_0}} + \frac{2\sqrt{\log d}}{\sqrt{1-c_0}}\right) \ge 1 - \frac{2}{d}.$$
(61)

Given any matrix A in N^⊥, A has at most s nonzero rows. Then we have

$$\|A\|_{1,p} = \sum_{A_{j*}\neq 0}\|A_{j*}\|_p \le \sum_{A_{j*}\neq 0}\|A_{j*}\|_2 \le \sqrt{s}\,\sqrt{\sum_{A_{j*}\neq 0}\|A_{j*}\|_2^2} = \sqrt{s}\,\|A\|_F.$$

Therefore we have Θ(N^⊥, ‖·‖_{1,p}) = √s for any 2 ≤ p ≤ ∞. Theorem 3 requires

$$2\lambda\,\Theta(N^\perp, R) \le \delta\kappa(c-1)\sqrt{n} \quad \text{for some } \delta < 1.$$
(62)

Thus if we take

$$\lambda = \frac{2c\,\big(m^{1-\frac1p}+\sqrt{\log d}\big)}{\sqrt{1-c_0}},$$

then we need n to be large enough

$$\sqrt{n} \ge \frac{4c\,\sqrt{s}\,\big(m^{1-\frac1p}+\sqrt{\log d}\big)}{\delta\kappa(c-1)\sqrt{1-c_0}},$$

such that (62) can be secured. Then by combining (59), (60), (61), (47), and (53), we complete the proof.

Appendix E. Proof of Lemma 11

Proof

Since all entries of W are independently generated from the standard Gaussian distribution with mean 0 and variance 1, the columns X^T W_{*k}/√n are independently generated from a multivariate Gaussian distribution with mean 0 and covariance matrix X^T X/n.

Thus by Corollary 5.50 in Vershynin (2010) on the singular values of Gaussian random matrices (Davidson and Szarek, 2001), we know that there exists a universal positive constant c1 such that

$$\mathbb{P}\left(\frac{\|X^T W\|_2}{\sqrt{n}} \le \frac{2\|X\|_2}{\sqrt{n}}\big(\sqrt{m} + \sqrt{d}\big)\right) \ge 1 - 2\exp\big(-c_1(d+m)\big),$$
(63)

which completes the proof.

Appendix F. Proof of Lemma 12

Proof

We adopt a proof strategy similar to that of Negahban et al. (2012), and begin our proof by establishing a tail bound for ‖W^T X_{*j}‖_q/√n.

Deviation above the mean

Given any pair W, W̃ ∈ R^{n×m} and 1/q + 1/p = 1, we have

$$\frac{1}{\sqrt{n}}\|W^T X_{*j}\|_q - \frac{1}{\sqrt{n}}\|\tilde W^T X_{*j}\|_q \le \frac{1}{\sqrt{n}}\|(W - \tilde W)^T X_{*j}\|_q = \frac{1}{\sqrt{n}}\max_{\|\theta\|_p\le 1}\big\langle\theta, (W - \tilde W)^T X_{*j}\big\rangle.$$
(64)

By the Cauchy-Schwarz inequality, we have

$$\frac{1}{\sqrt{n}}\max_{\|\theta\|_p\le 1}\big\langle\theta X_{*j}^T, W - \tilde W\big\rangle \le \frac{\|W - \tilde W\|_F}{\sqrt{n}}\max_{\|\theta\|_p\le 1}\|\theta X_{*j}^T\|_F.$$
(65)

Since θX_{*j}^T is a rank-one matrix, its singular value decomposition is

$$\theta X_{*j}^T = \|\theta\|_2\,\|X_{*j}\|_2\cdot\frac{\theta}{\|\theta\|_2}\cdot\frac{X_{*j}^T}{\|X_{*j}\|_2}.$$

Consequently, we have

$$\frac{1}{\sqrt{n}}\max_{\|\theta\|_p\le 1}\|\theta X_{*j}^T\|_F = \frac{\|X_{*j}\|_2}{\sqrt{n}}\max_{\|\theta\|_p\le 1}\|\theta\|_2 \overset{(i)}{\le} \frac{m^{\frac12-\frac1p}\,\|X_{*j}\|_2}{\sqrt{n}} \overset{(ii)}{=} 1,$$
(66)

where (i) comes from ‖θ‖_2 ≤ m^{1/2−1/p}‖θ‖_p, and (ii) comes from the column normalization condition (21). Combining (64), (65), and (66), we obtain

$$\frac{1}{\sqrt{n}}\|W^T X_{*j}\|_q - \frac{1}{\sqrt{n}}\|\tilde W^T X_{*j}\|_q \le \|W - \tilde W\|_F,$$
(67)

which implies that ‖W^T X_{*j}‖_q/√n is a Lipschitz continuous function of W with Lipschitz constant 1. By the Gaussian concentration of measure for Lipschitz functions (Ledoux and Talagrand, 2011), we have

$$\mathbb{P}\left(\frac{1}{\sqrt{n}}\|W^T X_{*j}\|_q \ge \mathbb{E}\frac{1}{\sqrt{n}}\|W^T X_{*j}\|_q + \xi\right) \le 2\exp\Big(-\frac{\xi^2}{2}\Big).$$
(68)

Upper bound of the mean

Given any β ∈ R^m, we define a zero-mean Gaussian random variable J_β = β^T W^T X_{*j}/√n, and note that ‖W^T X_{*j}‖_q/√n = max_{‖β‖_p=1} J_β. Thus, given any two vectors ‖β‖_p ≤ 1 and ‖β′‖_p ≤ 1, we have

$$\mathbb{E}(J_{\beta} - J_{\beta'})^2 = \frac{1}{n}\|X_{*j}\|_2^2\,\|\beta - \beta'\|_2^2 \le \|\beta - \beta'\|_2^2,$$

where the last inequality comes from (21) and m^{1−1/p} ≥ 1.

Then we define another Gaussian random variable K_β = β^T ω, where ω = (ω_1, …, ω_m)^T ~ N(0, I_m) is standard Gaussian. By construction, for any pair β, β′ ∈ R^m, we have

$$\mathbb{E}\big[(K_{\beta} - K_{\beta'})^2\big] = \|\beta - \beta'\|_2^2 \ge \mathbb{E}(J_{\beta} - J_{\beta'})^2.$$

Thus by the Sudakov-Fernique comparison principle (Ledoux and Talagrand, 2011), we have

$$\mathbb{E}\frac{1}{\sqrt{n}}\|W^T X_{*j}\|_q = \mathbb{E}\max_{\|\beta\|_p=1}J_{\beta} \le \mathbb{E}\max_{\|\beta\|_p=1}K_{\beta}.$$

By definition of Kβ, we have

$$\mathbb{E}\max_{\|\beta\|_p=1}K_{\beta} = \mathbb{E}\|\omega\|_q \le m^{\frac1q}\big(\mathbb{E}|\omega_1|^q\big)^{\frac1q},$$
(69)

where the last inequality comes from Jensen's inequality and the fact that |ω_1|^{1/q} is a concave function of ω_1 for q ∈ [1, 2]. Eventually, by Hölder's inequality, we obtain

$$\big(\mathbb{E}|\omega_1|^q\big)^{\frac1q} \le \sqrt{\mathbb{E}\,\omega_1^2} = 1.$$
(70)

Combining (69) and (70), we obtain

$$\mathbb{E}\max_{\|\beta\|_p=1}K_{\beta} \le m^{1-\frac1p} \le 2m^{1-\frac1p}.$$
(71)

Then combining (68) and (71), we have

$$\mathbb{P}\left(\frac{1}{\sqrt{n}}\|W^T X_{*j}\|_q \ge 2m^{1-\frac1p} + \xi\right) \le 2\exp\Big(-\frac{\xi^2}{2}\Big).$$

Taking the union bound over j = 1, …, d and letting ξ = 2√(log d), we have

$$\mathbb{P}\left(\frac{1}{\sqrt{n}}\|X^T W\|_{\infty,q} \ge 2m^{1-\frac1p} + 2\sqrt{\log d}\right) \le \frac{2}{d}.$$

This finishes the proof.

Footnotes

*Some preliminary results in this paper were presented at the 28th Annual Conference on Neural Information Processing Systems, Montreal, Quebec, Canada, 2014 (Liu et al., 2014a). This work is partially supported by grants NIH R01MH102339, NSF IIS1408910, NSF IIS1332109, NSF CAREER DMS1454377, NIH R01GM083084, NIH R01HG06841, and NSF Grant DMS-1005539.

1See more details on exploiting the covariance structure of the noise matrix Z for multivariate regression in Breiman and Friedman (2002); Reinsel (2003); Rothman et al. (2010).

2For the joint sparsity setting, the rate of convergence is optimal when R(B) = ||B||1,2. See more details in Lounici et al. (2011)

3Since (10) is not a strictly convex program, the optimum to (10) is not necessarily unique.

4In the fixed design setting for the low rank regression, ‖X‖_2 is supposed to increase at the order of √n. Thus ‖X‖_2/√n in (18) should be viewed as a constant.

5While this paper was under review, a dual proximal gradient algorithm was proposed for solving (10). See more details in Gong et al. (2014).

6Each image is actually the average of 6 consecutive recordings of each word.

7http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html

References

  • Agarwal A, Negahban S, Wainwright M. Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions. The Annals of Statistics. 2012;40:1171–1197.
  • Amit Y, Fink M, Srebro N, Ullman S. Uncovering shared structures in multiclass classification. Proceedings of the 24th international conference on Machine Learning; ACM; 2007.
  • Anderson T. An introduction to multivariate statistical analysis. Wiley; New York: 1958.
  • Anderson T. Asymptotic distribution of the reduced rank regression estimator under general conditions. The Annals of Statistics. 1999;27:1141–1154.
  • Ando RK, Zhang T. A framework for learning predictive structures from multiple tasks and unlabeled data. The Journal of Machine Learning Research. 2005;6:1817–1853.
  • Argyriou A, Evgeniou T, Pontil M. Convex multi-task feature learning. Machine Learning. 2008;73:243–272.
  • Argyriou A, Micchelli CA, Pontil M. On spectral learning. The Journal of Machine Learning Research. 2010;11:935–953.
  • Baxter J. A model of inductive bias learning. Journal of Artificial Intelligence Research. 2000;12:149–198.
  • Beck A, Teboulle M. Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. IEEE Transactions on Image Processing. 2009a;18:2419–2434.
  • Beck A, Teboulle M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences. 2009b;2:183–202.
  • Belloni A, Chernozhukov V, Wang L. Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika. 2011;98:791–806.
  • Bickel PJ, Ritov Y, Tsybakov AB. Simultaneous analysis of lasso and dantzig selector. The Annals of Statistics. 2009;37:1705–1732.
  • Boyd S, Parikh N, Chu E, Peleato B, Eckstein J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning. 2011;3:1–122.
  • Breiman L, Friedman J. Predicting multivariate responses in multiple linear regression. Journal of the Royal Statistical Society: Series B. 2002;59:3–54.
  • Bunea F, Barbu A. Dimension reduction and variable selection in case control studies via regularized likelihood optimization. Electronic Journal of Statistics. 2009;3:1257–1287.
  • Bunea F, Lederer J, She Y. The group square-root lasso: Theoretical properties and fast algorithms. IEEE Transactions on Information Theory. 2013;60:1313–1325.
  • Bunea F, She Y, Wegkamp M. Joint variable and rank selection for parsimonious estimation of high dimensional matrices. The Annals of Statistics. 2012;40:2359–2388.
  • Bunea F, She Y, Wegkamp MH. Optimal selection of reduced rank estimators of high-dimensional matrices. The Annals of Statistics. 2011;39:1282–1309.
  • Caruana R. Multitask learning. Machine Learning. 1997;28:41–75.
  • Caruana R, Baluja S, Mitchell T, et al. Using the future to “sort out” the present: Rankprop and multitask learning for medical risk evaluation. Advances in Neural Information Processing Systems. 1996
  • Chen J, Zhou J, Ye J. Integrating low-rank and group-sparse structures for robust multi-task learning. Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining; ACM; 2011.
  • Chen S, Donoho D, Saunders M. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing. 1998;20:33–61.
  • Chen X, Lin Q, Kim S, Carbonell JG, Xing EP. Smoothing proximal gradient method for general structured sparse regression. The Annals of Applied Statistics. 2012;6:719–752.
  • Davidson KR, Szarek SJ. Local operator theory, random matrices and Banach spaces. Handbook of the geometry of Banach spaces. 2001;1:317–366.
  • Evgeniou T, Micchelli CA, Pontil M. Learning multiple tasks with kernel methods. Journal of Machine Learning Research. 2006;6:615.
  • Foygel R, Srebro N. Concentration-based guarantees for low-rank matrix reconstruction. 24th Annual Conference on Learning Theory.2011.
  • Geer S. Weakly decomposable regularization penalties and structured sparsity. Scandinavian Journal of Statistics. 2014;41:72–86.
  • Giraud C. Low rank multivariate regression. Electronic Journal of Statistics. 2011;5:775–799.
  • Gong P, Ye J, Zhang C. Robust multi-task feature learning. Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining; ACM; 2012.
  • Gong P, Zhou J, Fan W, Ye J. Efficient multi-task feature learning with calibration. Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; ACM; 2014.
  • Heskes T. Empirical Bayes for learning to learn. Proceedings of the 17th International Conference on Machine Learning.2000.
  • Izenman A. Reduced-rank regression for the multivariate linear model. Journal of multivariate analysis. 1975;5:248–264.
  • Izenman AJ. Modern multivariate statistical techniques: regression, classification, and manifold learning. Springer; 2008.
  • Jalali A, Ravikumar P, Sanghavi S, Ruan C. A dirty model for multi-task learning. Advances in Neural Information Processing Systems. 2010
  • Johnson R, Zhang T. Graph-based semi-supervised learning and spectral kernel design. IEEE Transactions on Information Theory. 2008;54:275–288.
  • Johnstone IM. Chi-square oracle inequalities. Lecture Notes-Monograph Series. 2001:399–418.
  • Klopp O. Tech. rep. Université Paris Ouest Nanterre La Défense; 2011. High dimensional matrix estimation with unknown variance of the noise. URL http://arxiv.org/abs/1112.3055.
  • Kolar M, Lafferty J, Wasserman L. Union support recovery in multi-task learning. Journal of Machine Learning Research. 2011;12:2415–2435.
  • Ledoux M, Talagrand M. Probability in Banach Spaces: isoperimetry and processes. Springer; 2011.
  • Lee J, Sun Y, Taylor JE. On model selection consistency of penalized m-estimators: a geometric theory. Advances in Neural Information Processing Systems. 2013
  • Liu H, Palatucci M, Zhang J. Blockwise coordinate descent procedures for the multi-task lasso, with applications to neural semantic basis discovery. Proceedings of the 26th Annual International Conference on Machine Learning; ACM; 2009a. URL http://arxiv.org/abs/1209.2437.
  • Liu H, Wang L, Zhao T. Multivariate regression with calibration. Advances in Neural Information Processing Systems. 2014a
  • Liu H, Wang L, Zhao T. Sparse covariance matrix estimation with eigenvalue constraints. Journal of Computational and Graphical Statistics. 2014b;23:439–459.
  • Liu J, Ji S, Ye J. Multi-task feature learning via efficient l2,1-norm minimization. Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence; AUAI Press; 2009b.
  • Liu J, Ye J. Tech. rep. Arizona State University; 2010. Efficient l1/lq norm regularization. URL http://arxiv.org/abs/1009.4766.
  • Lounici K, Pontil M, Van De Geer S, Tsybakov A. Oracle inequalities and optimal inference under group sparsity. The Annals of Statistics. 2011;39:2164–2204.
  • Meier L, Van De Geer S, Bühlmann P. The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B. 2008;70:53–71.
  • Meinshausen N, Bühlmann P. Stability selection. Journal of the Royal Statistical Society: Series B. 2010;72:417–473.
  • Mitchell T, Shinkareva S, Carlson A, Chang K, Malave V, Mason R, Just M. Predicting human brain activity associated with the meanings of nouns. Science. 2008;320:1191–1195.
  • Mukherjee A, Wang N, Zhu J. Tech. rep. University of Michigan Ann Arbor; 2012. Degrees of freedom of the reduced rank regression. URL http://arxiv.org/abs/1210.2464.
  • Negahban S, Wainwright MJ. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. The Annals of Statistics. 2011;39:1069–1097.
  • Negahban SN, Ravikumar P, Wainwright MJ, Yu B. A unified framework for high-dimensional analysis of m-estimators with decomposable regularizers. Statistical Science. 2012;27:538–557.
  • Nesterov Y. Smooth minimization of non-smooth functions. Mathematical Programming. 2005;103:127–152.
  • Obozinski G, Taskar B, Jordan M. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing. 2010;20:231–252.
  • Raskutti G, Wainwright MJ, Yu B. Restricted eigenvalue properties for correlated gaussian designs. The Journal of Machine Learning Research. 2010;11:2241–2259.
  • Reinsel G. Elements of multivariate time series analysis. Springer Verlag; 2003.
  • Reinsel G, Velu R. Multivariate reduced-rank regression: theory and applications. Springer; New York: 1998.
  • Rohde A, Tsybakov AB. Estimation of high-dimensional low-rank matrices. The Annals of Statistics. 2011;39:887–930.
  • Rothman A, Levina E, Zhu J. Sparse multivariate regression with covariance estimation. Journal of Computational and Graphical Statistics. 2010;19:947–962.
  • Salakhutdinov R, Srebro N. Collaborative filtering in a non-uniform world: Learning with the weighted trace norm. Advances in Neural Information Processing Systems. 2010
  • Sun T, Zhang C. Scaled sparse linear regression. Biometrika. 2012;101:1–20.
  • Teh YW, Seeger M, Jordan MI. Semiparametric latent factor models. Proceedings of the International Workshop on Artificial Intelligence and Statistics.2005.
  • Thrun S. Is learning the n-th thing any easier than learning the first? Advances in Neural Information Processing Systems. 1996
  • Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B. 1996;58:267–288.
  • Toh K-C, Yun S. An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems. Pacific Journal of Optimization. 2010;6:15.
  • Turlach B, Venables W, Wright S. Simultaneous variable selection. Technometrics. 2005;47:349–363.
  • Vershynin R. Introduction to the non-asymptotic analysis of random matrices. Compressed Sensing, Theory and Applications. 2010:210–268.
  • Yu K, Tresp V, Schwaighofer A. Learning Gaussian processes from multiple tasks. Proceedings of the 22nd international conference on Machine Learning.2005.
  • Yuan M, Ekici A, Lu Z, Monteiro R. Dimension reduction and coefficient estimation in multivariate linear regression. Journal of the Royal Statistical Society: Series B. 2007;69:329–346.
  • Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B. 2005;68:49–67.
  • Zhang J. A probabilistic framework for multi-task learning. Carnegie Mellon University, Language Technologies Institute, School of Computer Science; 2006. Ph.D. thesis.
  • Zhang J, Ghahramani Z, Yang Y. Learning multiple related tasks using latent independent component analysis. Advances in Neural Information Processing Systems. 2006
  • Zhao T, Liu H. Sparse additive machine. Proceedings of the 15th International Conference on Artificial Intelligence and Statistics.2012.
  • Zhao T, Liu H. Calibrated precision matrix estimation for high-dimensional elliptical distributions. IEEE Transactions on Information Theory. 2014;60:7874–7887.
  • Zhao T, Liu H, Zhang T. A general theory of pathwise coordinate optimization. arXiv preprint arXiv:1412.7477. 2014a
  • Zhao T, Roeder K, Liu H. Positive semidefinite rank-based correlation matrix estimation with application to semiparametric graph estimation. Journal of Computational and Graphical Statistics. 2014b;23:895–922.
  • Zhao T, Yu M, Wang Y, Arora R, Liu H. Accelerated mini-batch randomized block coordinate descent method. Advances in Neural Information Processing Systems. 2014c
  • Zhou J, Chen J, Ye J. Tech. rep. Arizona State University; 2012. MALSAR: Multi-task learning via structural regularization. URL http://www.public.asu.edu/~jye02/Software/MALSAR.