Home | About | Journals | Submit | Contact Us | Français |

**|**HHS Author Manuscripts**|**PMC5354362

Formats

Article sections

- Abstract
- 1 Introduction
- 2 Sparse tensor graphical model
- 3 Theory of statistical optimization
- 4 Simulations
- Supplementary Material
- References

Authors

Related links

Adv Neural Inf Process Syst. Author manuscript; available in PMC 2017 March 16.

Published in final edited form as:

Adv Neural Inf Process Syst. 2015; 28: 1081–1089.

PMCID: PMC5354362

NIHMSID: NIHMS752309

Wei Sun, Yahoo Labs, Sunnyvale, CA;

We consider the estimation of sparse graphical models that characterize the dependency structure of high-dimensional tensor-valued data. To facilitate the estimation of the precision matrix corresponding to each way of the tensor, we assume the data follow a tensor normal distribution whose covariance has a Kronecker product structure. The penalized maximum likelihood estimation of this model involves minimizing a non-convex objective function. In spite of the non-convexity of this estimation problem, we prove that an alternating minimization algorithm, which iteratively estimates each sparse precision matrix while fixing the others, attains an estimator with the optimal statistical rate of convergence as well as consistent graph recovery. Notably, such an estimator achieves estimation consistency with only one tensor sample, which is unobserved in previous work. Our theoretical results are backed by thorough numerical studies.

High-dimensional tensor-valued data are prevalent in many fields such as personalized recommendation systems and brain imaging research [1, 2]. Traditional recommendation systems are mainly based on the user-item matrix, whose entry denotes each user’s preference for a particular item. To incorporate additional information into the analysis, such as the temporal behavior of users, we need to consider a user-item-time tensor. For another example, functional magnetic resonance imaging (fMRI) data can be viewed as a three way (third-order) tensor since it contains the brain measurements taken on different locations over time for various experimental conditions. Also, in the example of microarray study for aging [3], thousands of gene expression measurements are recorded on 16 tissue types on 40 mice with varying ages, which forms a four way gene-tissue-mouse-age tensor.

In this paper, we study the estimation of conditional independence structure within tensor data. For example, in the microarray study for aging we are interested in the dependency structure across different genes, tissues, ages and even mice. Assuming data are drawn from a tensor normal distribution, a straightforward way to estimate this structure is to vectorize the tensor and estimate the underlying Gaussian graphical model associated with the vector. Such an approach ignores the tensor structure and requires estimating a rather high dimensional precision matrix with insufficient sample size. For instance, in the aforementioned fMRI application the sample size is one if we aim to estimate the dependency structure across different locations, time and experimental conditions. To address such a problem, a popular approach is to assume the covariance matrix of the tensor normal distribution is separable in the sense that it is the Kronecker product of small covariance matrices, each of which corresponds to one way of the tensor. Under this assumption, our goal is to estimate the precision matrix corresponding to each way of the tensor. See § 1.1 for a detailed survey of previous work.

Despite the fact that the assumption of the Kronecker product structure of covariance makes the statistical model much more parsimonious, it poses significant challenges. In particular, the penalized negative log-likelihood function is non-convex with respect to the unknown sparse precision matrices. Consequently, there exists a gap between computational and statistical theory. More specifically, as we will show in §1.1, existing literature mostly focuses on establishing the existence of a local optimum that has desired statistical guarantees, rather than offering efficient algorithmic procedures that provably achieve the desired local optima. In contrast, we analyze an alternating minimization algorithm which iteratively minimizes the non-convex objective function with respect to each individual precision matrix while fixing the others. The established theoretical guarantees of the proposed algorithm are as follows. Suppose that we have *n* observations from a *K*-th order tensor normal distribution. We denote by *m _{k}, s_{k}, d_{k}* (

A special case of our sparse tensor graphical model when *K* = 2 is the sparse matrix graphical model, which is studied by [5–8]. In particular, [5] and [6] only establish the existence of a local optima with desired statistical guarantees. Meanwhile, [7] considers an algorithm that is similar to ours. However, the statistical rates of convergence obtained by [6, 7] are much slower than ours when *K* = 2. See Remark 3.6 in §3.1 for a detailed comparison. For *K* = 2, our statistical rate of convergence in Frobenius norm recovers the result of [5]. In other words, our theory confirms that the desired local optimum studied by [5] not only exists, but is also attainable by an efficient algorithm. In addition, for matrix graphical model, [8] establishes the statistical rates of convergence in spectral and Frobenius norms for the estimator attained by a similar algorithm. Their results achieve estimation consistency in spectral norm with only one matrix observation. However, their rate is slower than ours with *K* = 2. See Remark 3.11 in §3.2 for a detailed discussion. Furthermore, we allow *K* to increase and establish estimation consistency even in Frobenius norm for *n* = 1. Most importantly, all these results focus on matrix graphical model and can not handle the aforementioned motivating applications such as the gene-tissue-mouse-age tensor dataset.

In the context of sparse tensor graphical model with a general *K*, [9] shows the existence of a local optimum with desired rates, but does not prove whether there exists an efficient algorithm that provably attains such a local optimum. In contrast, we prove that our alternating minimization algorithm achieves an estimator with desired statistical rates. To achieve it, we apply a novel theoretical framework to separately consider the population and sample optimizers, and then establish the onestep convergence for the population optimizer (Theorem 3.1) and the optimal rate of convergence for the sample optimizer (Theorem 3.4). A new concentration result (Lemma B.1) is developed for this purpose, which is also of independent interest. Moreover, we establish additional theoretical guarantees including the optimal rate of convergence in max norm, the estimation consistency in spectral norm, and the graph recovery consistency of the proposed sparse precision matrix estimator.

In addition to the literature on graphical models, our work is also closely related to a recent line of research on alternating minimization for non-convex optimization problems [10–13]. These existing results mostly focus on problems such as dictionary learning, phase retrieval and matrix decomposition. Hence, our statistical model and analysis are completely different from theirs. Also, our paper is related to a recent line of work on tensor decomposition. See, e.g., [14–17] and the references therein. Compared with them, our work focuses on the graphical model structure within tensor-valued data.

For a matrix **A** = (**A**_{i,j}) ^{d×d}, we denote ‖**A**‖∞, ‖**A**‖_{2}, ‖**A**‖_{F} as its max, spectral, and Frobenius norm, respectively. We define ‖**A**‖_{1,off} := ∑_{i≠j} |**A**_{i,j}| as its off-diagonal _{1} norm and |‖**A**‖|_{∞} := max_{i} ∑_{j} |**A**_{i,j}| as the maximum absolute row sum. Denote vec(**A**) as the vectorization of **A** which stacks the columns of **A**. Let tr(**A**) be the trace of **A**. For an index set = {(*i, j*), *i, j* {1,…, *d*}}, we define [**A**]_{} as the matrix whose entry indexed by (*i, j*) is equal to **A**_{i,j}, and zero otherwise. We denote _{d} as the identity matrix with dimension *d*×*d*. Throughout this paper, we use *C, C*_{1}, *C*_{2},… to denote generic absolute constants, whose values may vary from line to line.

We employ the tensor notations used by [18]. Throughout this paper, higher order tensors are denoted by boldface Euler script letters, e.g. . We consider a *K*-th order tensor ^{m1×m2××mK}. When *K* = 1 it reduces to a vector and when *K* = 2 it reduces to a matrix. The (*i*_{1},…, *i _{K}*)-th element of the tensor is denoted to be

For tensors, a fiber refers to the higher order analogue of the row and column of matrices. A fiber is obtained by fixing all but one of the indices of the tensor, e.g., the mode-*k* fiber of _{(k)} is given by _{i1,…,,ik−1,:,ik+1,…,iK}. Matricization, also known as unfolding, is the process to transform a tensor into a matrix. We denote _{(k)} as the mode-*k* matricization of a tensor , which arranges the mode-*k* fibers to be the columns of the resulting matrix. Another useful operation in tensors is the *k*-mode product. The *k*-mode product of a tensor ^{m1×m2××mK} with a matrix **A** ^{J×mk} is denoted as ×_{k}
**A** and is of the size *m*_{1} × × *m*_{k−1} × *J* × *m*_{k+1} × × *m _{K}*. Its entry is defined as ${(\mathcal{T}{\times}_{k}\mathbf{A})}_{{i}_{1},\dots ,{i}_{k-1},j,{i}_{k+1},\dots ,{i}_{k}}\u2254{\sum}_{{i}_{k}=1}^{{m}_{k}}{\mathcal{T}}_{{i}_{1},\dots ,{i}_{k}}{\mathbf{A}}_{j,{i}_{k}}$. In addition, for a list of matrices {

A tensor ^{m1×m2××mK} follows the tensor normal distribution with zero mean and covariance matrices **Σ**_{1},…, **Σ**_{K}, denoted as ~ TN(**0, Σ**_{1},…, **Σ**_{K}), if its probability density function is

$$p(\mathcal{T}|{\mathbf{\Sigma}}_{1},\dots ,{\mathbf{\Sigma}}_{K})={(2\pi )}^{-m/2}\{\prod _{k=1}^{K}|{\mathbf{\Sigma}}_{k}{|}^{-m/(2{m}_{k})}\}\text{exp}(-{\Vert \mathcal{T}\times {\mathbf{\Sigma}}^{-1/2}\Vert}_{F}^{2}/2),$$

(2.1)

where $m={\prod}_{k=1}^{K}{m}_{k}$ and ${\mathbf{\Sigma}}^{-1/2}\u2254\{{\mathbf{\Sigma}}_{1}^{-1/2},\dots ,{\mathbf{\Sigma}}_{K}^{-1/2}\}$. When *K* = 1, this tensor normal distribution reduces to the vector normal distribution with zero mean and covariance **Σ**_{1}. According to [9, 18], it can be shown that ~ TN(**0, Σ**_{1},…, **Σ**_{K}) if and only if vec() ~ N(vec(**0**);**Σ**_{K} **Σ**_{1}), where vec(**0**) ^{m} and is the matrix Kronecker product.

We consider the parameter estimation for the tensor normal model. Assume that we observe independently and identically distributed tensor samples _{1},…, _{n} from TN(**0**; ${\mathbf{\Sigma}}_{1}^{*},\dots ,{\mathbf{\Sigma}}_{K}^{*}$). We aim to estimate the true covariance matrices (${\mathbf{\Sigma}}_{1}^{*},\dots ,{\mathbf{\Sigma}}_{K}^{*}$) and their corresponding true precision matrices (${\mathbf{\Omega}}_{1}^{*},\dots ,{\mathbf{\Omega}}_{K}^{*}$) where ${\mathbf{\Omega}}_{k}^{*}={\mathbf{\Sigma}}_{k}^{*-1}(k=1,\dots ,K)$. To address the identifiability issue in the parameterization of the tensor normal distribution, we assume that ${\Vert {\mathbf{\Omega}}_{k}^{*}\Vert}_{F}=1$ for *k* = 1,…, *K*. This renormalization assumption does not change the graph structure of the original precision matrix.

A standard approach to estimate ${\mathbf{\Omega}}_{k}^{*}$, *k* = 1,…, *K*, is to use the maximum likelihood method via (2.1). Up to a constant, the negative log-likelihood function of the tensor normal distribution is $\text{tr}[\mathbf{S}({\mathbf{\Omega}}_{K}\otimes \cdots \otimes {\mathbf{\Omega}}_{1})]-{\sum}_{k=1}^{K}(m/{m}_{k})\text{log}|{\mathbf{\Omega}}_{k}|$, where $\mathbf{S}\u2254\frac{1}{n}{\sum}_{i=1}^{n}\text{vec}({\mathcal{T}}_{i})\text{vec}{({\mathcal{T}}_{i})}^{\top}$. To encourage the sparsity of each precision matrix in the high-dimensional scenario, we consider a penalized log-likelihood estimator, which is obtained by minimizing

$${q}_{n}({\mathbf{\Omega}}_{1},\dots ,{\mathbf{\Omega}}_{K})\u2254\frac{1}{m}\text{tr}[\mathbf{S}({\mathbf{\Omega}}_{K}\otimes \cdots \otimes {\mathbf{\Omega}}_{1})]-\sum _{k=1}^{K}\frac{1}{{m}_{k}}\text{log}|{\mathbf{\Omega}}_{k}|+\sum _{k=1}^{K}{P}_{{\lambda}_{k}}({\mathbf{\Omega}}_{k}),$$

(2.2)

where *P*_{λk} (·) is a penalty function indexed by the tuning parameter λ_{k}. In this paper, we focus on the lasso penalty [19], i.e., *P*_{λk} (**Ω**_{k}) = λ_{k}‖**Ω**_{k}‖_{1,off}. This estimation procedure applies similarly to a broad family of other penalty functions.

We name the penalized model from (2.2) as the sparse tensor graphical model. It reduces to the sparse vector graphical model [20, 21] when *K* = 1, and the sparse matrix graphical model [5–8] when *K* = 2. Our framework generalizes them to fulfill the demand of capturing the graphical structure of higher order tensor-valued data.

This section introduces the estimation procedure for the sparse tensor graphical model. A computationally efficient algorithm is provided to estimate the precision matrix for each way of the tensor.

Recall that in (2.2), *q _{n}*(

According to its bi-convex property, we propose to solve this non-convex problem by alternatively update one precision matrix with other matrices fixed. Note that, for any *k* = 1,…, *K*, minimizing (2.2) with respect to **Ω**_{k} while fixing the rest *K* − 1 precision matrices is equivalent to minimizing

$$L({\mathbf{\Omega}}_{k})\u2254\frac{1}{{m}_{k}}\text{tr}({\mathbf{S}}_{k}{\mathbf{\Omega}}_{k})-\frac{1}{{m}_{k}}\text{log}|{\mathbf{\Omega}}_{k}|+{\lambda}_{k}{\Vert {\mathbf{\Omega}}_{k}\Vert}_{1,\text{off}}.$$

(2.3)

Here ${\mathbf{S}}_{k}\u2254\frac{{m}_{k}}{nm}{\sum}_{i=1}^{n}{\mathbf{V}}_{i}^{k}{\mathbf{V}}_{i}^{k\top}$, where ${\mathbf{V}}_{i}^{k}\u2254{[{\mathcal{T}}_{i}\times \{{\mathbf{\Omega}}_{1}^{1/2},\dots ,{\mathbf{\Omega}}_{k-1}^{1/2},{\mathrm{\U0001d7d9}}_{{m}_{k}},{\mathbf{\Omega}}_{k+1}^{1/2},\dots ,{\mathbf{\Omega}}_{K}^{1/2}\}]}_{(k)}$ with × the tensor product operation and [·]_{(k)} the mode-*k* matricization operation defined in §2.1. The result in (2.3) can be shown by noting that ${\mathbf{V}}_{i}^{k}={[{\mathcal{T}}_{i}]}_{(k)}{({\mathbf{\Omega}}_{K}^{1/2}\otimes \cdots \otimes {\mathbf{\Omega}}_{k+1}^{1/2}\otimes {\mathbf{\Omega}}_{k-1}^{1/2}\otimes \cdots \otimes {\mathbf{\Omega}}_{1}^{1/2})}^{\top}$ according to the properties of mode-*k* matricization shown by [18]. Hereafter, we drop the superscript *k* of ${\mathbf{V}}_{i}^{k}$ if there is no confusion. Note that minimizing (2.3) corresponds to estimating vector-valued Gaussian graphical model and can be solved efficiently via the glasso algorithm [21].

1: | Input: Tensor samples _{1}…, _{n}, tuning parameters λ_{1},…, λ_{K}, max number of iterations T. |

2: | Initialize
${\mathbf{\Omega}}_{1}^{(0)},\dots ,{\mathbf{\Omega}}_{K}^{(0)}$ randomly as symmetric and positive definite matrices and set t = 0. |

3: | Repeat: |

4: | t = t + 1. |

5: | For
k = 1,…, K: |

6: | Given ${\mathbf{\Omega}}_{1}^{(t)},\dots ,{\mathbf{\Omega}}_{k-1}^{(t)},{\mathbf{\Omega}}_{k+1}^{(t-1)},\dots ,{\mathbf{\Omega}}_{K}^{(t-1)}$, solve (2.3) for ${\mathbf{\Omega}}_{k}^{(t)}$ via glasso [21]. |

7: | Normalize ${\mathbf{\Omega}}_{k}^{(t)}$ such that ${\Vert {\mathbf{\Omega}}_{k}^{(t)}\Vert}_{F}=1$. |

8: | End For |

9: | Until
t = T. |

10: | Output:
${\hat{\mathbf{\Omega}}}_{k}={\mathbf{\Omega}}_{k}^{(T)}(k=1,\dots ,K)$. |

The details of our Tensor lasso (Tlasso) algorithm are shown in Algorithm 1. It starts with a random initialization and then alternatively updates each precision matrix until it converges. In §3, we will illustrate that the statistical properties of the obtained estimator are insensitive to the choice of the initialization (see the discussion following Theorem 3.5).

We first prove the estimation errors in Frobenius norm, max norm, and spectral norm, and then provide the model selection consistency of our Tlasso estimator. We defer all the proofs to the appendix.

Based on the penalized log-likelihood in (2.2), we define the population log-likelihood function as

$$q({\mathbf{\Omega}}_{1},\dots ,{\mathbf{\Omega}}_{K})\u2254\frac{1}{m}\mathbb{E}\{\text{tr}[\text{vec}(\mathcal{T})\text{vec}{(\mathcal{T})}^{\top}({\mathbf{\Omega}}_{K}\otimes \cdots \otimes {\mathbf{\Omega}}_{1})]\}-\sum _{k=1}^{K}\frac{1}{{m}_{k}}\text{log}|{\mathbf{\Omega}}_{k}|.$$

(3.1)

By minimizing *q*(**Ω**_{1},…, **Ω**_{K}) with respect to **Ω**_{k}, *k* = 1,…, *K*, we obtain the population minimization function with the parameter **Ω**_{[K]−k} := {**Ω**_{1},…, **Ω**_{k−1}, **Ω**_{k+1},…, **Ω**_{K}}, i.e.,

$${M}_{k}({\mathbf{\Omega}}_{[K]-k})\u2254\underset{{\mathbf{\Omega}}_{k}}{\text{argmin}}q({\mathbf{\Omega}}_{1},\dots ,{\mathbf{\Omega}}_{K}).$$

(3.2)

For any *k* = 1,…, *K*, if **Ω**_{j} (*j* ≠ *k*) satisfies $\text{tr}({\mathbf{\Sigma}}_{j}^{*}{\mathbf{\Omega}}_{j})\ne 0$, then the population minimization function in (3.2) satisfies ${M}_{k}({\mathbf{\Omega}}_{[K]-k})=m{[{m}_{k}{\prod}_{j\ne k}\text{tr}({\mathbf{\Sigma}}_{j}^{*}{\mathbf{\Omega}}_{j})]}^{-1}{\mathbf{\Omega}}_{k}^{*}$.

Theorem 3.1 shows a surprising phenomenon that the population minimization function recovers the true precision matrix up to a constant in only one iteration. If ${\mathbf{\Omega}}_{j}={\mathbf{\Omega}}_{j}^{*}$, *j* ≠ *k*, then ${M}_{k}({\mathbf{\Omega}}_{[K]-k})={\mathbf{\Omega}}_{k}^{*}$. Otherwise, after a normalization such that ‖*M _{k}*(

In practice, when (3.1) is unknown, we can approximate it via its sample version *q _{n}*(

$${\hat{M}}_{k}({\mathbf{\Omega}}_{[K]-k})\u2254\underset{{\mathbf{\Omega}}_{k}}{\text{argmin}}{q}_{n}({\mathbf{\Omega}}_{1},\dots ,{\mathbf{\Omega}}_{K}).$$

(3.3)

In order to prove the estimation error, it remains to quantify the statistical error induced from finite samples. The following two regularity conditions are assumed for this purpose.

(Bounded Eigenvalues). For any *k* = 1,…, *K*, there is a constant *C*_{1} > 0 such that,

$$0<{C}_{1}\le {\lambda}_{\mathrm{min}}({\mathbf{\Sigma}}_{k}^{*})\le {\lambda}_{\mathrm{max}}({\mathbf{\Sigma}}_{k}^{*})\le 1/{C}_{1}<\infty ,$$

where ${\lambda}_{\mathrm{min}}({\mathbf{\Sigma}}_{k}^{*})$ and ${\lambda}_{\mathrm{max}}({\mathbf{\Sigma}}_{k}^{*})$ refer to the minimal and maximal eigenvalue of ${\mathbf{\Sigma}}_{k}^{*}$, respectively.

Condition 3.2 requires the uniform boundedness of the eigenvalues of true covariance matrices ${\mathbf{\Sigma}}_{k}^{*}$. It has been commonly assumed in the graphical model literature [22].

(Tuning). For any *k* = 1,…, *K* and some constant *C*_{2} > 0, the tuning parameter λ_{k} satisfies $1/{C}_{2}\sqrt{\text{log}{m}_{k}/(nm{m}_{k})}\le {\lambda}_{k}\le {C}_{2}\sqrt{\text{log}{m}_{k}/(nm{m}_{k})}$.

Condition 3.3 specifies the choice of the tuning parameters. In practice, a data-driven tuning procedure [23] can be performed to approximate the optimal choice of the tuning parameters.

Before characterizing the statistical error, we define a sparsity parameter for ${\mathbf{\Omega}}_{k}^{*}$, *k* = 1,…, *K*. Let ${\mathbb{S}}_{k}\u2254\{(i,j):{[{\mathbf{\Omega}}_{k}^{*}]}_{i,j}\ne 0\}$. Denote the sparsity parameter *s _{k}* := |

$$\mathbb{B}({\mathbf{\Omega}}_{k}^{*})\u2254\{\mathbf{\Omega}\in {\mathbb{R}}^{{m}_{k}\times {m}_{k}}:\mathbf{\Omega}={\mathbf{\Omega}}^{\top};\mathbf{\Omega}\succ 0;{\Vert \mathbf{\Omega}-{\mathbf{\Omega}}_{k}^{*}\Vert}_{F}\le \alpha \}.$$

(3.4)

Assume Conditions 3.2 and 3.3 hold. For any *k* = 1,…, *K*, the statistical error of the sample-based minimization function defined in (3.3) satisfies that, for any fixed ${\mathbf{\Omega}}_{j}\in \mathbb{B}({\mathbf{\Omega}}_{j}^{*})(j\ne k)$,

$${\Vert {\hat{M}}_{k}({\mathbf{\Omega}}_{[K]-k})-{M}_{k}({\mathbf{\Omega}}_{[K]-k})\Vert}_{F}={O}_{P}\left(\sqrt{\frac{{m}_{k}({m}_{k}+{s}_{k})\text{log}{m}_{k}}{nm}}\right),$$

(3.5)

where *M _{k}*(

Theorem 3.4 establishes the statistical error associated with * _{k}*(

According to Theorem 3.1 and Theorem 3.4, we obtain the rate of convergence of the Tlasso estimator in terms of Frobenius norm, which is our main result.

Assume that Conditions 3.2 and 3.3 hold. For any *k* = 1,…, *K*, if the initialization satisfies ${\mathbf{\Omega}}_{j}^{(0)}\in \mathbb{B}({\mathbf{\Omega}}_{j}^{*})$ for any *j* ≠ *k*, then the estimator _{k} from Algorithm 1 with *T* = 1 satisfies,

$${\Vert {\hat{\mathbf{\Omega}}}_{k}-{\mathbf{\Omega}}_{k}^{*}\Vert}_{F}={O}_{P}\left(\sqrt{\frac{{m}_{k}({m}_{k}+{s}_{k})\text{log}{m}_{k}}{nm}}\right),$$

(3.6)

where $m={\prod}_{k=1}^{K}{m}_{k}$ and $\mathbb{B}({\mathbf{\Omega}}_{j}^{*})$ is defined in (3.4).

Theorem 3.5 suggests that as long as the initialization is within a constant distance to the truth, our Tlasso algorithm attains a consistent estimator after only one iteration. This initialization condition ${\mathbf{\Omega}}_{j}^{(0)}\in \mathbb{B}({\mathbf{\Omega}}_{j}^{*})$ trivially holds since for any ${\mathbf{\Omega}}_{j}^{(0)}$ that is positive definite and has unit Frobenius norm, we have ${\Vert {\mathbf{\Omega}}_{j}^{(0)}-{\mathbf{\Omega}}_{k}^{*}\Vert}_{F}\le 2$ by noting that ${\Vert {\mathbf{\Omega}}_{k}^{*}\Vert}_{F}=1(k=1,\dots ,K)$ for the identifiability of the tensor normal distribution. In literature, [9] shows that there exists a local minimizer of (2.2) whose convergence rate can achieve (3.6). However, it is unknown if their algorithm can find such minimizer since there could be many other local minimizers.

A notable implication of Theorem 3.5 is that, when *K* ≥ 3, the estimator from our Tlasso algorithm can achieve estimation consistency even if we only have access to one observation, i.e., *n* = 1, which is often the case in practice. To see it, suppose that *K* = 3 and *n* = 1. When the dimensions *m*_{1}, *m*_{2}, and *m*_{3} are of the same order of magnitude and *s _{k}* =

This result indicates that the estimation of the *k*-th precision matrix takes advantage of the information from the *j*-th way (*j* ≠ *k*) of the tensor data. Consider a simple case that *K* = 2 and one precision matrix ${\mathbf{\Omega}}_{1}^{*}={\mathrm{\U0001d7d9}}_{{m}_{1}}$ is known. In this scenario the rows of the matrix data are independent and hence the effective sample size for estimating ${\mathbf{\Omega}}_{2}^{*}$ is in fact *nm*_{1}. The optimality result for the vector-valued graphical model [4] implies that the optimal rate for estimating ${\mathbf{\Omega}}_{2}^{*}$ is $\sqrt{({m}_{2}+{s}_{2})\text{log}{m}_{2}/(n{m}_{1})}$, which matches our result in (3.6). Therefore, the rate in (3.6) obtained by our Tlasso estimator is minimax-optimal since it is the best rate one can obtain even when ${\mathbf{\Omega}}_{j}^{*}(j\ne k)$ are known. As far as we know, this phenomenon has not been discovered by any previous work in tensor graphical model.

For *K* = 2, our tensor graphical model reduces to matrix graphical model with Kronecker product covariance structure [5–8]. In this case, the rate of convergence of _{1} in (3.6) reduces to $\sqrt{({m}_{1}+{s}_{1})\text{log}{m}_{1}/(n{m}_{2})}$, which is much faster than $\sqrt{{m}_{2}({m}_{1}+{s}_{1})(\text{log}{m}_{1}+\mathrm{log}{m}_{2}/n}$ established by [6] and $\sqrt{({m}_{1}+{m}_{2})\text{log}[\text{max}({m}_{1},{m}_{2},n)]/(n{m}_{2})}$ established by [7]. In literature, [5] shows that there exists a local minimizer of the objective function whose estimation errors match ours. However, it is unknown if their estimator can achieve such convergence rate. On the other hand, our theorem confirms that our algorithm is able to find such estimator with optimal rate of convergence.

We next show the estimation error in max norm and spectral norm. Trivially, these estimation errors are bounded by that in Frobenius norm shown in Theorem 3.5. To develop improved rates of convergence in max and spectral norms, we need to impose stronger conditions on true parameters.

We first introduce some important notations. Denote *d _{k}* as the maximum number of non-zeros in any row of the true precision matrices ${\mathbf{\Omega}}_{k}^{*}$, that is,

$${d}_{k}\u2254\underset{i\in \{1,\dots ,{m}_{k}\}}{\text{max}}|\{j\in \{1,\dots ,{m}_{k}\}:{[{\mathbf{\Omega}}_{k}^{*}]}_{i,j}\ne 0\}|,$$

(3.7)

with |·| the cardinality of the inside set. For each covariance matrix ${\mathbf{\Sigma}}_{k}^{*}$, we define ${\kappa}_{{\mathbf{\Sigma}}_{k}^{*}}\u2254{|\Vert {\mathbf{\Sigma}}_{k}^{*}\Vert |}_{\infty}$. Denote the Hessian matrix ${\mathbf{\Gamma}}_{k}^{*}\u2254{\mathbf{\Omega}}_{k}^{*-1}\otimes {\mathbf{\Omega}}_{k}^{*-1}\in {\mathbb{R}}^{{m}_{k}^{2}\times {m}_{k}^{2}}$, whose entry ${[{\mathbf{\Gamma}}_{k}^{*}]}_{(i,j),(s,t)}$ corresponds to the second order partial derivative of the objective function with respect to [**Ω**_{k}]_{i,j} and [**Ω**_{k}]_{s,t}. We define its sub-matrix indexed by the index set _{k} as ${[{\mathbf{\Gamma}}_{k}^{*}]}_{{\mathbb{S}}_{k},{\mathbb{S}}_{k}}={[{\mathbf{\Omega}}_{k}^{*-1}\otimes {\mathbf{\Omega}}_{k}^{*-1}]}_{{\mathbb{S}}_{k},{\mathbb{S}}_{k}}$, which is the |_{k}| × |_{k}| matrix with rows and columns of ${\mathbf{\Gamma}}_{k}^{*}$ indexed by _{k} and _{k}, respectively. Moreover, we define ${\kappa}_{{\mathbf{\Gamma}}_{k}^{*}}\u2254{|\Vert {({[{\mathbf{\Gamma}}_{k}^{*}]}_{{\mathbb{S}}_{k},{\mathbb{S}}_{k}})}^{-1}\Vert |}_{\infty}$. In order to establish the rate of convergence in max norm, we need to impose an irrepresentability condition on the Hessian matrix.

(Irrepresentability). For each *k* = 1,…, *K*, there exists some α_{k} (0, 1] such that max

$$\underset{e\in {\mathbb{S}}_{k}^{c}}{\text{max}}{\Vert {[{\mathbf{\Gamma}}_{k}^{*}]}_{e,{\mathbb{S}}_{k}}{({[{\mathbf{\Gamma}}_{k}^{*}]}_{{\mathbb{S}}_{k},{\mathbb{S}}_{k}})}^{-1}\Vert}_{1}\le 1-{\alpha}_{k}.$$

Condition 3.7 controls the influence of the non-connected terms in ${\mathbb{S}}_{k}^{c}$ on the connected edges in _{k}. This condition has been widely applied in lasso penalized models [26, 27].

(Bounded Complexity). For each *k* = 1,…, *K*, the parameters ${\kappa}_{{\mathbf{\Sigma}}_{k}^{*}},{\kappa}_{{\mathbf{\Gamma}}_{k}^{*}}$ are bounded and the parameter *d _{k}* in (3.7) satisfies ${d}_{k}=o(\sqrt{nm}/({m}_{k}\text{log}{m}_{k}))$.

Suppose Conditions 3.2, 3.3, 3.7 and 3.8 hold. Assume *s _{k}* =

$${\Vert {\hat{\mathbf{\Omega}}}_{k}-{\mathbf{\Omega}}_{k}^{*}\Vert}_{\infty}={O}_{P}\left(\sqrt{\frac{{m}_{k}\text{log}{m}_{k}}{nm}}\right).$$

(3.8)

In addition, the edge set of _{k} is a subset of the true edge set of ${\mathbf{\Omega}}_{k}^{*}$, that is, $\text{supp}({\hat{\mathbf{\Omega}}}_{k})\subseteq \text{supp}({\mathbf{\Omega}}_{k}^{*})$.

Theorem 3.9 shows that our Tlasso estimator achieves the optimal rate of convergence in max norm [4]. Here we consider the estimator obtained after two iterations since we require a new concentration inequality (Lemma B.3) for the sample covariance matrix, which is built upon the estimator in Theorem 3.5. A direct consequence from Theorem 3.9 is the estimation error in spectral norm.

Suppose the conditions of Theorem 3.9 hold, for any *k* = 1,…, *K*, we have

$${\Vert {\hat{\mathbf{\Omega}}}_{k}-{\mathbf{\Omega}}_{k}^{*}\Vert}_{2}={O}_{P}\left({d}_{k}\sqrt{\frac{{m}_{k}\text{log}{m}_{k}}{nm}}\right).$$

(3.9)

Now we compare our obtained rate of convergence in spectral norm for *K* = 2 with that established in the sparse matrix graphical model literature. In particular, [8] establishes the rate of ${O}_{P}(\sqrt{{m}_{k}({s}_{k}\vee 1)\text{log}({m}_{1}\vee {m}_{2})/(n{m}_{k})})$ for *k* = 1, 2. Therefore, when ${d}_{k}^{2}\le ({s}_{k}\vee 1)$, which holds for example in the bounded degree graphs, our obtained rate is faster. However, our faster rate comes at the price of assuming the irrepresentability condition. Using recent advance in nonconvex regularization [28], we can eliminate the irrepresentability condition. We leave this to future work.

Theorem 3.9 ensures that the estimated precision matrix correctly excludes all non-informative edges and includes all the true edges (*i, j*) with $|{[{\mathbf{\Omega}}_{k}^{*}]}_{i,j}|>C\sqrt{{m}_{k}\text{log}{m}_{k}/(nm)}$ for some constant *C* > 0. Therefore, in order to achieve the model selection consistency, a sufficient condition is to assume that, for each *k* = 1,…, *K*, the minimal signal ${\theta}_{k}\u2254{\text{min}}_{(i,j)\in \text{supp}({\mathbf{\Omega}}_{k}^{*})}{[{\mathbf{\Omega}}_{k}^{*}]}_{i,j}$ is not too small.

Under the conditions of Theorem 3.9, if ${\theta}_{k}\ge C\sqrt{{m}_{k}\text{log}{m}_{k}/(nm)}$ for some constant *C* > 0, then for any *k* = 1,…, *K*, $\text{sign}({\hat{\mathbf{\Omega}}}_{k})=\text{sign}({\mathbf{\Omega}}_{k}^{*})$, with high probability.

Theorem 3.12 indicates that our Tlasso estimator is able to correctly recover the graphical structure of each way of the high-dimensional tensor data. To the best of our knowledge, these is the first model selection consistency result in high dimensional tensor graphical model.

We compare the proposed Tlasso estimator with two alternatives. The first one is the direct graphical lasso (Glasso) approach [21] which applies the glasso to the vectorized tensor data to estimate ${\mathbf{\Omega}}_{1}^{*}\otimes \cdots \otimes {\mathbf{\Omega}}_{K}^{*}$ directly. The second alternative method is the iterative penalized maximum likelihood method (P-MLE) proposed by [9], whose termination condition is set to be ${\sum}_{k=1}^{K}{\Vert {\hat{\mathbf{\Omega}}}_{k}^{(t)}-{\hat{\mathbf{\Omega}}}_{k}^{(t-1)}\Vert}_{F}/K\le 0.001$.

For simplicity, in our Tlasso algorithm we set the initialization of *k*-th precision matrix as _{mk} for each *k* = 1,…, *K* and the total iteration *T* = 1. The tuning parameter λ_{k} is set as $20\sqrt{\text{log}{m}_{k}/(nm{m}_{k})}$. For a fair comparison, the same tuning parameter is applied in the P-MLE method. In the direct Glasso approach, its tuning parameter is chosen by cross-validation via *huge* package [29].

We consider two simulations with a third order tensor, i.e., *K* = 3. In Simulation 1, we construct a triangle graph, while in Simulation 2, we construct a four nearest neighbor graph for each precision matrix. An illustration of the generated graphs are shown in Figure 1. In each simulation, we consider three scenarios, i.e., s1: *n* = 10 and (*m*_{1}, *m*_{2}, *m*_{3}) = (10, 10, 10); s2: *n* = 50 and (*m*_{1}, *m*_{2}, *m*_{3}) = (10, 10, 10); s3: *n* = 10 and (*m*_{1}, *m*_{2}, *m*_{3}) = (100, 5, 5). We repeat each example 100 times and compute the averaged computational time, the averaged estimation error of the Kronecker product of precision matrices ${({m}_{1}{m}_{2}{m}_{3})}^{-1}{\Vert {\hat{\mathbf{\Omega}}}_{1}\otimes \cdots \otimes {\hat{\mathbf{\Omega}}}_{K}-{\mathbf{\Omega}}_{1}^{*}\otimes \cdots \otimes {\mathbf{\Omega}}_{K}^{*}\Vert}_{F}$, the true positive rate (TPR), and the true negative rate (TNR). More specifically, we denote ${a}_{i,j}^{*}$ be the (*i, j*)-th entry of ${\mathbf{\Omega}}_{1}^{*}\otimes \cdots \otimes {\mathbf{\Omega}}_{K}^{*}$, and define $\mathrm{T}\mathrm{P}\mathrm{R}\u2254{\sum}_{i,j}\mathrm{\U0001d7d9}({\xe2}_{i,j}\ne 0,{a}_{i,j}^{*}\ne 0)/{\sum}_{i,j}\mathrm{\U0001d7d9}({a}_{i,j}^{*}\ne 0)$ and $\mathrm{T}\mathrm{N}\mathrm{R}\u2254{\sum}_{i,j}\mathrm{\U0001d7d9}({\xe2}_{i,j}=0,{a}_{i,j}^{*}=0)/{\sum}_{i}\mathrm{\U0001d7d9}({a}_{i,j}^{*}=0)$.

Left two plots: illustrations of the generated graphs; Middle two plots: computational time; Right two plots: estimation errors. In each group of two plots, the left (right) is for Simulation 1 (2).

As shown in Figure 1, our Tlasso is dramatically faster than both alternative methods. In Scenario s3, Tlasso takes about five seconds for each replicate, the P-MLE takes about 500 seconds while the direct Glasso method takes more than one hour and is omitted in the plot. Tlasso algorithm is not only computationally efficient but also enjoys superior estimation accuracy. In all examples, the direct Glasso method has significantly larger errors than Tlasso due to ignoring the tensor graphical structure. Tlasso outperforms P-MLE in Scenarios s1 and s2 and is comparable to it in Scenario s3.

Table 1 shows the variable selection performance. Our Tlasso identifies almost all edges in these six examples, while the Glasso and P-MLE method miss several true edges. On the other hand, Tlasso tends to include more non-connected edges than other methods.

We would like to thank the anonymous reviewers for their helpful comments. Han Liu is grateful for the support of NSF CAREER Award DMS1454377, NSF IIS1408910, NSF IIS1332109, NIH R01MH102339, NIH R01GM083084, and NIH R01HG06841. Guang Cheng’s research is sponsored by NSF CAREER Award DMS1151692, NSF DMS1418042, Simons Fellowship in Mathematics, ONR N00014-15-1-2331 and a grant from Indiana Clinical and Translational Sciences Institute.

Wei Sun, Yahoo Labs, Sunnyvale, CA.

Zhaoran Wang, Department of Operations Research, and Financial Engineering, Princeton University, Princeton, NJ.

Han Liu, Department of Operations Research, and Financial Engineering, Princeton University, Princeton, NJ.

Guang Cheng, Department of Statistics, Purdue University, West Lafayette, IN.

1. Rendle S, Schmidt-Thieme L. Pairwise interaction tensor factorization for personalized tag recommendation; International Conference on Web Search and Data Mining; 2010.

2. Allen GI. Sparse higher-order principal components analysis; International Conference on Artificial Intelligence and Statistics; 2012.

3. Zahn J, Poosala S, Owen A, Ingram D, et al. AGEMAP: A gene expression database for aging in mice. PLOS Genetics. 2007;3:2326–2337. [PMC free article] [PubMed]

4. Cai T, Liu W, Zhou HH. Estimating sparse precision matrix: Optimal rates of convergence and adaptive estimation. Annals of Statistics. 2015

5. Leng C, Tang CY. Sparse matrix graphical models. Journal of the American Statistical Association. 2012;107:1187–1200.

6. Yin J, Li H. Model selection and estimation in the matrix normal graphical model. Journal of Multivariate Analysis. 2012;107:119–140. [PMC free article] [PubMed]

7. Tsiligkaridis T, Hero AO, Zhou S. On convergence of Kronecker graphical Lasso algorithms. IEEE Transactions on Signal Processing. 2013;61:1743–1755.

8. Zhou S. Gemini: Graph estimation with matrix variate normal instances. Annals of Statistics. 2014;42:532–562.

9. He S, Yin J, Li H, Wang X. Graphical model selection and estimation for high dimensional tensor data. Journal of Multivariate Analysis. 2014;128:165–185.

10. Jain P, Netrapalli P, Sanghavi S. Low-rank matrix completion using alternating minimization; Symposium on Theory of Computing; 2013. pp. 665–674.

11. Netrapalli P, Jain P, Sanghavi S. Phase retrieval using alternating minimization. Advances in Neural Information Processing Systems. 2013:2796–2804.

12. Sun J, Qu Q, Wright J. Complete dictionary recovery over the sphere. arXiv:1504.06785. 2015

13. Arora S, Ge R, Ma T, Moitra A. Simple, efficient, and neural algorithms for sparse coding. arXiv:1503.00778. 2015

14. Anandkumar A, Ge R, Hsu D, Kakade S, Telgarsky M. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research. 2014;15:2773–2832.

15. Sun W, Lu J, Liu H, Cheng G. Provable sparse tensor decomposition. arXiv:1502.01425. 2015

16. Zhe S, Xu Z, Chu X, Qi Y, Park Y. Scalable nonparametric multiway data analysis; International Conference on Artificial Intelligence and Statistics; 2015.

17. Zhe S, Xu Z, Qi Y, Yu P. Sparse bayesian multiview learning for simultaneous association discovery and diagnosis of alzheimer’s disease. Twenty-Ninth AAAI Conference on Artificial Intelligence. 2015

18. Kolda T, Bader B. Tensor decompositions and applications. SIAM Review. 2009;51:455–500.

19. Tibshirani R. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B. 1996;58:267–288.

20. Yuan M, Lin Y. Model selection and estimation in the gaussian graphical model. Biometrika. 2007;94:19–35.

21. Friedman J, Hastie H, Tibshirani R. Sparse inverse covariance estimation with the graphical Lasso. Biostatistics. 2008;9:432–441. [PMC free article] [PubMed]

22. Rothman AJ, Bickel PJ, Levina E, Zhu J. Sparse permutation invariant covariance estimation. Electronic Journal of Statistics. 2008;2:494–515.

23. Sun W, Wang J, Fang Y. Consistent selection of tuning parameters via variable selection stability. Journal of Machine Learning Research. 2013;14:3419–3440.

24. Ledoux M, Talagrand M. Probability in Banach Spaces: Isoperimetry and Processes. Springer; 2011.

25. Fan J, Feng Y, Wu Y. Network exploration via the adaptive Lasso and scad penalties. Annals of Statistics. 2009;3:521–541. [PMC free article] [PubMed]

26. Zhao P, Yu B. On model selection consistency of Lasso. Journal of Machine Learning Research. 2006;7:2541–2567.

27. Ravikumar P, Wainwright MJ, Raskutti G, Yu B. High-dimensional covariance estimation by minimizing _{1}-penalized log-determinant divergence. Electronic Journal of Statistics. 2011;5:935–980.

28. Loh Po-Ling, Wainwright Martin J. Support recovery without incoherence: A case for nonconvex regularization. arXiv preprint arXiv:1412.5632. 2014

29. Zhao T, Liu H, Roeder K, Lafferty J, Wasserman L. The huge package for high-dimensional undirected graph estimation in R. Journal of Machine Learning Research. 2012;13:1059–1062. [PMC free article] [PubMed]

30. Gupta A, Nagar D. Matrix variate distributions. Chapman and Hall/CRC Press; 2000.

31. Hoff P. Separable covariance arrays via the Tucker product, with applications to multivariate relational data. Bayesian Analysis. 2011;6:179–196.

32. Dawid AP. Some matrix-variate distribution theory: Notational considerations and a bayesian application. Biometrika. 1981;68:265–274.

33. Negahban S, Wainwright MJ. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. Annals of Statistics. 2011;39:1069–1097.