
**Sparse inverse covariance estimation with the graphical lasso**

Biostatistics. 2008 July; 9(3): 432–441.

Published online 2007 December 12. doi: 10.1093/biostatistics/kxm045

PMCID: PMC3019769

NIHMSID: NIHMS248717

Jerome Friedman

Department of Statistics, Stanford University, CA 94305, USA

Trevor Hastie

Department of Statistics and Department of Health Research & Policy, Stanford University, CA 94305, USA

Robert Tibshirani

Department of Health Research & Policy and Department of Statistics, Stanford University, CA 94305, USA; Email: tibs@stanford.edu

Received 2007 August 16; Revised 2007 November 6; Accepted 2007 November 8.

Copyright © The Author 2007. Published by Oxford University Press. All
rights reserved. For permissions, please e-mail:
journals.permissions@oxfordjournals.org.


**Abstract**

We consider the problem of estimating sparse graphs by a lasso penalty applied to the
inverse covariance matrix. Using a coordinate descent procedure for the lasso, we develop
a simple algorithm, the *graphical lasso*, that is remarkably fast: it solves a 1000-node
problem (~500,000 parameters) in at most a minute and is 30–4000 times faster than
competing methods. It also provides a conceptual link between the exact problem and the
approximation suggested by Meinshausen and Bühlmann (2006). We illustrate the method on
some cell-signaling data from proteomics.

**1. INTRODUCTION**

In recent years a number of authors have proposed the estimation of sparse undirected
graphical models through the use of *L*_{1} (lasso) regularization.
The basic model for continuous data assumes that the observations have a multivariate
Gaussian distribution with mean μ and covariance matrix Σ. If the
*ij*th component of Σ^{−1} is zero, then variables
*i* and *j* are conditionally independent, given the other
variables. Thus, it makes sense to impose an *L*_{1} penalty for the
estimation of Σ^{−1} to increase its sparsity.
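A small numeric illustration of this fact (our own, using an arbitrary 3 × 3 precision matrix): the (1,3) entry of Θ is zero, so variables 1 and 3 are conditionally independent given variable 2, even though they are marginally correlated.

```python
# Our own illustration: a zero in Sigma^{-1} means zero partial correlation.
import numpy as np

Theta = np.array([[2.0, 0.6, 0.0],
                  [0.6, 2.0, 0.6],
                  [0.0, 0.6, 2.0]])          # Theta[0, 2] == 0
Sigma = np.linalg.inv(Theta)
partial_corr = -Theta[0, 2] / np.sqrt(Theta[0, 0] * Theta[2, 2])
print(Sigma[0, 2])        # nonzero: variables 1 and 3 are marginally correlated
print(partial_corr)       # 0.0: conditionally independent given variable 2
```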

Meinshausen and Bühlmann (2006) take a simple
approach to this problem; they estimate a sparse graphical model by fitting a lasso model to
each variable, using the others as predictors. The *ij*th component of
Σ^{−1} is then estimated to be nonzero if either the estimated coefficient of
variable *i* on *j*, or the estimated coefficient of variable
*j* on *i*, is nonzero (alternatively, they use an AND rule).
They show that asymptotically, this consistently estimates the set of nonzero elements of
Σ^{−1}.
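This node-wise strategy is straightforward to implement; the following hypothetical scikit-learn fragment (the function name and penalty value are ours, not from the paper) sketches the OR rule:

```python
# Hypothetical sketch of the Meinshausen-Buhlmann (2006) approximation:
# regress each variable on all others with the lasso, then declare an edge
# wherever a coefficient is nonzero (OR rule; an AND rule would use &).
import numpy as np
from sklearn.linear_model import Lasso

def mb_graph(X, alpha=0.1):
    n, p = X.shape
    nonzero = np.zeros((p, p), dtype=bool)
    for j in range(p):
        others = np.delete(np.arange(p), j)
        fit = Lasso(alpha=alpha).fit(X[:, others], X[:, j])
        nonzero[j, others] = fit.coef_ != 0
    return nonzero | nonzero.T  # OR rule

# Example: edges = mb_graph(X, alpha=0.1)
```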

Other authors have proposed algorithms for the exact maximization of the
*L*_{1}-penalized log-likelihood; Yuan and Lin (2007), Banerjee *and others* (2007), and Dahl *and others* (2007) adapt
interior-point optimization methods for the solution to this problem. These papers also
establish that the simpler approach of Meinshausen and
Bühlmann (2006) can be viewed as an approximation to the exact problem.

We use the blockwise coordinate descent approach in Banerjee *and others*
(2007) as a launching point and propose a new algorithm for the exact problem. This new
procedure is extremely simple and is substantially faster than competing approaches in our tests.
It also bridges the conceptual gap between the Meinshausen and Bühlmann (2006) proposal and the exact problem.

**2. THE PROPOSED METHOD**

Suppose we have *N* multivariate normal observations of dimension
*p*, with mean μ and covariance Σ. Following Banerjee *and
others* (2007), let Θ = Σ^{−1}, and let
*S* be the empirical covariance matrix. The problem is to maximize the
penalized log-likelihood

$$\log\det\Theta - \mathrm{tr}(S\Theta) - \rho\|\Theta\|_{1} \qquad (2.1)$$

over nonnegative definite matrices Θ.^{†} Here, tr denotes the trace and
‖Θ‖_{1} is the *L*_{1} norm: the
sum of the absolute values of the elements of Σ^{−1}. Expression (2.1)
is the Gaussian log-likelihood of the data, partially maximized with respect to the mean
parameter μ. Yuan and Lin (2007) solve this
problem using the interior-point method for the “maxdet” problem, proposed by
Vandenberghe *and others* (1998).
Banerjee *and others* (2007) develop a different framework for the
optimization, which was the impetus for our work.
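As a concrete reference point, here is a minimal NumPy helper (our own, not from the paper) that evaluates the objective (2.1) for a candidate precision matrix Θ:

```python
# A minimal sketch (ours) that evaluates the penalized log-likelihood (2.1).
import numpy as np

def penalized_loglik(Theta, S, rho):
    sign, logdet = np.linalg.slogdet(Theta)   # numerically stable log-determinant
    assert sign > 0, "Theta must be positive definite"
    return logdet - np.trace(S @ Theta) - rho * np.abs(Theta).sum()
```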

Banerjee *and others* (2007) show that the problem (2.1) is convex and
consider estimation of Σ (rather than Σ ^{−1}) as follows. Let
*W* be the estimate of Σ. They show that one can solve the problem by
optimizing over each row and corresponding column of *W* in a block
coordinate descent fashion. Partitioning *W* and *S* as

$$W = \begin{pmatrix} W_{11} & w_{12} \\ w_{12}^{T} & w_{22} \end{pmatrix}, \qquad S = \begin{pmatrix} S_{11} & s_{12} \\ s_{12}^{T} & s_{22} \end{pmatrix}, \qquad (2.2)$$

they show that the solution for *w*_{12} satisfies

$$w_{12} = \operatorname{argmin}_{y} \left\{ y^{T} W_{11}^{-1} y : \|y - s_{12}\|_{\infty} \le \rho \right\}. \qquad (2.3)$$

This is a box-constrained quadratic
program (QP), which they solve using an interior-point procedure. Permuting the rows and
columns so the target column is always the last, they solve a problem like (2.3) for each
column, updating their estimate of *W* after each stage. This is repeated
until convergence. If this procedure is initialized with a positive definite matrix, they
show that the iterates remain positive definite and invertible, even if
*p* > *N*.

Using convex duality, Banerjee *and others* (2007) go on to show that
solving (2.3) is equivalent to solving the dual problem

$$\min_{\beta} \left\{ \frac{1}{2} \left\| W_{11}^{1/2}\beta - b \right\|^{2} + \rho \|\beta\|_{1} \right\}, \qquad (2.4)$$

where *b* = *W*_{11}^{−1/2}*s*_{12};^{‡} if β solves (2.4), then *w*_{12}
= *W*_{11}β solves (2.3). Expression (2.4) resembles a
lasso regression and is the basis for our approach.

First we verify the equivalence between the solutions of (2.1) and (2.4) directly. Expanding
the relation *W*Θ = *I* gives an expression that
will be useful below:

$$\begin{pmatrix} W_{11} & w_{12} \\ w_{12}^{T} & w_{22} \end{pmatrix} \begin{pmatrix} \Theta_{11} & \theta_{12} \\ \theta_{12}^{T} & \theta_{22} \end{pmatrix} = \begin{pmatrix} I & 0 \\ 0^{T} & 1 \end{pmatrix}. \qquad (2.5)$$

Now the subgradient equation for maximization of the log-likelihood (2.1) is

$$W - S - \rho\Gamma = 0, \qquad (2.6)$$

using the fact that the derivative of
log det Θ equals Θ^{−1} = *W*, given in,
for example, Boyd and Vandenberghe (2004, p. 641).
Here Γ_{ij} ∈ sign(Θ_{ij}); that is, Γ_{ij}
= sign(Θ_{ij}) if
Θ_{ij} ≠ 0, else Γ_{ij}
∈ [−1,1] if Θ_{ij} = 0.

Now the upper right block of (2.6) is

$$w_{12} - s_{12} - \rho\gamma_{12} = 0. \qquad (2.7)$$

On the other hand, the subgradient equation from (2.4) works out to be

$$W_{11}\beta - s_{12} + \rho\nu = 0, \qquad (2.8)$$

where ν ∈ sign(β)
elementwise. Now suppose (*W*, Γ) solves (2.6), and hence
(*w*_{12}, γ_{12}) solves (2.7). Then β
= *W*_{11}^{−1}*w*_{12} and ν = −γ_{12} solve
(2.8). The equivalence of the first 2 terms is obvious. For the sign terms, since
*W*_{11}θ_{12} + *w*_{12}θ_{22} = 0 from (2.5), we have that θ_{12} =
−θ_{22}*W*_{11}^{−1}*w*_{12}. Since θ_{22}
> 0, it follows that sign(θ_{12}) =
−sign(*W*_{11}^{−1}*w*_{12}) = −sign(β). This proves
the equivalence. We note that the solution β to the lasso problem (2.4) gives us (up to
a negative constant) the corresponding part of Θ: θ_{12} =
−θ_{22}β.

Now to the main point of this paper. Problem (2.4) looks like a lasso
(*L*_{1}-regularized) least-squares problem. In fact, if
*W*_{11} = *S*_{11}, then the
solutions β̂ are easily seen to equal the lasso
estimates for the *p*th variable on the others and hence are related to the Meinshausen and Bühlmann (2006) proposal. As
pointed out by Banerjee *and others* (2007), *W*_{11}
≠ *S*_{11} in general, and hence, the Meinshausen and Bühlmann (2006) approach does not yield the maximum
likelihood estimator. They point out that their blockwise interior-point procedure is
equivalent to recursively solving and updating the lasso problem (2.4), but they do not pursue
this approach. We do, to great advantage, because fast coordinate descent algorithms (Friedman *and others*, 2007) make
solution of the lasso problem very attractive.

In terms of inner products, the usual lasso estimates for the *p*th variable
on the others take as input the data *S*_{11} and
*s*_{12}. To solve (2.4), we instead use
*W*_{11} and *s*_{12}, where
*W*_{11} is our current estimate of the upper block of
*W*. We then update *w* and cycle through all of the
variables until convergence.

Note that from (2.6), the solution *w*_{ii}
= *s*_{ii} + ρ for all
*i*, since θ_{ii} > 0, and hence,
Γ_{ii} = 1. For convenience we call this algorithm
the *graphical lasso*. Here is the algorithm in detail (a hypothetical code sketch follows the list):

*Graphical lasso algorithm*

- Start with *W* = *S* + ρ*I*. The diagonal of *W* remains unchanged in what follows.
- For each *j* = 1,2,…,*p*,1,2,…,*p*,…, solve the lasso problem (2.4), which takes as input the inner products *W*_{11} and *s*_{12}. This gives a *p* − 1 vector solution β̂. Fill in the corresponding row and column of *W* using *w*_{12} = *W*_{11}β̂.
- Continue until convergence.
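The following NumPy sketch is our own minimal rendering of this algorithm (function names, tolerances, and the warm-start choice are ours); the authors' implementation is a Fortran routine wrapped in an R function. The inner solver is plain coordinate descent with soft-thresholding, as in (2.11) and (2.12) below.

```python
# A minimal sketch (ours, not the authors' code) of the graphical lasso.
import numpy as np

def soft_threshold(x, t):
    """S(x, t) = sign(x)(|x| - t)_+ from (2.12)."""
    return np.sign(x) * max(abs(x) - t, 0.0)

def lasso_cd(V, u, rho, beta, n_sweeps=100, tol=1e-6):
    """Solve (2.4) given inner products V = W11 and u = s12."""
    for _ in range(n_sweeps):
        delta = 0.0
        for j in range(len(u)):
            # inner product excluding coordinate j, as in (2.11)
            r = u[j] - V[j] @ beta + V[j, j] * beta[j]
            new = soft_threshold(r, rho) / V[j, j]
            delta = max(delta, abs(new - beta[j]))
            beta[j] = new
        if delta < tol:
            break
    return beta

def graphical_lasso(S, rho, n_outer=100, t=1e-3):
    p = S.shape[0]
    W = S + rho * np.eye(p)                  # diagonal of W stays fixed
    B = np.zeros((p, p))                     # stores the p coefficient vectors
    off = ~np.eye(p, dtype=bool)
    thresh = t * np.abs(S[off]).mean()       # stop at t * ave|S^{-diag}|
    for _ in range(n_outer):
        W_old = W.copy()
        for j in range(p):
            idx = np.arange(p) != j          # "permute" column j to the end
            B[idx, j] = lasso_cd(W[np.ix_(idx, idx)], S[idx, j], rho, B[idx, j])
            W[idx, j] = W[np.ix_(idx, idx)] @ B[idx, j]   # w12 = W11 beta
            W[j, idx] = W[idx, j]
        if np.abs(W - W_old).mean() < thresh:
            break
    Theta = np.zeros((p, p))                 # recover Theta via (2.13)-(2.14)
    for j in range(p):
        idx = np.arange(p) != j
        theta22 = 1.0 / (W[j, j] - W[idx, j] @ B[idx, j])
        Theta[j, j] = theta22
        Theta[idx, j] = -B[idx, j] * theta22
    return W, Theta
```

A call like `W_hat, Theta_hat = graphical_lasso(S, 0.1)` then returns both the estimated covariance and its inverse.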

There is a simple, conceptually appealing way to view this procedure. Given a data matrix
**X** and outcome vector **y**, we can think of the linear least-squares
regression estimates (**X**^{T}**X**)^{−1}**X**^{T}**y** as functions not of the raw data,
but of the inner products **X**^{T}**X** and
**X**^{T}**y**. Similarly, one can show that
the lasso estimates are functions of these inner products as well. Hence, in the current
problem, we can think of the lasso estimates for the *p*th variable on the
others as having the functional form

$$\mathrm{lasso}(S_{11}, s_{12}, \rho). \qquad (2.9)$$

But applying the lasso to each
variable separately does not solve problem (2.1); to solve (2.1) via the graphical lasso we instead use
the inner products *W*_{11} and *s*_{12}. That
is, we replace (2.9) by

$$\mathrm{lasso}(W_{11}, s_{12}, \rho). \qquad (2.10)$$

The point is that problem (2.1) is not equivalent to *p* separate
regularized regression problems, but to *p* coupled lasso problems that share
the same *W* and Θ = *W*^{−1}. The
use of *W*_{11} in place of
*S*_{11} shares the information between the problems in an
appropriate fashion.

Note that each iteration in step 2 implies a permutation of the rows and columns to
make the target column the last. The lasso problem in step 2 above can be efficiently
solved by coordinate descent (Friedman *and others*, 2007; Wu and Lange, 2007).
Here are the details. Letting *V*
= *W*_{11} and *u* =
*s*_{12}, the update has the form

$$\hat{\beta}_{j} \leftarrow S\left( u_{j} - \textstyle\sum_{k \ne j} V_{kj}\hat{\beta}_{k},\ \rho \right) \Big/ V_{jj} \qquad (2.11)$$

for *j* = 1,2,…,*p* − 1,1,2,…,*p* − 1,…, where
*S* is the soft-threshold operator:

$$S(x, t) = \mathrm{sign}(x)\,(|x| - t)_{+}. \qquad (2.12)$$
We cycle through the predictors until
convergence. In our implementation, the procedure stops when the average absolute change in
*W* is less than *t* · ave
|*S*^{−diag}|, where
*S*^{−diag} are the off-diagonal elements of the empirical
covariance matrix *S*, and *t* is a fixed threshold, set by
default at 0.001.

Note that β̂ will typically be sparse, and so the
computation *w*_{12}
= *W*_{11}β̂ will be fast; if there are *r* nonzero elements, it
takes *rp* operations.

Although our algorithm has estimated Σ̂ = *W*, we can recover
Θ̂ = *W*^{−1} relatively cheaply. Note that from the
partitioning in (2.5), we have

$$W_{11}\theta_{12} + w_{12}\theta_{22} = 0, \qquad w_{12}^{T}\theta_{12} + w_{22}\theta_{22} = 1,$$

from which we derive the standard partitioned inverse expressions

$$\theta_{12} = -W_{11}^{-1} w_{12}\theta_{22} \qquad (2.13)$$

$$\theta_{22} = 1 \big/ \left( w_{22} - w_{12}^{T} W_{11}^{-1} w_{12} \right). \qquad (2.14)$$

But since β̂ = *W*_{11}^{−1}*w*_{12}, we have that
θ̂_{22} = 1/(*w*_{22} − *w*_{12}^{T}β̂) and θ̂_{12} = −β̂θ̂_{22}. Thus, θ̂_{12} is a simple rescaling of β̂ by −θ̂_{22}, which is easily computed. Although these calculations
could be included in step 2 of the *graphical lasso algorithm*, they are
not needed till the end; hence, we store the coefficients β̂ for each of the
*p* problems in a *p* × *p* matrix
and compute Θ̂ after convergence.

Interestingly, if *W* = *S*, these are just the
formulas for obtaining the inverse of a partitioned matrix. That is, if we set
*W* = *S* and ρ = 0 in the above
algorithm, then one sweep through the predictors computes
*S*^{−1}, using a linear regression at each stage.
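This special case is easy to verify numerically; the following check (our own, not from the paper) computes *S*^{−1} column by column via (2.13) and (2.14):

```python
# Our own check: with W = S and rho = 0, one sweep of the partitioned-inverse
# formulas (2.13)-(2.14) reproduces S^{-1} via p linear regressions.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 5))
S = A.T @ A / 50                         # a positive definite "covariance"
p = S.shape[0]
Theta = np.zeros((p, p))
for j in range(p):
    idx = np.arange(p) != j
    beta = np.linalg.solve(S[np.ix_(idx, idx)], S[idx, j])  # regress j on the rest
    theta22 = 1.0 / (S[j, j] - S[idx, j] @ beta)            # (2.14)
    Theta[j, j] = theta22
    Theta[idx, j] = -beta * theta22                         # (2.13)

assert np.allclose(Theta, np.linalg.inv(S))
```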

REMARK 2.1 In some situations it might make sense to specify different amounts of regularization for each variable, or even to penalize each inverse covariance element differently. Thus, we maximize the log-likelihood

$$\log\det\Theta - \mathrm{tr}(S\Theta) - \|\Theta * P\|_{1}, \qquad (2.15)$$

where *P* = {ρ_{jk}} with ρ_{jk} =
ρ_{kj}, and * indicates componentwise multiplication. It is easy to
show that (2.15) is maximized by the preceding algorithm, with ρ replaced by
ρ_{jk} in the soft-thresholding step (2.11). Typically, one might build the
ρ_{jk} from per-variable values
ρ_{1}, ρ_{2},…,ρ_{p}, to allow different amounts of
regularization for each variable.
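In code, only the soft-threshold step changes: it takes a matrix of penalties. This hypothetical fragment (the names and the particular symmetric combination are ours) shows the idea:

```python
# Hypothetical elementwise-penalty variant of the soft-threshold step in
# (2.11): entry (j, k) is shrunk by its own rho[j, k], as required by (2.15).
import numpy as np

def soft_threshold_matrix(X, Rho):
    return np.sign(X) * np.maximum(np.abs(X) - Rho, 0.0)

rho_vec = np.array([0.05, 0.1, 0.2])          # per-variable penalties
Rho = np.sqrt(np.outer(rho_vec, rho_vec))     # one symmetric choice (ours)
```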

REMARK 2.2 If the diagonal elements are left out of the penalty in (2.1), the
solution for *w*_{ii} is simply
*s*_{ii}, and otherwise the algorithm is the
same as before.

**3. TIMING COMPARISONS**

We simulated Gaussian data from both *sparse* and *dense*
scenarios, for a range of problem sizes *p*. The sparse scenario is the AR(1)
model taken from Yuan and Lin (2007): ${({\Sigma}^{-1})}_{ii}=1$, ${({\Sigma}^{-1})}_{i,i-1}={({\Sigma}^{-1})}_{i-1,i}=0.5$, and zero otherwise. In the dense scenario, ${({\Sigma}^{-1})}_{ii}=2$ and ${({\Sigma}^{-1})}_{ii^{\prime}}=1$ for $i \ne i^{\prime}$. We chose the penalty parameter so that the
solution had about the actual number of nonzero elements in the sparse setting and about
half of the total number of elements in the dense setting. The graphical lasso procedure was
coded in Fortran, linked to an R language function. All timings were carried out on an Intel
Xeon 2.80 GHz processor.
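The two scenarios are easy to reproduce; here is a NumPy sketch (the parameter values come from the text, but the sampling code itself is ours):

```python
# Our own sketch of the two simulation scenarios described above.
import numpy as np

def ar1_precision(p):
    Theta = np.eye(p)
    i = np.arange(p - 1)
    Theta[i, i + 1] = Theta[i + 1, i] = 0.5   # sparse AR(1) scenario
    return Theta

def dense_precision(p):
    return np.ones((p, p)) + np.eye(p)        # diagonal 2, off-diagonal 1

def sample(Theta, n, seed=0):
    rng = np.random.default_rng(seed)
    Sigma = np.linalg.inv(Theta)
    return rng.multivariate_normal(np.zeros(len(Theta)), Sigma, size=n)

X = sample(ar1_precision(100), n=500)
S = np.cov(X, rowvar=False)
```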

We compared the graphical lasso to the COVSEL program provided by Banerjee *and
others* (2007). This is a Matlab program, with a loop that calls C code
to do the box-constrained QP for each column of the solution matrix. To be as fair as
possible to COVSEL, we only counted the CPU time spent in the C program. We set the maximum
number of outer iterations to 30 and, following the authors' code, set the duality gap
for convergence to 0.1.

The number of CPU seconds for each trial is shown in Table 1. The algorithm took between 2 and 8 iterations of the outer loop. In the
dense scenarios for *p* = 200 and 400, COVSEL had not converged by 30
iterations. We see that the graphical lasso is 30–4000 times faster than COVSEL and
only about 2–10 times slower than the approximate method.

Figure 1 shows the number of CPU seconds required
for the graphical lasso procedure, for problem sizes up to 1000. The computation time is
$O({p}^{3})$ for dense problems and considerably less than that for sparse
problems. Even in the dense scenario, it solves a 1000-node problem (~500,000
parameters) in about a minute. However, the computation time depends strongly on the value
of *ρ*, as illustrated in Table 2.

**4. ANALYSIS OF CELL-SIGNALLING DATA**

For illustration, we analyze a flow cytometry data set on *p* = 11
proteins and *n* = 7466 cells, from Sachs *and others* (2005). These authors fit a directed acyclic
graph (DAG) to the data, producing the network in Figure 2.

Figure 2. Directed acyclic graph from cell-signaling data, from Sachs *and others* (2005).

The result of applying the graphical lasso to these data is shown in Figure 3, for 12 different values of the penalty parameter
*ρ*. There is moderate agreement between, for example, the graph for
*L*_{1} norm = 0.00496 and the DAG: the former has about
half of the edges and nonedges that appear in the DAG. Figure 4 shows the lasso coefficients as a function of the total
*L*_{1} norm of the coefficient vector.

Cell-signaling data: Undirected graphs from graphical lasso with different values of
the penalty parameter *ρ*.

Cell-signaling data: Profile of coefficients as the total
*L*_{1} norm of the coefficient vector increases, that is, as
*ρ* decreases. Profiles for the largest coefficients are labeled
with the corresponding pair of proteins.

In the left panel of Figure 5, we tried 2 different
kinds of 10-fold cross-validation for estimation of the parameter *ρ*. In
the “regression” approach, we fit the graphical lasso to nine-tenths of the data
and used the penalized regression model for each protein to predict the value of that
protein in the validation set. We then averaged the squared prediction errors over all 11
proteins. In the “likelihood” approach, we again applied the graphical lasso to
nine-tenths of the data and then evaluated the log-likelihood (2.1) over the validation set.
The 2 cross-validation curves indicate that the unregularized model is best, which is not
surprising given the large number of observations and relatively small number of parameters.
However, we also see that the likelihood approach is far less variable than the regression
method.

Cell-signaling data. The left panel shows 10-fold cross-validation using both regression
and likelihood approaches (details in text). The right panel compares the regression sum of
squares of the exact graphical lasso approach to the Meinshausen–Bühlmann
approximation.

The right panel compares the cross-validated sum of squares of the exact graphical lasso approach to the Meinshausen–Bühlmann approximation. For lightly regularized models, the exact approach has a clear advantage.
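For readers who want to reproduce the likelihood-based cross-validation, here is a hypothetical sketch using scikit-learn's GraphicalLasso as a stand-in for the authors' implementation (the fold structure and function name are ours):

```python
# Hypothetical sketch of the "likelihood" cross-validation: fit on 9/10 of
# the data, then evaluate log det Theta - tr(S_val Theta) on the held-out
# tenth. GraphicalLasso stands in for the authors' glasso package.
import numpy as np
from sklearn.covariance import GraphicalLasso
from sklearn.model_selection import KFold

def cv_loglik(X, rho, n_folds=10):
    scores = []
    for train, val in KFold(n_folds, shuffle=True, random_state=0).split(X):
        Theta = GraphicalLasso(alpha=rho).fit(X[train]).precision_
        S_val = np.cov(X[val], rowvar=False)
        sign, logdet = np.linalg.slogdet(Theta)
        scores.append(logdet - np.trace(S_val @ Theta))
    return np.mean(scores)

# Example: pick rho maximizing the cross-validated log-likelihood.
# best = max(rhos, key=lambda r: cv_loglik(X, r))
```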

**5. DISCUSSION**

We have presented a simple and fast algorithm for estimation of a sparse inverse covariance
matrix using an *L*_{1} penalty. It cycles through the variables,
fitting a modified lasso regression to each variable in turn. The individual lasso problems
are solved by coordinate descent.

The speed of this new procedure should facilitate the application of sparse inverse covariance procedures to large data sets involving thousands of parameters.

An R language package, glasso, is available on the third author's Web site.

**FUNDING**

National Science Foundation (DMS-97-64431 to J.F., DMS-0505676 to T.H., DMS-9971405 to R.T.); National Institutes of Health (2R01 CA 72028-07 to T.H.); National Institutes of Health Contract (N01-HV-28183 to R.T.).

We thank the authors of Banerjee *and others* (2007) for making their COVSEL
program publicly available, Larry Wasserman for helpful discussions, and an editor and 2
referees for comments that led to improvements in the manuscript.

^{†}We note that while most authors use this formulation, Yuan and Lin (2007) omit the diagonal elements from the penalty.

^{‡}The corresponding expression in Banerjee *and others* (2007) does not have
the leading 1/2 and absorbs a constant factor into *b*. We have
written it in this equivalent form to avoid carrying constant factors later.

**References**

- Banerjee O, El Ghaoui L, d'Aspremont A. Model selection through sparse maximum likelihood estimation. Journal of Machine Learning Research. 2007 (to appear).
- Boyd S, Vandenberghe L. Convex Optimization. Cambridge: Cambridge University Press; 2004.
- Dahl J, Vandenberghe L, Roychowdhury V. Covariance selection for non-chordal graphs via chordal embedding. Optimization Methods and Software. 2007 (to appear).
- Friedman J, Hastie T, Hoefling H, Tibshirani R. Pathwise coordinate optimization. Annals of Applied Statistics. 2007;1:302–332.
- Meinshausen N, Bühlmann P. High-dimensional graphs and variable selection with the lasso. Annals of Statistics. 2006;34:1436–1462.
- Sachs K, Perez O, Pe'er D, Lauffenburger D, Nolan G. Causal protein-signaling networks derived from multiparameter single-cell data. Science. 2005;308:523–529.
- Vandenberghe L, Boyd S, Wu S-P. Determinant maximization with linear matrix inequality constraints. SIAM Journal on Matrix Analysis and Applications. 1998;19:499–533.
- Wu T, Lange K. Coordinate descent procedures for lasso penalized regression. Annals of Applied Statistics. 2007 (to appear).
- Yuan M, Lin Y. Model selection and estimation in the Gaussian graphical model. Biometrika. 2007;94:19–35.
