Suppose we have N multivariate normal observations of dimension p, with mean μ and covariance Σ. Following Banerjee and others (2007), let Θ = Σ^{−1}, and let S be the empirical covariance matrix. The problem is to maximize the penalized log-likelihood

log det Θ − tr(SΘ) − ρ‖Θ‖_{1}    (2.1)

over nonnegative definite matrices Θ. Here, tr denotes the trace and ‖Θ‖_{1} is the L_{1} norm, the sum of the absolute values of the elements of Σ^{−1}. Expression (2.1) is the Gaussian log-likelihood of the data, partially maximized with respect to the mean parameter μ.
Yuan and Lin (2007) solve this problem using the interior-point method for the “max-det” problem proposed by Vandenberghe and others (1998). Banerjee and others (2007) develop a different framework for the optimization, which was the impetus for our work.
Banerjee and others (2007) show that the problem (2.1) is convex and consider estimation of Σ (rather than Σ^{−1}) as follows. Let W be the estimate of Σ. They show that one can solve the problem by optimizing over each row and corresponding column of W in a block coordinate descent fashion. Partitioning W and S as

W = ( W_{11} w_{12} ; w_{12}^{T} w_{22} ),  S = ( S_{11} s_{12} ; s_{12}^{T} s_{22} ),    (2.2)

they show that the solution for w_{12} satisfies

w_{12} = argmin_{y} { y^{T} W_{11}^{−1} y : ‖y − s_{12}‖_{∞} ≤ ρ }.    (2.3)

This is a box-constrained quadratic program (QP), which they solve using an interior-point procedure. Permuting the rows and columns so the target column is always the last, they solve a problem like (2.3) for each column, updating their estimate of W after each stage. This is repeated until convergence. If this procedure is initialized with a positive definite matrix, they show that the iterates remain positive definite and invertible, even if p > N.
Using convex duality, Banerjee and others (2007) go on to show that solving (2.3) is equivalent to solving the dual problem

min_{β} { (1/2)‖W_{11}^{1/2}β − b‖^{2} + ρ‖β‖_{1} },    (2.4)

where b = W_{11}^{−1/2}s_{12}; if β solves (2.4), then w_{12} = W_{11}β solves (2.3). Expression (2.4) resembles a lasso regression and is the basis for our approach.
First we verify the equivalence between the solutions of (2.1) and (2.4) directly. Expanding the relation WΘ = I gives an expression that will be useful below:

( W_{11} w_{12} ; w_{12}^{T} w_{22} ) ( Θ_{11} θ_{12} ; θ_{12}^{T} θ_{22} ) = ( I 0 ; 0^{T} 1 ).    (2.5)
Now the subgradient equation for maximization of the log-likelihood (2.1) is

W − S − ρΓ = 0,    (2.6)

using the fact that the derivative of log det Θ equals Θ^{−1} = W, given in, for example, Boyd and Vandenberghe (2004, p. 641).
Here Γ_{ij} ∈ sign(Θ_{ij}); that is, Γ_{ij} = sign(Θ_{ij}) if Θ_{ij} ≠ 0, else Γ_{ij} ∈ [−1, 1] if Θ_{ij} = 0.
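The derivative fact used above, that the gradient of log det Θ is Θ^{−1}, is easy to confirm with a finite-difference check. The following sketch is illustrative only; the test matrix and step size are arbitrary choices of ours:

```python
import numpy as np

np.random.seed(0)
A = np.random.randn(6, 3)
Theta = A.T @ A / 6 + 0.5 * np.eye(3)   # a positive definite test matrix
eps = 1e-6

# finite-difference gradient of log det at Theta, entry by entry
G = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        E = np.zeros((3, 3))
        E[i, j] = eps
        G[i, j] = (np.linalg.slogdet(Theta + E)[1]
                   - np.linalg.slogdet(Theta - E)[1]) / (2 * eps)

# for symmetric Theta this should match W = Theta^{-1}
print(np.allclose(G, np.linalg.inv(Theta), atol=1e-5))
```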
Now the upper right block of (2.6) is

w_{12} − s_{12} − ργ_{12} = 0.    (2.7)
On the other hand, the subgradient equation from (2.4) works out to be

W_{11}β − s_{12} + ρν = 0,    (2.8)

where ν ∈ sign(β)
elementwise. Now suppose (W, Γ) solves (2.6), and hence (w_{12}, γ_{12}) solves (2.7). Then β = W_{11}^{−1}w_{12} and ν = −γ_{12} solve (2.8). The equivalence of the first two terms is obvious. For the sign terms, since W_{11}θ_{12} + w_{12}θ_{22} = 0 from (2.5), we have that θ_{12} = −θ_{22}W_{11}^{−1}w_{12}. Since θ_{22} > 0, it follows that sign(θ_{12}) = −sign(W_{11}^{−1}w_{12}) = −sign(β). This proves the equivalence. We note that the solution β to the lasso problem (2.4) gives us (up to a negative constant) the corresponding part of Θ: θ_{12} = −θ_{22}β.
Now to the main point of this paper. Problem (2.4) looks like a lasso (L_{1}-regularized) least-squares problem. In fact, if W_{11} = S_{11}, then the solutions are easily seen to equal the lasso estimates for the pth variable on the others, and hence related to the Meinshausen and Bühlmann (2006) proposal. As pointed out by Banerjee and others (2007), W_{11} ≠ S_{11} in general, and hence the Meinshausen and Bühlmann (2006) approach does not yield the maximum likelihood estimator. They point out that their blockwise interior-point procedure is equivalent to recursively solving and updating the lasso problem (2.4), but do not pursue this approach. We do, to great advantage, because fast coordinate descent algorithms (Friedman and others, 2007) make solution of the lasso problem very attractive.
In terms of inner products, the usual lasso estimates for the pth variable
on the others take as input the data S_{11} and
s_{12}. To solve (2.4), we instead use
W_{11} and s_{12}, where
W_{11} is our current estimate of the upper block of
W. We then update w and cycle through all of the
variables until convergence.
Note that from (2.6), the solution satisfies w_{ii} = s_{ii} + ρ for all i, since θ_{ii} > 0 and hence Γ_{ii} = 1. For convenience we call this algorithm the graphical lasso. Here is the algorithm in detail:

Graphical lasso algorithm

1. Start with W = S + ρI. The diagonal of W remains unchanged in what follows.
2. For each j = 1, 2, …, p, 1, 2, …, p, …, solve the lasso problem (2.4), which takes as input the inner products W_{11} and s_{12}. This gives a p − 1 vector solution β̂. Fill in the corresponding row and column of W using w_{12} = W_{11}β̂.
3. Continue until convergence.
There is a simple, conceptually appealing way to view this procedure. Given a data matrix X and outcome vector y, we can think of the linear least-squares regression estimates (X^{T}X)^{−1}X^{T}y as functions not of the raw data, but instead of the inner products X^{T}X and X^{T}y. Similarly, one can show that the lasso estimates are functions of these inner products as well. Hence, in the current problem, we can think of the lasso estimates for the pth variable on the others as having the functional form

β̂ = lasso(S_{11}, s_{12}, ρ).    (2.9)
But application of the lasso to each variable does not solve problem (2.1); to solve this via the graphical lasso we instead use the inner products W_{11} and s_{12}. That is, we replace (2.9) by

β̂ = lasso(W_{11}, s_{12}, ρ).    (2.10)

The point is that problem (2.1) is not equivalent to p separate regularized regression problems, but to p coupled lasso problems that share the same W and Θ = W^{−1}. The use of W_{11} in place of S_{11} shares the information between the problems in an appropriate fashion.
Note that each iteration in step 2 implies a permutation of the rows and columns to make the target column the last. The lasso problem in step 2 above can be efficiently solved by coordinate descent (Friedman and others, 2007; Wu and Lange, 2007). Here are the details. Letting V = W_{11} and u = s_{12}, the update has the form

β̂_{j} ← S(u_{j} − Σ_{k≠j} V_{kj}β̂_{k}, ρ) / V_{jj}    (2.11)

for j = 1, 2, …, p − 1, 1, 2, …, p − 1, …, where S is the soft-threshold operator:

S(x, t) = sign(x)(|x| − t)_{+}.    (2.12)

We cycle through the predictors until convergence. In our implementation, the procedure stops when the average absolute change in W is less than t · ave|S^{−diag}|, where S^{−diag} are the off-diagonal elements of the empirical covariance matrix S, and t is a fixed threshold, set by default at 0.001.
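The whole procedure can be sketched in a few dozen lines of numpy. This is an illustrative implementation only, not the authors' code: the function names, iteration caps, and tolerances are our own choices, and no attempt is made at the permutation bookkeeping or other efficiency tricks discussed above.

```python
import numpy as np

def soft_threshold(x, t):
    """Soft-threshold operator S(x, t) = sign(x) (|x| - t)_+ of (2.12)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_cd(V, u, rho, beta, max_iter=200, tol=1e-10):
    """Coordinate descent for the lasso subproblem (2.4), given the
    inner products V = W_11 and u = s_12; implements update (2.11)."""
    for _ in range(max_iter):
        delta = 0.0
        for j in range(len(u)):
            # u_j minus sum_{k != j} V_kj beta_k (V is symmetric)
            r = u[j] - V[:, j] @ beta + V[j, j] * beta[j]
            new = soft_threshold(r, rho) / V[j, j]
            delta = max(delta, abs(new - beta[j]))
            beta[j] = new
        if delta < tol:
            break
    return beta

def graphical_lasso(S, rho, max_sweeps=100, t=1e-4):
    """Block coordinate descent over the columns of W. Returns W and the
    p x p matrix storing the lasso coefficients beta-hat for each column."""
    p = S.shape[0]
    W = S + rho * np.eye(p)                 # diagonal of W stays fixed
    B = np.zeros((p, p))
    off = ~np.eye(p, dtype=bool)
    s_avg = np.abs(S[off]).mean()           # ave|S^{-diag}| for the stopping rule
    for _ in range(max_sweeps):
        W_old = W.copy()
        for j in range(p):
            idx = np.arange(p) != j         # target column last, implicitly
            V = W[np.ix_(idx, idx)]         # W_11
            u = S[idx, j]                   # s_12
            B[idx, j] = lasso_cd(V, u, rho, B[idx, j].copy())
            w12 = V @ B[idx, j]             # w_12 = W_11 beta-hat
            W[idx, j] = w12
            W[j, idx] = w12
        if np.abs(W - W_old).mean() < t * s_avg:
            break
    return W, B
```

On a small simulated S, the converged W should be symmetric and positive definite, with diagonal s_{ii} + ρ, and should satisfy the elementwise bound |W − S| ≤ ρ implied by the stationarity condition (2.6).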
Note that β̂ will typically be sparse, and so the computation w_{12} = W_{11}β̂ will be fast; if there are r nonzero elements, it takes rp operations.
Although our algorithm has estimated Σ̂ = W, we can recover Θ̂ = W^{−1} relatively cheaply. Note that from the partitioning in (2.5), we have

W_{11}θ_{12} + w_{12}θ_{22} = 0,
w_{12}^{T}θ_{12} + w_{22}θ_{22} = 1,

from which we derive the standard partitioned inverse expressions

θ_{12} = −W_{11}^{−1}w_{12}θ_{22},    (2.13)
θ_{22} = 1/(w_{22} − w_{12}^{T}W_{11}^{−1}w_{12}).    (2.14)

But since β̂ = W_{11}^{−1}w_{12}, we have that θ̂_{22} = 1/(w_{22} − w_{12}^{T}β̂) and θ̂_{12} = −β̂θ̂_{22}. Thus, θ̂_{12} is a simple rescaling of β̂ by −θ̂_{22}, which is easily computed. Although these calculations could be included in step 2 of the graphical lasso algorithm, they are not needed until the end; hence, we store the coefficients β̂ for each of the p problems in a p × p matrix and compute Θ̂ after convergence.
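To make the bookkeeping concrete, here is a small sketch (our own naming, not the authors' code) of recovering Θ̂ from a converged W and a p × p matrix B whose column j stores the β̂ for the problem with variable j last:

```python
import numpy as np

def recover_theta(W, B):
    """Recover Theta-hat column by column via the partitioned inverse:
    theta_22 = 1/(w_22 - w_12' beta-hat) and theta_12 = -beta-hat * theta_22."""
    p = W.shape[0]
    Theta = np.zeros((p, p))
    for j in range(p):
        idx = np.arange(p) != j
        beta = B[idx, j]
        theta22 = 1.0 / (W[j, j] - W[idx, j] @ beta)
        Theta[j, j] = theta22
        Theta[idx, j] = -beta * theta22
    return Theta
```

If B is filled with the exact regression coefficients β̂ = W_{11}^{−1}w_{12}, this reproduces W^{−1} exactly; after the graphical lasso it yields Θ̂ from the stored β̂'s.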
Interestingly, if W = S, these are just the
formulas for obtaining the inverse of a partitioned matrix. That is, if we set
W = S and ρ = 0 in the above
algorithm, then one sweep through the predictors computes
S^{−1}, using a linear regression at each stage.
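This claim is easy to check numerically. The following sketch (an illustration under our own setup, not the paper's code) performs one such sweep with W = S and ρ = 0 and compares the result against a direct inverse:

```python
import numpy as np

np.random.seed(2)
A = np.random.randn(30, 4)
S = (A.T @ A) / 30          # empirical covariance, positive definite here
p = S.shape[0]

Theta = np.zeros((p, p))
for j in range(p):          # one sweep: each column is a linear regression
    idx = np.arange(p) != j
    beta = np.linalg.solve(S[np.ix_(idx, idx)], S[idx, j])  # W_11^{-1} s_12 with W = S
    theta22 = 1.0 / (S[j, j] - S[idx, j] @ beta)
    Theta[j, j] = theta22
    Theta[idx, j] = -beta * theta22

print(np.allclose(Theta, np.linalg.inv(S)))  # prints True
```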
REMARK 2.1 In some situations it might make sense to specify different amounts of regularization for each variable, or even to allow each inverse covariance element to be penalized differently. Thus, we maximize the log-likelihood

log det Θ − tr(SΘ) − ‖Θ ∗ P‖_{1},    (2.15)

where P = {ρ_{jk}} with ρ_{jk} = ρ_{kj}, and ∗ indicates componentwise multiplication. It is easy to show that (2.15) is maximized by the preceding algorithm, with ρ replaced by ρ_{jk} in the soft-thresholding step (2.11). Typically, one might build the ρ_{jk} from some values ρ_{1}, ρ_{2}, …, ρ_{p}, to allow different amounts of regularization for each variable.
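As a small illustration of the modified step, the soft-threshold can simply take a vector of penalties in place of the scalar ρ; the particular values below are arbitrary:

```python
import numpy as np

def soft_threshold(x, t):
    """S(x, t) = sign(x) (|x| - t)_+, applied elementwise; t may be a vector."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

rho_col = np.array([0.05, 0.10, 0.20])  # hypothetical penalties rho_jk for one column
x = np.array([0.30, -0.08, 0.50])
print(soft_threshold(x, rho_col))       # middle entry is zeroed by its larger penalty
```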
REMARK 2.2 If the diagonal elements are left out of the penalty in (2.1), the solution for w_{ii} is simply s_{ii}, and otherwise the algorithm is the same as before.