The algorithm produces the sparsest precision matrix that still fits the data (see below). It also guarantees that Σ^{-1} is positive-definite, and can therefore be inverted to produce the regularized covariance matrix (unlike the sample covariance matrix, which is trivial to compute but may be singular). This is important because Eqs. 1-3 require the covariance matrix, Σ. We further note that a sparse precision matrix does not imply that the corresponding covariance matrix is sparse, nor does a sparse covariance imply that the corresponding precision matrix is sparse. That is, our algorithm is not equivalent to simply thresholding the sample covariance matrix and then inverting it.
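This asymmetry between sparsity patterns is easy to demonstrate numerically. The following sketch (a hypothetical 4-variable chain model, not taken from the paper) inverts a tridiagonal precision matrix and shows that the resulting covariance matrix is fully dense:

```python
import numpy as np

# Hypothetical chain-structured GGM: each variable interacts directly
# only with its neighbours, so the precision matrix is tridiagonal.
P = np.array([[ 2., -1.,  0.,  0.],
              [-1.,  2., -1.,  0.],
              [ 0., -1.,  2., -1.],
              [ 0.,  0., -1.,  2.]])

Sigma = np.linalg.inv(P)  # the corresponding covariance matrix

print(np.sum(np.abs(P) < 1e-12))      # zeros in the precision matrix: prints 6
print(np.sum(np.abs(Sigma) < 1e-12))  # zeros in the covariance matrix: prints 0
```

Thresholding Sigma at any level that preserves its structure would therefore not recover the six structural zeros of P.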

Learning regularized precision matrices

A straightforward way of learning a GGM is to find the parameters (μ, Σ) that maximize the likelihood of the data (i.e., by finding the parameters that maximize Pr(*D* | μ, Σ), where *D* denotes the data). It is known that a maximum likelihood model can be produced by setting the pair (μ, Σ) to the sample mean and covariance matrices, respectively. Unfortunately, maximum likelihood estimates can be prone to over-fitting. This is not surprising because the covariance matrix alone contains *m* = *O*(*n*^{2}) parameters, each of which must be estimated from the data. This is relevant because the number of *independent* samples needed to obtain a statistically robust estimate of Σ grows polynomially in *m*. We note that while modern MD simulations do produce large numbers of samples (i.e., frames), these samples are *not* independent (because they form a time-series), and so the effective sample size is much smaller than the number of frames in the trajectory.
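The gap between the raw frame count and the effective sample size can be illustrated with a toy autocorrelated series; here an AR(1) process stands in for a correlated MD observable (the coefficient 0.9 and series length are arbitrary, illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# AR(1) series as a stand-in for a correlated MD observable.
phi, T = 0.9, 20000
x = np.empty(T)
x[0] = rng.standard_normal()
for t in range(1, T):
    x[t] = phi * x[t - 1] + rng.standard_normal()

# Effective sample size: T / (1 + 2 * sum of autocorrelations),
# truncating the sum at the first non-positive autocorrelation.
xc = x - x.mean()
acf = np.correlate(xc, xc, mode="full")[T - 1:] / (xc @ xc)
tail = 0.0
for rho in acf[1:]:
    if rho <= 0:
        break
    tail += rho
ess = T / (1 + 2 * tail)
print(f"{T} frames, effective sample size ~ {ess:.0f}")
```

For this process the effective sample size is roughly 5% of the number of frames, even though every frame is stored.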

Our algorithm addresses the problem of over-fitting by maximizing the following objective function:

*ll*(Σ^{-1}; *D*) - λ||Σ^{-1}||_{1}     (6)

Here, *ll*(Σ^{-1}; *D*) is the log-likelihood of the data and ||Σ^{-1}||_{1} is the *L*_{1} norm of the precision matrix, defined as the sum of the absolute values of the matrix elements. The *L*_{1} norm can be interpreted as a measure of the complexity of the model. In particular, each non-zero element of Σ^{-1} corresponds to a parameter in the model and must be estimated from the data. Thus, Eq. 6 establishes a tradeoff between the log likelihood of the data (the first term) and the complexity of the model (the second term). The scalar value λ controls this tradeoff such that higher values produce sparser precision matrices. This is our algorithm's only parameter. Its value can be computed analytically [27] from the number of frames and the number of variables in the trajectory. Alternatively, users may elect to adjust λ to obtain precision matrices of desired sparsity.
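The effect of λ on sparsity can be seen with any off-the-shelf graphical lasso solver; the sketch below uses scikit-learn's GraphicalLasso (whose `alpha` parameter plays the role of λ) purely as an illustration on synthetic data:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 6))  # synthetic data: 500 frames, 6 variables

def zero_offdiagonals(alpha):
    """Fit a graphical lasso at the given penalty; count zeroed couplings."""
    P = GraphicalLasso(alpha=alpha).fit(X).precision_
    off = P[~np.eye(P.shape[0], dtype=bool)]
    return int(np.sum(np.abs(off) < 1e-8))

# Higher penalties produce sparser precision matrices.
print(zero_offdiagonals(0.01), zero_offdiagonals(0.5))
```

At the larger penalty every off-diagonal coupling of these (independent) variables is driven exactly to zero; at the smaller penalty, spurious couplings estimated from sampling noise survive.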

Our algorithm maximizes Eq. 6 in an indirect fashion, by defining and then solving a convex optimization problem. Using the functional form of Pr(**x** | μ, Σ) according to Eq. 1, the log-likelihood of Σ^{-1} over the *T* frames in the trajectory can be rewritten as:

*ll*(Σ^{-1}; *D*) ∝ log det Σ^{-1} - (1/*T*)∑_{t}(**x**_{t} - μ)^{T}Σ^{-1}(**x**_{t} - μ)

Noting that each term (**x**_{t} - μ)^{T}Σ^{-1}(**x**_{t} - μ) equals its own trace, and that *trace*(**ABC**) = *trace*(**CAB**), the log-likelihood of Σ^{-1} can then be rewritten as:

*ll*(Σ^{-1}; *D*) ∝ log det Σ^{-1} - (1/*T*)*trace*(Σ^{-1}∑_{t}(**x**_{t} - μ)(**x**_{t} - μ)^{T})

Next, using the definition of the sample covariance matrix, **S** = (1/*T*)∑_{t}(**x**_{t} - μ)(**x**_{t} - μ)^{T}, we can define the matrix Σ^{-1} that maximizes Eq. 6 as the solution to the following optimization problem:

max_{Σ^{-1} ≻ 0} log det Σ^{-1} - *trace*(**S**Σ^{-1}) - λ||Σ^{-1}||_{1}     (7)

We note that *L*_{1} regularization is equivalent to maximizing the likelihood under a Laplace prior, and so the solution to Eq. 7 is a *maximum a posteriori* (MAP) estimate of the true precision matrix, as opposed to a maximum likelihood estimate. That is, our algorithm is a Bayesian method. Moreover, the use of *L*_{1} regularization ensures additional desirable properties, including *consistency* (given enough data, the learning procedure recovers the true model) and high statistical *efficiency* (the number of samples needed to achieve this guarantee is small).
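The correspondence with a Laplace prior follows from a standard derivation, sketched here with X = Σ^{-1} as shorthand; the prior's rate parameter is identified with the regularization weight λ:

```latex
% Elementwise Laplace prior on the precision matrix X = \Sigma^{-1}:
%   p(X_{ij}) \propto \exp\left(-\lambda \lvert X_{ij}\rvert\right)
\begin{aligned}
\log p(X \mid D) &= \log p(D \mid X) + \log p(X) + \mathrm{const} \\
                 &= \ell\ell(X; D) - \lambda \sum_{i,j} \lvert X_{ij}\rvert + \mathrm{const} \\
                 &= \ell\ell(X; D) - \lambda \lVert X \rVert_{1} + \mathrm{const},
\end{aligned}
```

so the maximizer of Eq. 6 is exactly the MAP estimate under this prior.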

We now show that the optimization problem defined in Eq. 7 is smooth and convex and can therefore be solved optimally. First, we consider the dual form of the objective. To obtain the dual, we first rewrite the *L*_{1}-norm as:

||Σ^{-1}||_{1} = max_{||**U**||_{∞} ≤ 1} *trace*(Σ^{-1}**U**)

where ||**U**||_{∞} denotes the maximum absolute value element of the matrix **U**. Given this change of formulation, the primal form of the optimization problem can be rewritten as:

max_{Σ^{-1} ≻ 0} min_{||**U**||_{∞} ≤ λ} log det Σ^{-1} - *trace*((**S** + **U**)Σ^{-1})     (8)

That is, the optimal Σ^{-1 }is the one that maximizes the worst case log likelihood over all additive perturbations of the covariance matrix.

Next, we exchange the *min* and *max* in Eq. 8. The inner *max* in the resulting function can now be solved analytically by calculating the gradient and setting it to zero. The objective can thus be written as:

min_{||**U**||_{∞} ≤ λ} - log det(**S** + **U**) - *n*

such that Σ^{-1} = (**S** + **U***)^{-1}.

After one last change of variables, **W** = **S** + **U**, the dual form of Eq. 7 can now be defined as:

Σ = argmax_{**W**} {log det **W** : ||**W** - **S**||_{∞} ≤ λ}     (9)

Eq. 9 is smooth and convex, and for small values of *n* it can be solved by standard convex multivariate optimization techniques, such as the interior point method. For larger values of *n* we use Block Coordinate Descent [27] instead.
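The box constraint in Eq. 9 gives a quick sanity check that applies to any solver's output: the regularized covariance may differ from the sample covariance by at most λ in every off-diagonal entry. A sketch using scikit-learn's graphical lasso implementation (an illustration on synthetic data, not the paper's code):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso, empirical_covariance

rng = np.random.default_rng(1)
X = rng.standard_normal((400, 5))

lam = 0.1
model = GraphicalLasso(alpha=lam, tol=1e-6).fit(X)
S = empirical_covariance(X)   # sample covariance
W = model.covariance_         # regularized covariance (the dual variable)

# Off-diagonal entries of W lie within lambda of S, per Eq. 9's constraint
# (a small slack allows for the solver's convergence tolerance).
diff = np.abs(W - S)
np.fill_diagonal(diff, 0.0)
print(diff.max() <= lam + 1e-3)
```

The returned W is also strictly positive definite, so the precision matrix W^{-1} always exists.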

Block Coordinate Descent

Given a matrix **A**, let **A**_{\k\j} denote the matrix produced by removing column *k* and row *j* of the matrix. Let **A**_{j} also denote column *j* with the diagonal element **A**_{jj} removed. The Block Coordinate Descent algorithm (Algorithm 1) [27] proceeds by optimizing one row and one column of the variable matrix **W** at a time, and iteratively optimizes all columns until a convergence criterion is met. The **W**s produced in each iteration are strictly positive definite, and so the regularized covariance matrix Σ = **W** is invertible.

**Algorithm 1 **Block Coordinate Descent

**Require**: Tolerance parameter ε, sample covariance **S**, and regularization parameter λ.

Initialize **W**^{(0)}:= **S **+ λ**I **where **I **is the identity matrix.

**repeat**

**for** *j* = 1, ..., *n* **do**

{// Here, **W**^{(j-1)} denotes the current iterate.}

Solve *y** := argmin_{*y*} {*y*^{T}(**W**^{(j-1)}_{\j\j})^{-1}*y* : ||*y* - **S**_{j}||_{∞} ≤ λ}.

Set **W**^{(j)} to **W**^{(j-1)} such that **W**_{j} is replaced by *y**.

**end for**

Set **W**^{(0) }= **W**^{(n)}

**until ***trace*((**W**^{(0)})^{-1}**S**) - *n *+ λ||(**W**^{(0)})^{-1}||_{1 }≤ ε.

**return W**^{(0)}
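A compact sketch of Algorithm 1 follows. This is an illustrative reimplementation, not the authors' code: the inner box-constrained quadratic program is handed to a generic bounded optimizer (L-BFGS-B) rather than a specialized solver, and the duality-gap test matches the algorithm's stopping criterion:

```python
import numpy as np
from scipy.optimize import minimize

def block_coordinate_descent(S, lam, eps=1e-2, max_sweeps=100):
    """Sketch of Algorithm 1: returns the regularized covariance W."""
    n = S.shape[0]
    W = S + lam * np.eye(n)                  # W^(0) := S + lambda * I
    for _ in range(max_sweeps):
        for j in range(n):
            mask = np.arange(n) != j
            Q = np.linalg.inv(W[np.ix_(mask, mask)])   # (W_{\j\j})^{-1}
            s_j = S[mask, j]
            # Inner step: minimize y^T Q y subject to ||y - S_j||_inf <= lam,
            # a box-constrained convex QP (solved here with L-BFGS-B).
            y0 = np.clip(W[mask, j], s_j - lam, s_j + lam)
            res = minimize(lambda y: y @ Q @ y, y0,
                           jac=lambda y: 2.0 * (Q @ y),
                           bounds=list(zip(s_j - lam, s_j + lam)),
                           method="L-BFGS-B")
            W[mask, j] = W[j, mask] = res.x  # update row and column j
        X = np.linalg.inv(W)
        gap = np.trace(X @ S) - n + lam * np.abs(X).sum()  # duality gap
        if gap <= eps:
            break
    return W
```

On a small sample covariance this converges in a handful of sweeps, and the iterates stay symmetric and positive definite, so Σ^{-1} = **W**^{-1} always exists.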

The time complexity of this algorithm is *O*(*n*^{4.5}/ε) [27] when converging to a solution within ε > 0 of the optimal. This complexity is better than the *O*(*n*^{6}log(1/ε)) that would have been achieved using the interior point method on the dual form [28].

In summary, the algorithm produces a time-averaged model of the data by computing the sample mean and then constructing the optimal regularized Σ by solving Eq. 9 using Block Coordinate Descent. The regularized covariance matrix Σ is guaranteed to be invertible, which means we can always compute the precision matrix, Σ^{-1}, which can be interpreted as a graph over the variables revealing the direct and indirect correlations between them.
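End to end, this pipeline can be mimicked in a few lines with any graphical lasso implementation. The sketch below uses scikit-learn's on synthetic data in which variable 2 is constructed as a noisy copy of variable 0 (all names and values are illustrative only):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(2)
X = rng.standard_normal((300, 4))
X[:, 2] = X[:, 0] + 0.3 * rng.standard_normal(300)  # direct 0-2 coupling

model = GraphicalLasso(alpha=0.1).fit(X)
P = model.precision_   # sparse, and guaranteed to exist

# Read the precision matrix as a graph: an edge wherever an
# off-diagonal entry is non-zero.
edges = [(i, j) for i in range(4) for j in range(i + 1, 4)
         if abs(P[i, j]) > 1e-8]
print(edges)
```

The planted direct coupling between variables 0 and 2 appears as an edge, while couplings that exist only through sampling noise are pruned away by the penalty.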