The algorithm produces the sparsest precision matrix that still fits the data (see below). It also guarantees that $\Sigma^{-1}$ is positive-definite, which means it can be inverted to produce the regularized covariance matrix (as opposed to the sample covariance matrix, which is trivial to compute). This is important because Eqs. 1-3 require the covariance matrix, $\Sigma$. We further note that a sparse precision matrix does not imply that the corresponding covariance matrix is sparse, nor does a sparse covariance imply that the corresponding precision matrix is sparse. That is, our algorithm is not equivalent to simply thresholding the sample covariance matrix and then inverting it.
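The following short NumPy sketch (an illustration of ours, not part of the method) makes the last point concrete: a chain-structured, tridiagonal precision matrix yields a fully dense covariance matrix, so sparsity in one does not translate into sparsity in the other.

import numpy as np

# Hypothetical 5-variable chain model: a sparse (tridiagonal) precision matrix.
precision = (np.diag(np.full(5, 2.0))
             + np.diag(np.full(4, -0.8), k=1)
             + np.diag(np.full(4, -0.8), k=-1))
covariance = np.linalg.inv(precision)

print(np.count_nonzero(precision))                # 13 non-zero entries
print(np.count_nonzero(np.round(covariance, 6)))  # 25: every entry of the covariance is non-zero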
Learning regularized precision matrices
A straightforward way of learning a GGM is to find the parameters $(\mu, \Sigma^{-1})$ that maximize the likelihood of the data (i.e., by finding the parameters that maximize $P(X \mid \mu, \Sigma^{-1})$). It is well known that a maximum likelihood model can be produced by setting the pair $(\mu, \Sigma)$ to the sample mean and sample covariance matrix, respectively. Unfortunately, maximum likelihood estimates can be prone to over-fitting. This is not surprising because the covariance matrix alone contains $O(m^2)$ parameters, each of which must be estimated from the data. This is relevant because the number of independent samples needed to obtain a statistically robust estimate of $\Sigma$ grows polynomially in $m$. We note that while modern MD simulations do produce large numbers of samples (i.e., frames), these samples are not independent (because they form a time series), and so the effective sample size is much smaller than the number of frames in the trajectory.
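The gap between frames and effective samples is easy to quantify for a single observable. The sketch below (illustrative only; the truncation rule and the AR(1) test series are our own choices, not part of the method) estimates the effective sample size as $N/(1 + 2\tau)$, where $\tau$ is the integrated autocorrelation time.

import numpy as np

def effective_sample_size(x):
    # N_eff = N / (1 + 2*tau), with tau the integrated autocorrelation time.
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    acf = np.correlate(x, x, mode="full")[n - 1:] / (np.arange(n, 0, -1) * np.var(x))
    tau = 0.0
    for t in range(1, n):
        if acf[t] <= 0.0:          # truncate at the first non-positive lag
            break
        tau += acf[t]
    return n / (1.0 + 2.0 * tau)

# A strongly autocorrelated AR(1) series has far fewer effective samples than frames.
rng = np.random.default_rng(0)
x = np.zeros(10000)
for t in range(1, len(x)):
    x[t] = 0.95 * x[t - 1] + rng.normal()
print(len(x), round(effective_sample_size(x)))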
Our algorithm addresses the problem of over-fitting by maximizing the following objective function:

$$\ell(\Sigma^{-1}) \;=\; \log P(X \mid \mu, \Sigma^{-1}) \;-\; \lambda \lVert \Sigma^{-1} \rVert_{1}, \tag{6}$$

where $\lVert \Sigma^{-1} \rVert_{1}$ is the L1 norm of the precision matrix, defined as the sum of the absolute values of its elements. The L1 norm can be interpreted as a measure of the complexity of the model. In particular, each non-zero element of $\Sigma^{-1}$ corresponds to a parameter in the model and must be estimated from the data. Thus, Eq. 6 establishes a tradeoff between the log likelihood of the data (the first term) and the complexity of the model (the second term). The scalar value $\lambda$ controls this tradeoff such that higher values produce sparser precision matrices, and it is our algorithm's only parameter. Its value can be computed analytically [27] from the number of frames and the number of variables in the trajectory. Alternatively, users may elect to adjust $\lambda$ to obtain precision matrices of a desired sparsity.
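The effect of $\lambda$ on sparsity is easy to see with scikit-learn's GraphicalLasso, which maximizes a closely related L1-penalized Gaussian likelihood (details of the penalty, such as the treatment of the diagonal, differ slightly from Eq. 6, and its parameter alpha plays the role of $\lambda$). This is only an illustrative sketch, not the solver used by our algorithm.

import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))        # stand-in for 500 frames of 10 variables

for alpha in (0.01, 0.1, 0.5):        # larger penalty -> sparser precision matrix
    model = GraphicalLasso(alpha=alpha).fit(X)
    print(alpha, np.count_nonzero(model.precision_))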
Our algorithm maximizes Eq. 6 in an indirect fashion, by defining and then solving a convex optimization problem. Using the functional form of $P$ according to Eq. 1, the log-likelihood of $\Sigma^{-1}$ over the frames $x_1, \ldots, x_N$ (where $N$ denotes the number of frames) can be rewritten as:

$$\log P(X \mid \mu, \Sigma^{-1}) \;=\; \frac{N}{2}\log\det\Sigma^{-1} \;-\; \frac{1}{2}\sum_{t=1}^{N}(x_t - \mu)^{T}\Sigma^{-1}(x_t - \mu) \;+\; \mathrm{const.}$$

Using the fact that $(x_t - \mu)^{T}\Sigma^{-1}(x_t - \mu) = \operatorname{trace}\!\big(\Sigma^{-1}(x_t - \mu)(x_t - \mu)^{T}\big)$ and that $\operatorname{trace}(AB) = \operatorname{trace}(BA)$, the log-likelihood of $\Sigma^{-1}$ can then be rewritten (up to an additive constant and a positive multiplicative factor) as:

$$\log P(X \mid \mu, \Sigma^{-1}) \;\propto\; \log\det\Sigma^{-1} \;-\; \operatorname{trace}(S\,\Sigma^{-1}).$$

Next, using the definition of the sample covariance matrix,

$$S \;=\; \frac{1}{N}\sum_{t=1}^{N}(x_t - \mu)(x_t - \mu)^{T},$$

we can define the matrix $\Sigma^{-1}$ that maximizes Eq. 6 as the solution to the following optimization problem (absorbing the constant factor into $\lambda$):

$$\max_{\Sigma^{-1} \succ 0}\;\; \log\det\Sigma^{-1} \;-\; \operatorname{trace}(S\,\Sigma^{-1}) \;-\; \lambda \lVert \Sigma^{-1} \rVert_{1}. \tag{7}$$
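Eq. 7 can be written down almost verbatim in a generic convex modeling tool. The CVXPY sketch below is purely illustrative (the function name, the synthetic sample covariance, and the generic conic solver are our own choices; it is not the block coordinate descent solver our algorithm uses, and it is practical only for small problems).

import cvxpy as cp
import numpy as np

def solve_eq7(S, lam):
    # Solve Eq. 7 directly: X stands for the precision matrix.
    m = S.shape[0]
    X = cp.Variable((m, m), PSD=True)
    objective = cp.Maximize(cp.log_det(X) - cp.trace(S @ X) - lam * cp.sum(cp.abs(X)))
    cp.Problem(objective).solve()
    return X.value

# Example with a small synthetic sample covariance matrix.
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 4))
S = np.cov(A, rowvar=False)
print(np.round(solve_eq7(S, lam=0.1), 3))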
We note that L1 regularization is equivalent to maximizing the likelihood under a Laplace prior, and so the solution to Eq. 7 is a maximum a posteriori (MAP) estimate of the true precision matrix, as opposed to a maximum likelihood estimate. That is, our algorithm is a Bayesian method. Moreover, the use of L1 regularization ensures additional desirable properties, including consistency (given enough data, the learning procedure recovers the true model) and high statistical efficiency (the number of samples needed to achieve this guarantee is small).
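To spell out the connection (a standard argument, not specific to our method): placing an element-wise Laplace prior on the precision matrix and taking the log of the posterior recovers exactly the penalized objective of Eq. 6,

$$p(\Sigma^{-1}) \;\propto\; \exp\!\big(-\lambda \lVert \Sigma^{-1} \rVert_{1}\big)
\quad\Longrightarrow\quad
\log p(\Sigma^{-1} \mid X) \;=\; \log P(X \mid \mu, \Sigma^{-1}) \;-\; \lambda \lVert \Sigma^{-1} \rVert_{1} \;+\; \text{const},$$

so maximizing the posterior (MAP estimation) is the same as maximizing Eq. 6.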
We now show that the optimization problem defined in Eq. 7 is convex and can be recast as a smooth problem that can be solved to optimality. First, we consider the dual form of the objective. To obtain the dual, we rewrite the L1 norm as:

$$\lVert X \rVert_{1} \;=\; \max_{\lVert U \rVert_{\infty} \le 1} \operatorname{trace}(XU),$$

where $\lVert U \rVert_{\infty}$ denotes the maximum absolute value among the elements of the matrix $U$. Given this reformulation, the primal form of the optimization problem can be rewritten as:

$$\max_{\Sigma^{-1} \succ 0}\; \min_{\lVert U \rVert_{\infty} \le \lambda}\; \log\det \Sigma^{-1} \;-\; \operatorname{trace}\!\big(\Sigma^{-1}(S + U)\big). \tag{8}$$

That is, the optimal $\Sigma^{-1}$ is the one that maximizes the worst-case log likelihood over all additive perturbations of the covariance matrix.
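The variational identity used above is easy to check numerically for a symmetric matrix: the maximizing perturbation is the element-wise sign matrix, and the resulting trace equals the sum of absolute values. The snippet below is only a sanity check of that identity, not part of the algorithm.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 4))
X = X + X.T                                  # symmetric, as precision matrices are

U_star = np.sign(X)                          # maximizer of trace(XU) over ||U||_inf <= 1
print(np.abs(X).sum(), np.trace(X @ U_star)) # the two values agree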
Next, we exchange the min and max in Eq. 8. The inner max in the resulting function can now be solved analytically by computing the gradient with respect to $\Sigma^{-1}$ and setting it to zero, which gives $\Sigma^{-1} = (S + U)^{-1}$. The objective can thus be written as:

$$\min_{\lVert U \rVert_{\infty} \le \lambda}\; -\log\det(S + U) \;-\; n,$$

such that $\Sigma^{-1} = (S + U^{*})^{-1}$, where $U^{*}$ is the minimizer and $n$ is the number of variables. After one last change of variables, $W = S + U$, the dual form of Eq. 7 can now be written as:

$$\max_{W}\; \big\{\, \log\det W \;:\; \lVert W - S \rVert_{\infty} \le \lambda \,\big\}. \tag{9}$$
Eq. 9 is smooth and convex, and for small values of $n$ it can be solved by standard convex multivariate optimization techniques, such as the interior point method. For larger values of $n$ we use Block Coordinate Descent [27].
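For small problems, Eq. 9 can indeed be handed to a generic conic solver as written. The CVXPY sketch below does exactly that; it is purely illustrative (the function name, synthetic data, and solver choice are ours, and the Block Coordinate Descent of the next subsection is what we actually use for large $n$).

import cvxpy as cp
import numpy as np

def solve_eq9(S, lam):
    # Eq. 9: maximize log det W subject to ||W - S||_inf <= lam.
    n = S.shape[0]
    W = cp.Variable((n, n), PSD=True)
    constraints = [cp.abs(W - S) <= lam]      # element-wise box constraint
    cp.Problem(cp.Maximize(cp.log_det(W)), constraints).solve()
    return W.value                            # regularized covariance; its inverse is the precision matrix

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 4))
S = np.cov(A, rowvar=False)
W = solve_eq9(S, lam=0.1)
print(np.round(np.linalg.inv(W), 3))          # the (sparse) precision estimate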
Block Coordinate Descent
Given a matrix $A$, let $A_{\setminus k \setminus j}$ denote the matrix produced by removing column $k$ and row $j$ from $A$. Let $A_{j}$ denote column $j$ of $A$ with the diagonal element $A_{jj}$ removed. The Block Coordinate Descent algorithm [27] is given as Algorithm 1. It proceeds by optimizing one row and one column of the variable matrix $W$ at a time, and it iteratively optimizes all columns until a convergence criterion is met. The iterates $W$ produced along the way are strictly positive definite, and so the regularized covariance matrix $\Sigma = W$ returned at convergence is guaranteed to be invertible.
Algorithm 1 Block Coordinate Descent
Require: Tolerance parameter $\varepsilon$, sample covariance $S$, and regularization parameter $\lambda$.
Initialize $W^{(0)} := S + \lambda I$, where $I$ is the identity matrix.
repeat
  for $j = 1, \ldots, n$ do
    Compute $y^{*} := \arg\min_{y} \big\{\, y^{T} \big(W^{(j-1)}_{\setminus j \setminus j}\big)^{-1} y \;:\; \lVert y - S_{j} \rVert_{\infty} \le \lambda \,\big\}$. {$W^{(j-1)}$ denotes the current iterate.}
    Set $W^{(j)}$ to $W^{(j-1)}$ such that $W_{j}$ is replaced by $y^{*}$.
  end for
  Set $W^{(0)} := W^{(n)}$.
until $\operatorname{trace}\big((W^{(0)})^{-1} S\big) - n + \lambda \lVert (W^{(0)})^{-1} \rVert_{1} \le \varepsilon$.
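To make the listing concrete, here is a compact NumPy/SciPy sketch of the same procedure. It is our own illustrative translation: a generic bound-constrained optimizer stands in for the per-column QP solver of [27] (which attacks the inner problem via its lasso dual), and the function and variable names are ours.

import numpy as np
from scipy.optimize import minimize

def block_coordinate_descent(S, lam, eps=1e-4, max_sweeps=100):
    # Illustrative translation of Algorithm 1; not the reference implementation.
    n = S.shape[0]
    W = S + lam * np.eye(n)                        # W(0) := S + lambda * I
    for _ in range(max_sweeps):
        for j in range(n):
            idx = np.arange(n) != j                # all indices except j
            M = np.linalg.inv(W[np.ix_(idx, idx)]) # (W_{\j\j})^{-1}
            s_j = S[idx, j]                        # S_j: column j of S, diagonal removed
            # Box-constrained QP: minimize y^T M y subject to |y - S_j|_inf <= lam
            bounds = list(zip(s_j - lam, s_j + lam))
            res = minimize(lambda y: y @ M @ y, x0=s_j.copy(),
                           jac=lambda y: 2.0 * (M @ y),
                           method="L-BFGS-B", bounds=bounds)
            W[idx, j] = res.x                      # replace column j (off-diagonal part) ...
            W[j, idx] = res.x                      # ... and row j, keeping W symmetric
        P = np.linalg.inv(W)
        # Convergence criterion: trace(W^{-1} S) - n + lam * ||W^{-1}||_1 <= eps
        if np.trace(P @ S) - n + lam * np.abs(P).sum() <= eps:
            break
    return W, P                                    # regularized covariance and precision

# Example usage on a random sample covariance.
rng = np.random.default_rng(0)
A = rng.normal(size=(300, 6))
S = np.cov(A, rowvar=False)
W, P = block_coordinate_descent(S, lam=0.2)
print(np.round(P, 3))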
The time complexity of this algorithm, when converging to a solution within $\varepsilon > 0$ of the optimal, is analyzed in [27]; it is lower than the complexity that would have been obtained by applying the interior point method to the dual form [28].
In summary, the algorithm produces a time-averaged model of the data by computing the sample mean and then constructing the optimal regularized $\Sigma$ by solving Eq. 9 using Block Coordinate Descent. The regularized covariance matrix $\Sigma$ is guaranteed to be invertible, which means we can always compute the precision matrix $\Sigma^{-1}$. The precision matrix can be interpreted as a graph over the variables, revealing the direct and indirect correlations between them.
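As a final illustration of how the graph is read off the model, the sketch below runs the whole pipeline on synthetic stand-in data: frames in, mean and regularized model out, edges from the non-zero off-diagonal entries of the precision matrix. Any solver for Eq. 9 could be plugged in; for brevity we again use scikit-learn's related GraphicalLasso estimator rather than our own procedure, and the threshold and feature matrix are hypothetical.

import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
X_frames = rng.normal(size=(1000, 8))   # stand-in for (num_frames, num_variables) trajectory features

mu = X_frames.mean(axis=0)              # time-averaged mean
model = GraphicalLasso(alpha=0.2).fit(X_frames)
P = model.precision_                    # estimated precision matrix

# Edges of the GGM: variable pairs with a non-zero off-diagonal precision entry.
edges = [(i, j) for i in range(P.shape[0]) for j in range(i + 1, P.shape[0])
         if abs(P[i, j]) > 1e-6]
print(len(edges), edges[:5])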