PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of nihpaAbout Author manuscriptsSubmit a manuscriptHHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
 
Proc Int Conf Mach Learn. Author manuscript; available in PMC 2010 May 18.
Published in final edited form as:
Proc Int Conf Mach Learn. 2009; 382: 905–912.
doi:  10.1145/1553374.1553490
PMCID: PMC2872253
NIHMSID: NIHMS112676

Learning structurally consistent undirected probabilistic graphical models

Abstract

In many real-world domains, undirected graphical models such as Markov random fields provide a more natural representation of the statistical dependency structure than directed graphical models. Unfortunately, structure learning of undirected graphs using likelihood-based scores remains difficult because of the intractability of computing the partition function. We describe a new Markov random field structure learning algorithm, motivated by canonical parameterization of Abbeel et al. We provide computational improvements on their parameterization by learning per-variable canonical factors, which makes our algorithm suitable for domains with hundreds of nodes. We compare our algorithm against several algorithms for learning undirected and directed models on simulated and real datasets from biology. Our algorithm frequently outperforms existing algorithms, producing higher-quality structures, suggesting that enforcing consistency during structure learning is beneficial for learning undirected graphs.

1. Introduction

Probabilistic graphical models (PGMs) representing real-world networks capture important structural and functional aspects of the network by describing a joint probability distribution of all node measurements. The structure encodes conditional independence assumptions allowing the joint probability distribution to be tractably computed. When the structure is unknown, likelihood-based structure learning algorithms are employed to infer the structure from observed data.

Likelihood-based structure learning of directed acyclic graphs (DAGs), such as Bayesian networks, is widely used because the likelihood score can be tractably computed for all candidate DAGs. However, in many domains such as biology, causal implication of directed edges is difficult to ascertain without perturbations, leaving only a correlation implication. In such situations, undirected graphical models are a more natural representation of statistical dependencies. Unfortunately, likelihood-based structure learning of these models is much harder due to the intractability of the partition function (Abbeel et al., 2006).

To overcome this issue, researchers have opted several alternatives: learn graphical Gaussian models where the likelihood can be computed tractably (Li & Yang, 2005); restrict to lower order, often pairwise functions, (Margolin et al., 2005; Lee et al., 2007); use pseudolikelihood as structure score instead of likelihood (Besag, 1977); learn dependency networks (Heckerman et al., 2000; Schmidt et al., 2007); or learn Markov blanket canonical factors (Abbeel et al., 2006). Pairwise models are scalable, but, approximate higher-order dependencies by pairwise functions, which is limiting for domains where higher-order dependencies occur commonly. While dependency networks are scalable, each variable neighborhood is estimated independently, resulting in inconsistent structures when the data sample size is small. This is problematic for real-world data which often lack sufficient samples to guarantee a consistent joint probability distribution for the learned structure. Finally, Markov blanket canonical parameterization requires exhaustive enumeration of variable subsets up to a pre-specified size l, which is not scalable for networks with hundreds of nodes.

We have developed a new algorithm for learning undirected graphical models, that produces consistent structures and is scalable to be applicable for real-world domains. Our algorithm, Markov blanket search (MBS) is inspired by Abbeel et al.'s Markov blanket canonical parameterization, which established an equivalence between global canonical potentials and local Markov blanket canonical factors (MBCFs) (Abbeel et al., 2006). We extend Abbeel et al.'s result to establish further equivalence between MBCFs and per-variable canonical factors. Because per-variable canonical factors require learning Markov blankets per-variable, rather than all subsets up to size l, we save O(nl−1) computations during structure learning, where n is the number of variables. The equivalence of per-variable canonical factors and global canonical factors has been observed before (Paget, 1999). However, we are the first to use per-variable canonical factors in the context of MRF structure learning to learn consistent MRF structures. Enforcing structural consistency during search, guarantees the structure to be a MRF, and also the existence of a joint distribution for the individual conditional distributions. Thus we need not perform additional post-processing to guarantee consistent structures (Schmidt et al., 2007).

We compare our algorithm against two existing algorithms for learning undirected models: Algorithm for the Reconstruction of Accurate Cellular Networks (ARACNE) (Margolin et al., 2005), and a Lasso regression based dependency network algorithm (GGLAS) (Li & Yang, 2005). ARACNE learns pairwise dependencies, whereas GGLAS learns both pairwise and higher-order dependencies. On simulated data from networks of known topology, MBS captures the structure better than ARACNE for the majority of the datasets. Although GGLAS and MBS are often tied in performance, GGLAS's assumption that variable ordering is irrelevant, is true only for the Gaussian distribution. MBS uses a more general framework of per-variable canonical factors, which can be used with any conditional probability distribution family.

We also compare MBS to several algorithms for learning DAG structures. MBS not only outperforms the algorithms performing DAG searches, but provides a better pruning of the structure search space than the L1 regularization-based Markov blanket and sparse candidate algorithms (Schmidt et al., 2007; Friedman et al., 1999). This suggests that learning consistent structures during structure search is better than post-processing learned structures to enforce consistency. We finally apply ARACNE, MBS and the sparse candidate algorithm to four real-world microarray data sets. Subgraphs generated from MBS-inferred networks represent more biologically meaningful dependencies than subgraphs from the other algorithms.

To summarize, MBS has the following advantages: (a) it captures both higher-order and pairwise dependencies, (b) it learns consistent structures ensuring the existence of a joint distribution, (c) it provides a tractable implementation of the theoretical framework of Abbeel et al.. This final property allows MBS to scale to real-world domains with hundreds of nodes.

2. Markov random fields

A Markov random field (MRF) is an undirected, probabilistic graphical model that represents statistical dependencies among a set of random variables (RVs), X = {X1, …, Xn}. A MRF consists of a graph An external file that holds a picture, illustration, etc.
Object name is nihms112676ig1.jpg and a set of potential functions ψ = {ψ1, …, ψm}, one for each clique in An external file that holds a picture, illustration, etc.
Object name is nihms112676ig1.jpg. The graph structure describes the statistical dependencies, and the potentials describe the functional relationships between the RVs. The RVs encode the observed measurements for each node, Xi [set membership] An external file that holds a picture, illustration, etc.
Object name is nihms112676ig2.jpg. The joint probability distribution of the MRF is defined to be: P(X=x)=1Zi=1mψi(Fi=fi), where x is a joint assignment to X, Fi [subset, dbl equals] X is the variable set in the ith clique, associated with ψi; fi [subset, dbl equals] x is a joint assignment to Fi. Z is the partition function and is defined as a summation over all possible joint assignments of X.

Structure learning of MRFs using likelihood is difficult in general because of Z (Abbeel et al., 2006). This is because estimating Z requires a sum of exponentially many joint configurations of the RVs, making it intractable for real-world domains. To overcome this problem, researchers have proposed approaches that use pseudolikelihood (Besag, 1977; Heckerman et al., 2000), or, have used Markov blanket canonical parameterization (MBCP) (Abbeel et al., 2006). We use an approach similar to MBCP, which requires the estimation of optimal Markov blankets for RV subsets, Y [subset, dbl equals] X, |Y| ≤ l, where l is a pre-specified, maximum subset size. However, we learn local per-variable factors, requiring estimation of Markov blankets of only individual RVs. Avoiding Markov blanket estimation of all subsets, makes our approach scalable to domains with hundreds of nodes.

2.1. Hammersly-Clifford theorem and canonical potentials

The Hammersly-Clifford theorem establishes a one-to-one relationship between MRFs and strictly positive distributions such as the Gibbs distributions. The canonical potentials (also called An external file that holds a picture, illustration, etc.
Object name is nihms112676ig3.jpg-potentials (Paget, 1999)) are used together with the Möbius inversion theorem to prove the Hammersly-Clifford theorem (Lauritzen, 1996). The canonical potential for a subset D [subset, dbl equals] X is defined using a default joint instantiation, [x with macron] = {[x with macron]1, …, [x with macron]|X|} to X as:

ψD(D=d)=exp(UD(1)|D\U|logP(X=σ(U,X,d)))

where σ(A, B, c) is an assignment function to variables Xk [set membership] B such that σ(A, B, c)[k] = ck, if Xk [set membership] A and σ(A, B, c)[k] = [x with macron]k if Xk [negated set membership] A. σ returns an assignment for all variables in B.

The joint probability distribution for a MRF using canonical potentials is defined to be: P(X=x)=P(x¯)DCψD, where An external file that holds a picture, illustration, etc.
Object name is nihms112676ig4.jpg is the set of maximal cliques in the graph. This is true by an application of the Möbius inversion and setting ψD=0 for all D [negated set membership] An external file that holds a picture, illustration, etc.
Object name is nihms112676ig4.jpg (Lauritzen, 1996; Paget, 1999).

2.2. Markov blanket canonical parameterization

The computation of the canonical potentials is not feasible for real-world domains as they require the estimation of the full joint distribution (Abbeel et al., 2006). Markov Blanket canonical parameterization, developed by Abbeel et al., allows the computation of global canonical potentials over X, using local conditional functions called Markov blanket canonical factors (MBCFs).

The MBCF, [psi] for a set D [subset, dbl equals] X is estimated using D and its Markov blanket (MB). The MB, Mi of a variable Xi, is the set of immediate neighbors of Xi in An external file that holds a picture, illustration, etc.
Object name is nihms112676ig1.jpg and renders Xi conditionally independent of other variables, i.e., P(Xi[mid ]X\{Xi}) = P(Xi[mid ]Mi). The MB, MD of a set D, is ([union or logical sum]j Mj) \ D for all Xj [set membership] D. The MBCF, [psi] for D is also defined using the default joint instantiation, [x with macron] = {[x with macron]1, …, [x with macron]|X|} as:

ψD(D=d)=exp(UD(1)|D\U|logP(D=σ(U,D,d)MD=σ(U,MD,d))),
(1)

For MRFs of unknown structure, MBCFs are identified by searching exhaustively among all subsets Fi [subset or is implied by] X, up to size l and finding MBs for each Fi. Unfortunately, exhaustive enumeration of variable subsets becomes impractical for moderately sized networks (Abbeel et al., 2006). We show that the MBCFs can be further reduced to smaller per-variable canonical factors, which are computed using an RV and its Markov blanket.

2.3. Per-variable MB canonical factors

We now show that the MBCFs can be replaced by smaller, local functions: per-variable MB canonical factors, which does not require enumeration of all subsets up to size l. Specifically, for every MB canonical factor [psi] there exists an equivalent per-variable canonical factor ψ+. To illustrate how the per-variable factors are derived from MBCFs, we first consider a specific case of D = {Xi, Xj} in Eq 1 (Section 2.3.1), followed by a proof for the general case (Section 2.3.2).

2.3.1. Special Case of Two Variables

Let D = {Xi, Xj} and d = {xi, xj}. Note, because DMD = Ø, σ(U, MD, d) = md, the default instantiation to MD from [x with macron]. We first expand the sum inside the exponential of Eq 1 with D = {Xi, Xj}:

UD(1)|D\U|logP({Xi,Xj}=σ(U,{Xi,Xj},d)MD=m¯d)=(1)|{Xi,Xj}|logP(Xi=x¯i,Xj=x¯jMD=m¯d)+(1)|{Xj}|logP(Xi=xi,Xj=x¯jMD=m¯d)(1)|{Xi}|logP(Xi=x¯i,Xj=xjMD=m¯d)+(1)||logP(Xi=xi,Xj=xjMD=m¯d)
(2)

where the first term corresponds to U = Ø, the second term corresponds to U = {Xi} and so on. Applying the chain rule to every term in the RHS:

=log[P(Xi=x¯iXj=x¯j,MD=m¯d)P(Xj=x¯jMD=m¯d)]log[P(Xi=xiXj=x¯j,MD=m¯d)P(Xj=x¯jMD=m¯d)]log[P(Xi=x¯iXj=xj,MD=m¯d)P(Xj=xjMD=m¯d)]+log[P(Xi=xiXj=xj,MD=m¯d)P(Xj=xjMD=m¯d)]

We find that all logP(Xj[mid ]MD) terms cancel producing:

=logP(Xi=x¯i,Xj=x¯j,MD=m¯d)logP(Xi=xiXj=x¯j,MD=m¯d)logP(Xi=x¯iXj=xj,MD=m¯d)+logP(Xi=xiXj=xj,MD=m¯d)
(3)

This allows [psi] to be rewritten as:

ψD(D=d)=exp(UD(1)|D\U|logP(Xi=σ(U,{Xi},d){Xj}MD=σ(U,{Xj}MD,d)))
(4)

We assert further independence in Eq 4 because Xi is independent of all variables other than Mi. This allows us to write the original MBCF for {Xi,Xj} as the per-variable canonical factor:

ψD+(D=d)=exp(UD(1)|D\U|logP(Xi=σ(U,{Xi},d)Mi=σ(U,Mi,d)))
(5)

Thus we have equivalent the per-variable canonical factor, ψ+ from the original MBCF [psi] in Eq 1.

2.3.2. General Case

We now state the equivalence between per-variable and MB canonical factors more formally:

Theorem 2.1 Every MBCP, [psi]D, of the form in Eq 1 possesses an equivalent per-variable factor, ψD+(D=d)=exp(UD(1)|D\U|logP(Xi=σ(U,{Xi},d)Mi=σ(U,Mi,d))), where Xi [set membership] D.

Proof The proof of this equivalence involves two steps: (a) deriving ψ+ from [psi] for any general D, and (b) identifying neighbors of an RV and making independence assertions described by the graph structure.

To prove (a) we select an arbitrary Xi [set membership] D. We replace each logP(D[mid ]MD) in [psi] by log(P(Xi[mid ]D \ {Xi} [union or logical sum] MD)P(D \ {Xi}[mid ]MD)). We have 2|D| number of logP(D \ {Xi}[mid ]MD) terms, one for each U [subset, dbl equals] D. Xi does not occur in these terms, as we have conditioned on it. These terms can be grouped into two sets, Sod and Sev, where Sod and Sev correspond to subsets of D with odd and even number of elements, respectively. Assuming |D| is even, all elements in Sod have a −ve sign and all elements in Sev have a +ve sign. Further, for every t [set membership] Sev corresponding to U [subset, dbl equals] D there exists t′ [set membership] Sod corresponding to U[subset, dbl equals] D, such that U and U′ differ only in Xi. Because Xi does not occur in either t or t′, these two terms cancel. Applying this to all elements of Sod and Sev, the two subsets cancel each other, thus proving (a). If |D| is odd, elements of Sod and Sev have +ve and −ve signs, respectively, and the rest of the argument follows.

The final step is to identify the neighbors of Xi and using the local Markov property, P(Xi[mid ]D \ {Xi} [union or logical sum] MD) = P(Xi[mid ]Mi), for strictly positive distributions (Lauritzen, 1996).

The equivalence of the per-variable factors and MBCFs implies that, instead of searching over all size l subsets of X, we can estimate canonical factors by searching for MBs of individual RVs. Assuming that the MBs are estimated correctly, Eq 5 will produce the same canonical factors as Eq 1. Our structure learning algorithm therefore requires the estimation of MBs of each RV. We only need to ensure structural consistency (Section 2.4). Searching for n MBs, as opposed to nl MBs in MBCF, saves us O(nl−1) computations.

The per-variable canonical factors and MBCFs do not deny the hardness of computing likelihood in MRFs (Abbeel et al., 2006). This is because computing P(X = [x with macron]) is equivalent to computing 1Z.

Algorithm 1 Markov Blanket Search

Input:
 Random variable set, X = {X1, …, X|X|}
 maximum neighborhood sizes, kmax, khard
Output:
 Inferred graph structure An external file that holds a picture, illustration, etc.
Object name is nihms112676ig1.jpg
for k = 1; kkmax; k + + do
 for Xi [set membership] X do {Add stage}
  Find best new MB variable Xj that maximizes ΔSij
  s.t. |Mi| ≤ k (Eq 6)
 end for
 for Xi [set membership] X do {Swap stage}
  for Xi [set membership] [M with circumflex]ik do
   for Xq [set membership] X \ ([M with circumflex]ik [union or logical sum] {Xi}) and |[M with circumflex]qk| ≤ khard and Xq [negated set membership] tabulist(Xi) do
    Delete {Xi, Xj}, add {Xi, Xq}, add Xj to tabulist(Xi) if swapping Xq for Xj gives maximal score improvement.
   end for
  end for
 end for
end for

2.4. Markov blanket search (MBS) algorithm

The MBS algorithm learns the structure of a MRF by finding the best neighborhood or Markov blanket (MB) for each RV. To identify the best MB, we need to optimize a score, S(Xi[mid ]Mi) per RV Xi, which quantifies dependence between a RV and its MB. Examples of such scores include pseudolikelihood (dependency networks) or conditional entropy (MBCP) (Cover & Thomas, 1991). For example, the best MB identified via conditional entropy, H(Xi[mid ]Mi) is: Mi = arg min[M with circumflex]i H(Xi[mid ][M with circumflex]i). Best MB via pseudolikelihood, PLL(Xi[mid ]Mi), is: Mi = arg max[M with circumflex]i PLL(Xi[mid ][M with circumflex]i)

Dependency networks and MBCP identify the best MB per RV by optimizing S(Xi[mid ]Mi) per RV1. However, optimizing S(Xi[mid ]Mi) per RV independently, may result in inconsistent MBs. In particular, we cannot guarantee that if Xj [set membership] Mi, then Xi [set membership] Mj. This inconsistency can be handled as a post-processing of the learned MBs (Schmidt et al., 2007). However, our experiments suggest that a post-processing approach produces lower quality MBs (Section 3.2).

We propose a different approach that finds consistent MBs during the search process. To find consistent MBs, we search MBs, not only using the improvement in S(Xi[mid ]Mi) on adding Xj, but also the score change in S(Xj[mid ]Mj) if Xi was added to Mj. This is done by computing the net score gain per candidate MB for Xi. Let [M with circumflex]ik−1 denote the best MB for Xi obtained so far. Then the score gain is:

ΔSij=S(XiM^ik1)S(XiM^ik1{Xj})+S(XjM^jk1)S(XjM^jk1{Xi}).
(6)

Our approach is similar to Hofmann & Tresp's edge-based score for guaranteeing consistency (Hofmann & Tresp, 1998). However, their search strategy starts from a fully connected network and removes edges, whereas we add and replace edges starting with a completely disconnected network. For real-world domains, growing larger neighborhoods from smaller neighborhoods is more feasible than shrinking large neighborhoods, because we may not have enough data for reliably learning large neighborhoods.

The MBS structure learning algorithm uses Eq 6 to greedily identify the best MB for each variable (Algo. 1). Each iteration uses a combination of add and swap operations to learn the best structure. In the add stage of the kth iteration, we make one variable extensions to the current [M with circumflex]ik−1 of each Xi, restricting to at most kkmax RVs per MB.

In the swap stage, we revisit all variables Xj [set membership] [M with circumflex]ik−1 for each Xi, and consider other RVs, Xq [negated set membership] ({Xi} [union or logical sum] [M with circumflex]ik), which if swapped in instead of Xj, gives a score improvement. If so, we replace Xj by Xq with the maximal score improvement, in [M with circumflex]ik, and store Xj in the tabu list of Xi. This prevents Xj from being included in Xi's MB in subsequent iterations. In the swap stage, a variable can be present in > kmax MBs. However, no variable can be in more than khard = 20 MBs. Thus, nodes in our inferred networks have degrees ≥ khard, which reasonably models hub nodes in most domains.

The per-variable canonical factor equivalence exploited by MBS to identify the MRF structure does not make any specific assumptions of the parametric form of the conditional probability distributions. MBS only requires that the candidate MBs be scored using the conditional probability distributions. So MBS can potentially be instantiated with any probability distribution and choice of score. For empirical evaluation of our framework, we selected P(Xi[mid ]Mi) to be conditional Gaussians and S(Xi[mid ]Mi) to be the regularized conditional entropy for each Xi: S(Xi[mid ]Mi) = H(Xi[mid ]Mi) + λlog(|Mi|). λlog(|Mi|) penalizes large MBs and 0 ≤ λ ≤ 1 is a regularization coefficient.

3. Results

We compared our Markov Blanket Search (MBS) algorithm against existing algorithms for undirected and directed graphs on both simulated and real data. Our test data is from biology. The simulated data are from networks of known structure, enabling a direct validation of the inferred structures. The real data is from microarray experiments. However, as the true network for the real data is not known, we use biological literature to validate the inferred structures. This test framework is common in bioinformatics (Margolin et al., 2005; Li & Yang, 2005).

3.1. Comparison on simulated datasets

We compared MBS to two undirected algorithms: Algorithm for the Reconstruction of Accurate Cellular Networks (ARACNE) (Margolin et al., 2005), and a Lasso regression-based Graphical Gaussian model (GGLAS) (Li & Yang, 2005). We also compared MBS against several directed models provided in the DAGLearn software2: full DAG search (FULLDAG), LARs based order search (ORDLAS), DAG search using Sparse candidate for pruning (SPCAND), and DAG search using L1 regularization based Markov blanket estimation (L1MB) (Schmidt et al., 2007). Because L1MB does not learn consistent Markov blankets, a post-processing step is required to make the structures consistent. The AND post-processing removes Xi from Mj if Xj [negated set membership] Mi, where Mi and Mj are Xi's and Xj's MB, respectively. The OR post-processing includes Xi in Mj if Xj [set membership] Mi. We refer to L1MB with AND and OR post-processing as MBAND and MBOR, respectively. We also included an implementation of order MCMC (ORDMC) for Bayes net search.3

The simulated datasets were generated by a gene regulatory network simulator using differential equations for describing gene and protein expression dynamics (Roy et al., 2008). We generated four datasets: G50, G75, ECOLI1 and ECOLI2 with n = 100, 150, 188 and 188 nodes, respectively. Each sample consists of steady-state expressions reached after perturbing the kinetic constants of the genes. Networks for G50 and G75 were generated de novo by the simulator. The network for both ECOLI1 and 2 belong to the bacteria, E. coli. In ECOLI2 only a subset of the genes (regulators) are perturbed, whereas in ECOLI1, G50 and G75 all genes are perturbed.

As the true network topologies for these data are known, we compared the algorithms using the match between the inferred and true network structures. Because we are interested in higher-order dependencies, we matched subgraphs rather than edges. Briefly we extracted subgraphs of different types (e.g. cycles, neighborhood) from the true network and used an F-score measure to match the vertex neighborhood and edge set per subgraph. We refer to the scores for vertex neighborhood as V-scores and for edge set as E-scores. We use shortest path neighborhoods (SPN), r-radius neighborhoods comprising a vertex and its neighbors ≤ r steps away (r [set membership] {1, 2}, denoted by 1N and 2N), and cycles of size r (r [set membership] {3, 4}, denoted by 3C and 4C). ECOLI1 and 2 did not have any cycles. We moralize the inferred DAGs prior to comparison.

Our comparison used E and V-scores averaged over four runs per algorithm corresponding to different settings of an algorithm-specific parameter. This parameter is λ in MBS (Section 2.4), data processing inequality d in ARACNE, and hyper-prior parameter for the variance in GGLAS. For all DAG searches other than ORDMC, we used different random restart probabilities to generate different candidate graphs. In our experiments, λ [set membership] {1e − 5, 3e − 5, 5e − 5, 7e − 5} and d [set membership] {0, 0.3, 0.5, 0.7}. All simulated experiments used 1 ≤ kmax ≤ 11. For ORDMC, we varied the edge posterior probability. For each parameter setting, the graph with the highest average of E and V score is used. We compare the best graph per algorithm across different parameter settings.

We show results on two of the four datasets, G50 and ECOLI1 (Table 1). Our complete results are summarized in Table 2. For all datasets other than ECOLI1, MBS significantly beats all algorithms at least as often as it is beaten (Student's t-test, p-value ≤ 0.05). On ECOLI1, ARACNE outperforms all algorithms, suggesting that ECOLI1 likely does not contain many higher-order dependencies. There is no significant difference between MBS and ARACNE on ECOLI2, which is generated from the same network as ECOLI1 using different perturbations.

Table 1
Algorithm comparison on two datasets. Rows give different structure scores; columns are different structure learning algorithms; each entry is an E or V score. Bold* and italics indicate that MBS performs significantly better or worse than the algorithm ...
Table 2
Number of times MBS loses/beats statistically significantly another algorithm. Rows are for different datasets. GGLAS did not complete on ECOLI2.

We find that the performance margin between MBS and the DAG learning models is greater than undirected learning algorithms, suggesting undirected graphs may be better representations for this domain. Overall, MBS does a better job of learning the network structures compared to both directed and undirected algorithms for majority of the datasets.

3.2. Structural consistency for pruning DAGs

To assess the value of enforcing consistency during learning, rather than as a post-processing step, we used the MBS-learned Markov blankets as family constraints in DAG search algorithms. We compared the DAG structures constrained using MBS Markov blankets against those constrained by Sparse candidate (SPCAND) and L1 MB regularization (L1MB). L1MB uses either an OR or AND of the Markov blankets to generate consistent Markov blankets.

We used the maximum size of L1MB AND and OR Markov blankets as the neighborhood size, k, for MBS and SPCAND. We first compared L1MB with OR post-processing (MBOR) using k = 11 for both G50 and G75 (Table 3). We found the DAGs constrained by MBS-learned Markov blankets to significantly outperform both SPCAND or L1MB-constrained DAGs more often than being outperformed. Using L1MB AND (k = 4, 6 for G50 and G75, respectively) MBS outperformed SPCAND or L1MB with a greater margin. This indicates that enforcing consistency, during structure learning produces higher-quality Markov blankets, than as a post-processing step.

Table 3
Comparison of MBS pruning against Sparse candidate and L1 MB regularization. Legend same as Table 1.

3.3. Comparison on real biological data

We compared MBS against ARACNE and SPCAND on real-world biological data. GGLAS did not complete within 48 hrs on this data, so is omitted. Each dataset measures the gene expression response of two different populations of yeast cells, Quiescent (Q) and Non-quiescent (NQ), to genetic perturbations (Aragon et al., 2008). Each dataset had a biological replicate, resulting in four datasets: Q1, Q2, NQ1 and NQ2. We pre-processed these data to include n = 1808 genes with < 80% missing data. We used MBS with λ = 10−4, kmax = 4, ARACNE with dpi = 0.3 and SPCAND with 4 parents. A relatively large value of λ, compared to that in simulated networks was used because of the large number of nodes in this data.

As the true network is not known, we used statistical enrichment of subgraphs generated from the inferred networks, in biological processes from Gene ontology (GO), to assess the quality of the networks. For each inferred network, we generated neighborhood subgraphs of radius r = 1. For each subgraph g and term t, we computed a hyper-geometric p-value, assessing the statistical significance of observing g's genes to be annotated with t. The lower the p-value the better is the statistical enrichment. We first computed the number of GO terms MBS had better (lower p-value) or worse enrichment compared to other algorithms (Table 4). We found that MBS had more terms with better enrichment in all except one case (ARACNE, NQ2). This suggests that networks identified via MBS capture more significant biological dependencies.

Table 4
Number of GO terms MBS has worse/better enrichment than other algorithms.

We also compared the algorithms using two other measures (Fig 1). Enrichment sensitivity is the ratio of the min(no. of subgraphs, no. of enriched terms) to the total number of subgraphs. Enrichment locality is the correlation between average p-value of a term and the number of subgraphs enriched in that term. A positive correlation suggests that terms with higher p-values (less enriched) are associated with many subgraphs, whereas terms with lower p-values (more enriched) are associated with fewer subgraphs. Ideally, an algorithm should identify good enrichment for the majority of subgraphs (high sensitivity), and also associate highly enriched terms with a few subgraphs (high locality).

Figure 1
Enrichment sensitivity and locality for significance p < 10−3. Higher values are better.

We used different stringency levels of enrichment (p-value [set membership] {10−3, 10−4, 10−5, 10−6}) to assess the enrichment sensitivity and locality of the algorithms on all four datasets (Fig. 3.2 shows the sensitivity and locality for p-value< 10−3). At p-value < 10−3 and < 10−4, ARACNE and SPCAND had significantly higher sensitivity, but significantly lower locality than MBS (Wilcoxon rank sum, p < 0.05). However, there was no statistical difference for higher stringency of enrichment. These results suggest that there is a trade off between different algorithms for biological data. MBS identifies subgraphs that are locally coherent at the cost of having fewer subgraphs that are enriched in a term. On the other hand, ARACNE and SPCAND identify more subgraphs with enrichment, but may overly fragment coherent gene groups. Finally, there is no significant difference between algorithms at higher stringency, suggesting that the algorithms agree on the GO terms that are the most significant.

4. Conclusion

We have described a new algorithm for inferring undirected graphs that yields structurally consistent graphs, guaranteeing a joint probability distribution for the RVs. We compared our algorithm to several algorithms for learning undirected and directed models. On simulated data, we show that enforcing consistency during structure learning more accurately captures the graph structure. Our approach also produces higher-quality Markov blankets, that when used to prune DAG searches, yields better structures. On real data, MBS identifies more significant ontology terms associated with functionally coherent gene groups.

Acknowledgments

This work is supported by grants from NIMH (1R01MH076282-03) and NSF (IIS-0705681) to T.L., from NIH (GM-67593) and NSF (MCB0734918) to M.W.W. and from HHMI-NIH/NIBIB (56005678).

Footnotes

1In MBCP estimation, MBs of variable sets are identified independently. MBCP requires an additional subset consistency check: if X [subset or is implied by] Y, then MX [subset or is implied by] (MY [union or logical sum] (Y \ X))

2http://www.cs.ubc.ca/~murphyk/Software/DAGlearn/

3http://www.bioss.ac.uk/staff/.adriano/comparison/comparison.html

References

  • Abbeel P, Koller D, Ng AY. Learning factor graphs in polynomial time and sample complexity. JMLR. 2006;7:1743–1788.
  • Aragon AD, Rodriguez AL, Meirelles O, Roy S, Davidson GS, Tapia PH, Allen C, Joe R, Benn D, Werner-Washburne M. Characterization of differentiated quiescent and nonquiescent cells in yeast stationary-phase cultures. Mol Biol Cell. 2008 E07–07–0666+ [PMC free article] [PubMed]
  • Besag J. Efficiency of pseudolikelihood estimation for simple gaussian fields. Biometrika. 1977;64:616–618.
  • Cover TM, Thomas JA. Elements of information theory. New York, NY, USA: Wiley-Interscience; 1991.
  • Friedman N, Nachman I, Pe'er D. Learning bayesian network structure from massive datasets: The sparse candidate algorithm. Uncertainty in Artificial Intelligence 1999
  • Heckerman D, Chickering DM, Meek C, Rounthwaite R, Kadie CM. Dependency networks for inference, collaborative filtering, and data visualization. Journal of Machine Learning Research. 2000;1:49–75.
  • Hofmann R, Tresp V. NIPS '97: Proceedings of the 1997 conference on Advances in neural information processing systems. Vol. 10. Cambridge, MA, USA: MIT Press; 1998. Nonlinear markov networks for continuous variables; pp. 521–527.
  • Lauritzen SL. Graphical models. New York, USA: Oxford University Press; 1996. (Oxford Statistical Science).
  • Lee SI, Ganapathi V, Koller D. Efficient structure learning of markov networks using l1-regularization. In: Schölkopf B, Platt J, Hoffman T, editors. Advances in neural information processing systems. Vol. 19. Cambridge, MA: MIT Press; 2007. pp. 817–824.
  • Li F, Yang Y. Using modified lasso regression to learn large undirected graphs in a probabilistic framework. AAAI. 2005:801–806.
  • Margolin A, Nemenman I, Basso K, Wiggins C, Stolovitzky G, Favera RD, Califano A. Aracne: An algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics. 2005;(Suppl 1):S7. [PMC free article] [PubMed]
  • Paget RD. Doctoral dissertation. University of Queensland; 1999. Nonparametric markov random field models for natural texture images.
  • Roy S, Werner-Washburne M, Lane T. A system for generating transcription regulatory networks with combinatorial control of transcription. Bioinformatics (Oxford, England) 2008 [PMC free article] [PubMed]
  • Schmidt M, Niculescu-Mizil A, Murphy K. Learning graphical model structure using l1-regularization paths. Twenty second international conference on AI 2007