Proc Int Conf Mach Learn. Author manuscript; available in PMC 2010 May 18.

Published in final edited form as:

Proc Int Conf Mach Learn. 2009; 382: 905–912.

doi: 10.1145/1553374.1553490

PMCID: PMC2872253

NIHMSID: NIHMS112676

Sushmita Roy: sroy@cs.unm.edu; Terran Lane: terran@cs.unm.edu; Margaret Werner-Washburne: maggieww@unm.edu

In many real-world domains, undirected graphical models such as Markov random fields provide a more natural representation of the statistical dependency structure than directed graphical models. Unfortunately, structure learning of undirected graphs using likelihood-based scores remains difficult because of the intractability of computing the partition function. We describe a new Markov random field structure learning algorithm, motivated by canonical parameterization of Abbeel *et al.* We provide computational improvements on their parameterization by learning per-variable canonical factors, which makes our algorithm suitable for domains with hundreds of nodes. We compare our algorithm against several algorithms for learning undirected and directed models on simulated and real datasets from biology. Our algorithm frequently outperforms existing algorithms, producing higher-quality structures, suggesting that enforcing consistency during structure learning is beneficial for learning undirected graphs.

Probabilistic graphical models (PGMs) representing real-world networks capture important structural and functional aspects of the network by describing a joint probability distribution of all node measurements. The structure encodes conditional independence assumptions allowing the joint probability distribution to be tractably computed. When the structure is unknown, likelihood-based structure learning algorithms are employed to infer the structure from observed data.

Likelihood-based structure learning of directed acyclic graphs (DAGs), such as Bayesian networks, is widely used because the likelihood score can be tractably computed for all candidate DAGs. However, in many domains such as biology, causal implication of directed edges is difficult to ascertain without perturbations, leaving only a correlation implication. In such situations, undirected graphical models are a more natural representation of statistical dependencies. Unfortunately, likelihood-based structure learning of these models is much harder due to the intractability of the partition function (Abbeel et al., 2006).

To overcome this issue, researchers have pursued several alternatives: learn graphical Gaussian models, where the likelihood can be computed tractably (Li & Yang, 2005); restrict potentials to lower-order, often pairwise, functions (Margolin et al., 2005; Lee et al., 2007); use pseudolikelihood instead of likelihood as the structure score (Besag, 1977); learn dependency networks (Heckerman et al., 2000; Schmidt et al., 2007); or learn Markov blanket canonical factors (Abbeel et al., 2006). Pairwise models are scalable, but approximate higher-order dependencies by pairwise functions, which is limiting for domains where higher-order dependencies occur commonly. While dependency networks are scalable, each variable neighborhood is estimated independently, resulting in inconsistent structures when the data sample size is small. This is problematic for real-world data, which often lack sufficient samples to guarantee a consistent joint probability distribution for the learned structure. Finally, Markov blanket canonical parameterization requires exhaustive enumeration of variable subsets up to a pre-specified size *l*, which is not scalable for networks with hundreds of nodes.

We have developed a new algorithm for learning undirected graphical models that produces consistent structures and scales to real-world domains. Our algorithm, Markov blanket search (MBS), is inspired by Abbeel *et al.*'s Markov blanket canonical parameterization, which established an equivalence between global canonical potentials and local Markov blanket canonical factors (MBCFs) (Abbeel et al., 2006). We extend Abbeel *et al.*'s result to establish a further equivalence between MBCFs and *per-variable canonical factors*. Because per-variable canonical factors require learning Markov blankets *per variable*, rather than for all subsets up to size *l*, we save *O*(*n^l*) Markov blanket searches, making our algorithm suitable for domains with hundreds of nodes.

We compare our algorithm against two existing algorithms for learning undirected models: Algorithm for the Reconstruction of Accurate Cellular Networks (ARACNE) (Margolin et al., 2005), and a Lasso regression based dependency network algorithm (GGLAS) (Li & Yang, 2005). ARACNE learns pairwise dependencies, whereas GGLAS learns both pairwise and higher-order dependencies. On simulated data from networks of known topology, MBS captures the structure better than ARACNE for the majority of the datasets. Although GGLAS and MBS are often tied in performance, GGLAS's assumption that variable ordering is irrelevant holds only for the Gaussian distribution. MBS uses a more general framework of per-variable canonical factors, which can be used with any conditional probability distribution family.

We also compare MBS to several algorithms for learning DAG structures. MBS not only outperforms the algorithms performing DAG searches, but provides a better pruning of the structure search space than the L1 regularization-based Markov blanket and sparse candidate algorithms (Schmidt et al., 2007; Friedman et al., 1999). This suggests that learning consistent structures during structure search is better than post-processing learned structures to enforce consistency. We finally apply ARACNE, MBS and the sparse candidate algorithm to four real-world microarray data sets. Subgraphs generated from MBS-inferred networks represent more biologically meaningful dependencies than subgraphs from the other algorithms.

To summarize, MBS has the following advantages: (a) it captures both higher-order and pairwise dependencies, (b) it learns consistent structures ensuring the existence of a joint distribution, (c) it provides a tractable implementation of the theoretical framework of Abbeel *et al.* This final property allows MBS to scale to real-world domains with hundreds of nodes.

A Markov random field (MRF) is an undirected, probabilistic graphical model that represents statistical dependencies among a set of random variables (RVs), **X** = {*X*_{1}, …, *X _{n}*}. An MRF consists of a graph and a set of potential functions, one per maximal clique of the graph; the joint distribution is the normalized product of the potentials, $P(\mathbf{X}=\mathbf{x})=\frac{1}{Z}\prod_{c}\psi_{c}(\mathbf{x}_{c})$, where the normalizing constant *Z* is the partition function.

Structure learning of MRFs using likelihood is difficult in general because of *Z* (Abbeel et al., 2006). This is because estimating *Z* requires a sum over exponentially many joint configurations of the RVs, making it intractable for real-world domains. To overcome this problem, researchers have proposed approaches that use pseudolikelihood (Besag, 1977; Heckerman et al., 2000), or have used Markov blanket canonical parameterization (MBCP) (Abbeel et al., 2006). We use an approach similar to MBCP, which requires the estimation of optimal Markov blankets for RV subsets, **Y** ⊆ **X**, |**Y**| ≤ *l*, where *l* is a pre-specified maximum subset size. However, we learn local per-variable factors, requiring estimation of Markov blankets of only individual RVs. Avoiding Markov blanket estimation for all subsets makes our approach scalable to domains with hundreds of nodes.
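The exponential cost of *Z* can be seen directly in a brute-force computation. The sketch below, with an illustrative, hypothetical chain potential, sums over all 2^*n* joint configurations of binary variables:

```python
import itertools

# Brute-force partition function for a toy binary MRF: the sum ranges
# over all 2^n joint configurations, which is what makes likelihood-based
# scoring intractable for large n.
def partition_function(n, potential):
    return sum(potential(x) for x in itertools.product([0, 1], repeat=n))

# Illustrative unnormalized Gibbs weight (hypothetical choice): favor
# agreement between consecutive variables in a chain.
def chain_potential(x):
    w = 1.0
    for a, b in zip(x, x[1:]):
        w *= 2.0 if a == b else 1.0
    return w

Z = partition_function(12, chain_potential)  # already 4096 terms at n = 12
```

Doubling *n* to 24 multiplies the number of terms by 4096, so exact computation quickly becomes infeasible.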

The Hammersley-Clifford theorem establishes a one-to-one relationship between MRFs and strictly positive distributions such as the Gibbs distributions. The *canonical potentials* (Paget, 1999) are used together with the Möbius inversion theorem to prove the Hammersley-Clifford theorem (Lauritzen, 1996). The canonical potential for a subset **D** ⊆ **X** is defined using a *default joint instantiation*, $\overline{\mathbf{x}} = \{\overline{x}_{1}, \ldots, \overline{x}_{|\mathbf{X}|}\}$, of **X** as:

$${\psi}_{\mathbf{D}}^{\ast}\left(\mathbf{D}=\mathbf{d}\right)=\exp\left(\sum_{\mathbf{U}\subseteq \mathbf{D}}(-1)^{|\mathbf{D}\backslash \mathbf{U}|}\log P(\mathbf{X}=\sigma (\mathbf{U},\mathbf{X},\mathbf{d}))\right)$$

where *σ*(**A**, **B**, **c**) is an assignment function that gives each variable *X _{k}* ∈ **B** its value in **c** if *X _{k}* ∈ **A**, and its default value $\overline{x}_{k}$ otherwise.
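As a concrete illustration, with hypothetical variable names and default values, the assignment function can be written as:

```python
# sigma(A, B, c): variables of B that are in A take their values from c;
# all others take their default values x-bar (illustrative defaults).
defaults = {'X1': 0, 'X2': 0, 'X3': 0}

def sigma(A, B, c):
    return {Xk: (c[Xk] if Xk in A else defaults[Xk]) for Xk in B}

# only X1 is in A, so only X1 takes its value from c
assignment = sigma({'X1'}, {'X1', 'X2', 'X3'}, {'X1': 1, 'X2': 1, 'X3': 1})
```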

The joint probability distribution for an MRF using canonical potentials is defined to be:
$P\left(\mathbf{X}=\mathbf{x}\right)=P(\overline{\mathbf{x}})\prod_{\mathbf{D}\in \mathcal{C}}{\psi}_{\mathbf{D}}^{\ast}(\mathbf{D}=\mathbf{d})$, where $\mathcal{C}$ is the set of maximal cliques in the graph. This holds by an application of the Möbius inversion together with the fact that $\log {\psi}_{\mathbf{D}}^{\ast}=0$ (i.e., ${\psi}_{\mathbf{D}}^{\ast}=1$) for every **D** that is not a clique of the graph (Lauritzen, 1996; Paget, 1999).
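These properties can be checked numerically on a toy chain MRF X0 - X1 - X2 with illustrative pairwise potentials (a sketch, not the paper's implementation): canonical potentials of non-clique subsets vanish, and summing log ψ* over all subsets recovers log *P* (non-maximal clique factors simply fold into the maximal-clique potentials):

```python
import itertools
import math

# Toy chain MRF X0 - X1 - X2 over binary variables with arbitrary
# strictly positive pairwise potentials (illustrative values).
f01 = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 0.5, (1, 1): 3.0}
f12 = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 4.0, (1, 1): 0.5}

states = list(itertools.product([0, 1], repeat=3))
w = {s: f01[(s[0], s[1])] * f12[(s[1], s[2])] for s in states}
Z = sum(w.values())
P = {s: w[s] / Z for s in states}

default = (0, 0, 0)  # the default joint instantiation x-bar

def sigma(U, x):
    # variables in U take their values from x, the rest take defaults
    return tuple(x[i] if i in U else default[i] for i in range(3))

def log_psi(D, x):
    # exponent of the canonical potential psi*_D at configuration x
    return sum((-1) ** (len(D) - len(U)) * math.log(P[sigma(set(U), x)])
               for r in range(len(D) + 1) for U in itertools.combinations(D, r))

x = (1, 1, 1)
# {X0, X2} and {X0, X1, X2} are not cliques of the chain, so their
# canonical potentials should be 1 (log = 0)
nonclique_pair = log_psi((0, 2), x)
nonclique_triple = log_psi((0, 1, 2), x)
# Mobius inversion: log P(x) equals the sum of log psi*_D over all subsets D
recon = sum(log_psi(D, x) for r in range(4) for D in itertools.combinations(range(3), r))
```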

The computation of the canonical potentials is not feasible for real-world domains as they require the estimation of the full joint distribution (Abbeel et al., 2006). Markov Blanket canonical parameterization, developed by Abbeel *et al.*, allows the computation of global canonical potentials over **X**, using local conditional functions called *Markov blanket canonical factors* (MBCFs).

The MBCF for a set **D** ⊆ **X** is estimated using **D** and its Markov blanket (MB). The MB, **M**_{i}, of a variable *X _{i}* is the set of immediate neighbors of *X _{i}* in the graph; given **M**_{i}, *X _{i}* is independent of all remaining variables. The MB of a set, **M**_{**D**}, is the union of the MBs of the variables in **D**, excluding **D** itself. The MBCF is defined as:

$${\stackrel{\sim}{\psi}}_{\mathbf{D}}(\mathbf{D}=\mathbf{d})=\exp\left(\sum_{\mathbf{U}\subseteq \mathbf{D}}(-1)^{|\mathbf{D}\backslash \mathbf{U}|}\log P(\mathbf{D}=\sigma (\mathbf{U},\mathbf{D},\mathbf{d})\mid {\mathbf{M}}_{\mathbf{D}}=\sigma (\mathbf{U},{\mathbf{M}}_{\mathbf{D}},\mathbf{d}))\right),$$

(1)

For MRFs of unknown structure, MBCFs are identified by searching exhaustively among all subsets **F**_{i} ⊆ **X**, up to size *l*, and finding MBs for each **F**_{i}. Unfortunately, exhaustive enumeration of variable subsets becomes impractical for moderately sized networks (Abbeel et al., 2006). We show that the MBCFs can be further reduced to smaller per-variable canonical factors, which are computed using an RV and its Markov blanket.

We now show that the MBCFs can be replaced by smaller, local functions: *per-variable MB canonical factors*, which do not require enumeration of all subsets up to size *l*. Specifically, for every MB canonical factor there exists an equivalent per-variable canonical factor *ψ*^{+}. To illustrate how the per-variable factors are derived from MBCFs, we first consider the specific case of **D** = {*X _{i}*, *X _{j}*}, and then generalize to arbitrary **D**.

Let **D** = {*X _{i}*, *X _{j}*}, and let ${\overline{\mathbf{m}}}_{\mathbf{d}}$ denote the default instantiation of its Markov blanket, **M**_{**D**}. The sum in the exponent of Eq 1 expands as:

$$\begin{aligned}
&\sum_{\mathbf{U}\subseteq \mathbf{D}}(-1)^{|\mathbf{D}\backslash \mathbf{U}|}\log P(\{{X}_{i},{X}_{j}\}=\sigma (\mathbf{U},\{{X}_{i},{X}_{j}\},\mathbf{d})\mid {\mathbf{M}}_{\mathbf{D}}={\overline{\mathbf{m}}}_{\mathbf{d}})\\
&\quad ={(-1)}^{|\{{X}_{i},{X}_{j}\}|}\log P({X}_{i}={\overline{x}}_{i},{X}_{j}={\overline{x}}_{j}\mid {\mathbf{M}}_{\mathbf{D}}={\overline{\mathbf{m}}}_{\mathbf{d}})\\
&\qquad +{(-1)}^{|\{{X}_{j}\}|}\log P({X}_{i}={x}_{i},{X}_{j}={\overline{x}}_{j}\mid {\mathbf{M}}_{\mathbf{D}}={\overline{\mathbf{m}}}_{\mathbf{d}})\\
&\qquad +{(-1)}^{|\{{X}_{i}\}|}\log P({X}_{i}={\overline{x}}_{i},{X}_{j}={x}_{j}\mid {\mathbf{M}}_{\mathbf{D}}={\overline{\mathbf{m}}}_{\mathbf{d}})\\
&\qquad +{(-1)}^{|\varnothing |}\log P({X}_{i}={x}_{i},{X}_{j}={x}_{j}\mid {\mathbf{M}}_{\mathbf{D}}={\overline{\mathbf{m}}}_{\mathbf{d}})
\end{aligned}$$

(2)

where the first term corresponds to *U* = Ø, the second term corresponds to *U* = {*X _{i}*} and so on. Applying the chain rule to every term in the RHS:

$$\begin{array}{r}=\text{log}[P({X}_{i}={\overline{x}}_{i}\mid {X}_{j}={\overline{x}}_{j},{\mathbf{\text{M}}}_{\mathbf{\text{D}}}={\overline{\mathbf{\text{m}}}}_{\mathbf{\text{d}}})\phantom{\rule{0.2em}{0ex}}P({X}_{j}={\overline{x}}_{j}\mid {\mathbf{\text{M}}}_{\mathbf{\text{D}}}={\overline{\mathbf{\text{m}}}}_{\mathbf{\text{d}}})]\\ -\text{log}[P({X}_{i}={x}_{i}\mid {X}_{j}={\overline{x}}_{j},{\mathbf{\text{M}}}_{\mathbf{\text{D}}}={\overline{\mathbf{\text{m}}}}_{\mathbf{\text{d}}})\phantom{\rule{0.2em}{0ex}}P({X}_{j}={\overline{x}}_{j}\mid {\mathbf{\text{M}}}_{\mathbf{\text{D}}}={\overline{\mathbf{\text{m}}}}_{\mathbf{\text{d}}})]\\ -\text{log}[P({X}_{i}={\overline{x}}_{i}\mid {X}_{j}={x}_{j},{\mathbf{\text{M}}}_{\mathbf{\text{D}}}={\overline{\mathbf{\text{m}}}}_{\mathbf{\text{d}}})\phantom{\rule{0.2em}{0ex}}P({X}_{j}={x}_{j}\mid {\mathbf{\text{M}}}_{\mathbf{\text{D}}}={\overline{\mathbf{\text{m}}}}_{\mathbf{\text{d}}})]\\ +\text{log}[P({X}_{i}={x}_{i}\mid {X}_{j}={x}_{j},{\mathbf{\text{M}}}_{\mathbf{\text{D}}}={\overline{\mathbf{\text{m}}}}_{\mathbf{\text{d}}})\phantom{\rule{0.2em}{0ex}}P({X}_{j}={x}_{j}\mid {\mathbf{\text{M}}}_{\mathbf{\text{D}}}={\overline{\mathbf{\text{m}}}}_{\mathbf{\text{d}}})]\end{array}$$

We find that all $\log P({X}_{j}=\cdot \mid {\mathbf{M}}_{\mathbf{D}}={\overline{\mathbf{m}}}_{\mathbf{d}})$ terms cancel in pairs, leaving:

$$\begin{aligned}
&=\log P({X}_{i}={\overline{x}}_{i}\mid {X}_{j}={\overline{x}}_{j},{\mathbf{M}}_{\mathbf{D}}={\overline{\mathbf{m}}}_{\mathbf{d}})\\
&\quad -\log P({X}_{i}={x}_{i}\mid {X}_{j}={\overline{x}}_{j},{\mathbf{M}}_{\mathbf{D}}={\overline{\mathbf{m}}}_{\mathbf{d}})\\
&\quad -\log P({X}_{i}={\overline{x}}_{i}\mid {X}_{j}={x}_{j},{\mathbf{M}}_{\mathbf{D}}={\overline{\mathbf{m}}}_{\mathbf{d}})\\
&\quad +\log P({X}_{i}={x}_{i}\mid {X}_{j}={x}_{j},{\mathbf{M}}_{\mathbf{D}}={\overline{\mathbf{m}}}_{\mathbf{d}})
\end{aligned}$$

(3)
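The cancellation can be verified numerically with an arbitrary strictly positive conditional distribution (a sketch; the probability table is illustrative, not from the paper):

```python
import math
import random

random.seed(0)
# Arbitrary strictly positive joint P(Xi, Xj | M_D = m-bar)
raw = {(a, b): random.uniform(0.1, 1.0) for a in (0, 1) for b in (0, 1)}
tot = sum(raw.values())
Pij = {k: v / tot for k, v in raw.items()}            # P(Xi=a, Xj=b | m-bar)
Pj = {b: Pij[(0, b)] + Pij[(1, b)] for b in (0, 1)}   # P(Xj=b | m-bar)

xb_i, xb_j = 0, 0   # default values; the query configuration is x_i = x_j = 1
xi, xj = 1, 1

# Eq 2: alternating sum over joint conditionals
eq2 = (math.log(Pij[(xb_i, xb_j)]) - math.log(Pij[(xi, xb_j)])
       - math.log(Pij[(xb_i, xj)]) + math.log(Pij[(xi, xj)]))

# Eq 3: the P(Xj | m-bar) factors cancel, leaving only P(Xi | Xj, m-bar)
cond = lambda a, b: Pij[(a, b)] / Pj[b]
eq3 = (math.log(cond(xb_i, xb_j)) - math.log(cond(xi, xb_j))
       - math.log(cond(xb_i, xj)) + math.log(cond(xi, xj)))
```

Because each chain-rule term contributes $\log P(X_j \mid \overline{\mathbf{m}}_{\mathbf{d}})$ once with a plus and once with a minus sign, the two sums agree for any strictly positive table.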

This allows ${\stackrel{\sim}{\psi}}_{\mathbf{D}}$ to be rewritten as:

$${\stackrel{\sim}{\psi}}_{\mathbf{D}}(\mathbf{D}=\mathbf{d})=\exp\left(\sum_{\mathbf{U}\subseteq \mathbf{D}}(-1)^{|\mathbf{D}\backslash \mathbf{U}|}\log P({X}_{i}=\sigma (\mathbf{U},\{{X}_{i}\},\mathbf{d})\mid \{{X}_{j}\}\cup {\mathbf{M}}_{\mathbf{D}}=\sigma (\mathbf{U},\{{X}_{j}\}\cup {\mathbf{M}}_{\mathbf{D}},\mathbf{d}))\right)$$

(4)

We assert further independence in Eq 4: given its Markov blanket **M**_{i}, *X _{i}* is independent of all variables outside **M**_{i}. Since **M**_{i} ⊆ {*X _{j}*} ∪ **M**_{**D**}, the conditioning set reduces to **M**_{i}, yielding:

$${\psi}_{\mathbf{D}}^{+}(\mathbf{D}=\mathbf{d})=\exp\left(\sum_{\mathbf{U}\subseteq \mathbf{D}}(-1)^{|\mathbf{D}\backslash \mathbf{U}|}\log P({X}_{i}=\sigma (\mathbf{U},\{{X}_{i}\},\mathbf{d})\mid {\mathbf{M}}_{i}=\sigma (\mathbf{U},{\mathbf{M}}_{i},\mathbf{d}))\right)$$

(5)

Thus we have derived the equivalent per-variable canonical factor, *ψ*^{+}, from the original MBCF in Eq 1.

We now state the equivalence between per-variable and MB canonical factors more formally:

**Theorem 2.1** Every MBCF, ${\stackrel{\sim}{\psi}}_{\mathbf{D}}$, of the form in Eq 1 possesses an equivalent per-variable factor,
${\psi}_{\mathbf{D}}^{+}(\mathbf{D}=\mathbf{d})=\mathit{\text{exp}}\left(\sum_{\mathbf{U}\subseteq \mathbf{D}}(-1)^{|\mathbf{D}\backslash \mathbf{U}|}\mathit{\text{log}}P({X}_{i}=\sigma (\mathbf{U},\{{X}_{i}\},\mathbf{d})\mid {\mathbf{M}}_{i}=\sigma (\mathbf{U},{\mathbf{M}}_{i},\mathbf{d}))\right)$, where *X*_{i} ∈ **D**.

**Proof** The proof of this equivalence involves two steps: (a) deriving *ψ*^{+} from ${\stackrel{\sim}{\psi}}_{\mathbf{D}}$ for any general **D**, and (b) identifying the neighbors of an RV and making the independence assertions described by the graph structure.

To prove (a) we select an arbitrary *X _{i}* ∈ **D** and apply the chain rule to each term of Eq 1, conditioning *X _{i}* on the remaining variables of **D** and on **M**_{**D**}. As in the two-variable case above, all terms that do not involve *X _{i}* cancel in the alternating sum.

The final step is to identify the neighbors of *X _{i}* and, using the local Markov property, reduce the conditioning set to the MB of *X _{i}*, **M**_{i}, which yields the per-variable factor of Eq 5.

The equivalence of the per-variable factors and MBCFs implies that, instead of searching over all size *l* subsets of **X**, we can estimate canonical factors by searching for MBs of individual RVs. Assuming that the MBs are estimated correctly, Eq 5 will produce the same canonical factors as Eq 1. Our structure learning algorithm therefore requires the estimation of MBs of each RV. We only need to ensure structural consistency (Section 2.4). Searching for *n* MBs, as opposed to *O*(*n^l*) MBs in MBCP, saves us *O*(*n^l*) Markov blanket searches.
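The magnitude of this saving is easy to tabulate for illustrative values of *n* and *l*:

```python
from math import comb

# Markov blanket searches needed: one per variable in MBS, versus one per
# subset of size <= l in MBCP (illustrative n and l, not from the paper).
n, l = 200, 3
mbcp_searches = sum(comb(n, s) for s in range(1, l + 1))  # C(n,1)+C(n,2)+C(n,3)
mbs_searches = n
```

Even at these modest sizes, per-variable search replaces over a million MB estimations with two hundred.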

The per-variable canonical factors and MBCFs do not deny the hardness of computing likelihood in MRFs (Abbeel et al., 2006), because computing $P(\mathbf{X}=\overline{\mathbf{x}})$ is equivalent to computing $\frac{1}{Z}$.

**Algorithm 1** Markov blanket search (MBS)

Input:
&nbsp;&nbsp;Random variable set, **X** = {*X*_{1}, …, *X*_{|**X**|}}
&nbsp;&nbsp;Maximum neighborhood sizes, *k*_{max} and *k*_{hard}
Output:
&nbsp;&nbsp;Inferred graph structure

for *k* = 1; *k* ≤ *k*_{max}; *k*++ do
&nbsp;&nbsp;for *X*_{i} ∈ **X** do {Add stage}
&nbsp;&nbsp;&nbsp;&nbsp;Find best new MB variable *X*_{j} that maximizes Δ*S*_{ij}, s.t. |**M̂**_{i}| ≤ *k* (Eq 6)
&nbsp;&nbsp;end for
&nbsp;&nbsp;for *X*_{i} ∈ **X** do {Swap stage}
&nbsp;&nbsp;&nbsp;&nbsp;for *X*_{j} ∈ **M̂**_{i}^{k} do
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;for *X*_{q} ∈ **X** \ (**M̂**_{i}^{k} ∪ {*X*_{i}}) with |**M̂**_{q}^{k}| ≤ *k*_{hard} and *X*_{q} ∉ tabulist(*X*_{i}) do
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Delete {*X*_{i}, *X*_{j}}, add {*X*_{i}, *X*_{q}}, and add *X*_{j} to tabulist(*X*_{i}) if replacing *X*_{j} with *X*_{q} gives the maximal score improvement.
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;end for
&nbsp;&nbsp;&nbsp;&nbsp;end for
&nbsp;&nbsp;end for
end for

The MBS algorithm learns the structure of an MRF by finding the best neighborhood, or Markov blanket (MB), for each RV. To identify the best MB, we need to optimize a *score*, *S*(*X _{i}* | **M**_{i}), measuring how well a candidate MB, **M**_{i}, predicts *X _{i}*.

Dependency networks and MBCP identify the best MB per RV by optimizing *S*(*X _{i}* | **M**_{i}) for each RV independently. Because each neighborhood is estimated in isolation, the resulting structures can be inconsistent: *X _{j}* may appear in the MB of *X _{i}* without *X _{i}* appearing in the MB of *X _{j}*.

We propose a different approach that finds consistent MBs during the search process. To find consistent MBs, we score a candidate edge {*X _{i}*, *X _{j}*} not only by the improvement in *S*(*X _{i}* | **M**_{i}) from adding *X _{j}*, but also by the improvement in *S*(*X _{j}* | **M**_{j}) from adding *X _{i}*:

$${\mathrm{\Delta}S}_{ij}=S({X}_{i}\mid {{\widehat{\mathbf{\text{M}}}}_{i}}^{k-1})-S({X}_{i}\mid {{\widehat{\mathbf{\text{M}}}}_{i}}^{k-1}\cup \{{X}_{j}\})+S({X}_{j}\mid {{\widehat{\mathbf{\text{M}}}}_{j}}^{k-1})-S({X}_{j}\mid {{\widehat{\mathbf{\text{M}}}}_{j}}^{k-1}\cup \{{X}_{i}\}).$$

(6)

Our approach is similar to Hofmann & Tresp's edge-based score for guaranteeing consistency (Hofmann & Tresp, 1998). However, their search strategy starts from a fully connected network and removes edges, whereas we add and replace edges starting with a completely disconnected network. For real-world domains, growing larger neighborhoods from smaller neighborhoods is more feasible than shrinking large neighborhoods, because we may not have enough data for reliably learning large neighborhoods.

The MBS structure learning algorithm uses Eq 6 to greedily identify the best MB for each variable (Algo. 1). Each iteration uses a combination of *add* and *swap* operations to learn the best structure. In the add stage of the *k*^{th} iteration, we make one-variable extensions to the current MB, ${\widehat{\mathbf{M}}_{i}}^{k-1}$, of each variable *X _{i}*, adding the variable *X _{j}* that maximizes Δ*S*_{ij}.

In the swap stage, we revisit all variables *X _{j}* in each current MB, ${\widehat{\mathbf{M}}_{i}}^{k}$, and replace *X _{j}* with the variable *X _{q}* outside the MB that gives the maximal score improvement, if any. A tabu list prevents a swapped-out variable from being immediately re-added.
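The add stage can be sketched as follows. This is a minimal sketch, not the paper's implementation: it assumes a residual-sum-of-squares stand-in for *S*(*X _{i}* | **M**_{i}), a hypothetical penalty threshold `lam`, and omits the swap stage and tabu list:

```python
import numpy as np

rng = np.random.default_rng(1)

def rss(y, X):
    # residual sum of squares of a linear (conditional Gaussian) fit;
    # stands in for the score S(Xi | Mi)
    if X.shape[1] == 0:
        return float(((y - y.mean()) ** 2).sum())
    A = np.column_stack([X, np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(((y - A @ coef) ** 2).sum())

def add_stage(data, MB, k, lam):
    # one-variable extensions: add the edge {Xi, Xj} with the best
    # symmetric improvement Delta S_ij (Eq 6), if it exceeds lam
    n_vars = data.shape[1]
    for i in range(n_vars):
        best_j, best_gain = None, lam
        for j in range(n_vars):
            if j == i or j in MB[i] or len(MB[i]) >= k or len(MB[j]) >= k:
                continue
            Mi, Mj = sorted(MB[i]), sorted(MB[j])
            gain = (rss(data[:, i], data[:, Mi]) - rss(data[:, i], data[:, Mi + [j]])
                    + rss(data[:, j], data[:, Mj]) - rss(data[:, j], data[:, Mj + [i]]))
            if gain > best_gain:
                best_j, best_gain = j, gain
        if best_j is not None:
            MB[i].add(best_j)
            MB[best_j].add(i)  # symmetric update keeps the structure consistent
    return MB

# toy data: chain X0 -> X1 -> X2 plus an independent X3
m = 1000
x0 = rng.normal(size=m)
x1 = x0 + 0.05 * rng.normal(size=m)
x2 = x1 + 0.5 * rng.normal(size=m)
data = np.column_stack([x0, x1, x2, rng.normal(size=m)])
MB = add_stage(data, {i: set() for i in range(4)}, k=2, lam=50.0)
```

Because every accepted edge updates both endpoints' MBs, the learned neighborhoods are consistent by construction; on the toy chain the recovered structure is X0 - X1 - X2 with X3 isolated.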

The per-variable canonical factor equivalence exploited by MBS to identify the MRF structure does not make any specific assumptions about the parametric form of the conditional probability distributions. MBS only requires that the candidate MBs be scored using the conditional probability distributions. So MBS can potentially be instantiated with any probability distribution and choice of score. For empirical evaluation of our framework, we selected *P*(*X _{i}* | **M**_{i}) to be a conditional Gaussian distribution, scored by a penalized likelihood whose sparsity penalty is controlled by the parameter *λ*.

We compared our Markov Blanket Search (MBS) algorithm against existing algorithms for undirected and directed graphs on both simulated and real data. Our test data come from biology. The simulated data are from networks of known structure, enabling a direct validation of the inferred structures. The real data are from microarray experiments. However, as the true network for the real data is not known, we use the biological literature to validate the inferred structures. This test framework is common in bioinformatics (Margolin et al., 2005; Li & Yang, 2005).

We compared MBS to two undirected algorithms: Algorithm for the Reconstruction of Accurate Cellular Networks (ARACNE) (Margolin et al., 2005), and a Lasso regression-based Graphical Gaussian model (GGLAS) (Li & Yang, 2005). We also compared MBS against several directed models provided in the DAGLearn software^{2}: full DAG search (FULLDAG), LARs based order search (ORDLAS), DAG search using Sparse candidate for pruning (SPCAND), and DAG search using L1 regularization based Markov blanket estimation (L1MB) (Schmidt et al., 2007). Because L1MB does not learn consistent Markov blankets, a post-processing step is required to make the structures consistent. The AND post-processing removes *X _{i}* from **M**_{j} unless *X _{j}* is also in **M**_{i}, whereas the OR post-processing adds *X _{i}* to **M**_{j} whenever *X _{j}* is in **M**_{i}.

The simulated datasets were generated by a gene regulatory network simulator using differential equations for describing gene and protein expression dynamics (Roy et al., 2008). We generated four datasets: G50, G75, ECOLI1 and ECOLI2 with *n* = 100, 150, 188 and 188 nodes, respectively. Each sample consists of steady-state expressions reached after perturbing the kinetic constants of the genes. Networks for G50 and G75 were generated *de novo* by the simulator. The network for both ECOLI1 and ECOLI2 belongs to the bacterium *E. coli*. In ECOLI2 only a subset of the genes (regulators) are perturbed, whereas in ECOLI1, G50 and G75 all genes are perturbed.

As the true network topologies for these data are known, we compared the algorithms using the match between the inferred and true network structures. Because we are interested in higher-order dependencies, we matched subgraphs rather than individual edges. Briefly, we extracted subgraphs of different types (e.g. cycles, neighborhoods) from the true network and used an F-score measure to match the vertex neighborhood and edge set per subgraph. We refer to the scores for vertex neighborhoods as *V-scores* and for edge sets as *E-scores*. We use shortest path neighborhoods (SPN), *r*-radius neighborhoods comprising a vertex and its neighbors ≤ *r* steps away (*r* ∈ {1, 2}, denoted by 1N and 2N), and cycles of size *r* (*r* ∈ {3, 4}, denoted by 3C and 4C). ECOLI1 and ECOLI2 did not have any cycles. We moralize the inferred DAGs prior to comparison.
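An edge-set F-score between a true and an inferred subgraph can be sketched as follows (treating edges as unordered pairs; the paper's exact matching procedure may differ):

```python
# Hypothetical edge-set F-score: harmonic mean of precision and recall
# over undirected edges of a true vs. an inferred subgraph.
def edge_fscore(true_edges, inferred_edges):
    true_set = {frozenset(e) for e in true_edges}
    inferred_set = {frozenset(e) for e in inferred_edges}
    tp = len(true_set & inferred_set)
    if tp == 0:
        return 0.0
    precision = tp / len(inferred_set)
    recall = tp / len(true_set)
    return 2 * precision * recall / (precision + recall)

# 2 of 3 true edges recovered, 1 spurious edge inferred
f = edge_fscore([(0, 1), (1, 2), (2, 3)], [(1, 0), (1, 2), (3, 4)])
```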

Our comparison used E and V-scores averaged over four runs per algorithm corresponding to different settings of an algorithm-specific parameter. This parameter is *λ* in MBS (Section 2.4), the data processing inequality *d* in ARACNE, and the hyper-prior parameter for the variance in GGLAS. For all DAG searches other than ORDMC, we used different random restart probabilities to generate different candidate graphs. In our experiments, *λ* ∈ {1*e* − 5, 3*e* − 5, 5*e* − 5, 7*e* − 5} and *d* ∈ {0, 0.3, 0.5, 0.7}. All simulated experiments used 1 ≤ *k _{max}* ≤ 11. For ORDMC, we varied the edge posterior probability. For each parameter setting, the graph with the highest average of E and V scores is used. We compare the best graph per algorithm across different parameter settings.

We show results on two of the four datasets, G50 and ECOLI1 (Table 1). Our complete results are summarized in Table 2. For all datasets other than ECOLI1, MBS significantly beats all algorithms at least as often as it is beaten (Student's t-test, p-value ≤ 0.05). On ECOLI1, ARACNE outperforms all algorithms, suggesting that ECOLI1 likely does not contain many higher-order dependencies. There is no significant difference between MBS and ARACNE on ECOLI2, which is generated from the same network as ECOLI1 using different perturbations.

Algorithm comparison on two datasets. Rows give different structure scores; columns are different structure learning algorithms; each entry is an E or V score. **Bold** and *italics* indicate that MBS performs significantly **better** or *worse* than the algorithm.

Number of times MBS statistically significantly loses to/beats another algorithm. Rows are for different datasets. GGLAS did not complete on ECOLI2.

We find that the performance margin between MBS and the DAG-learning algorithms is greater than that between MBS and the undirected algorithms, suggesting undirected graphs may be better representations for this domain. Overall, MBS does a better job of learning the network structures than both directed and undirected algorithms for the majority of the datasets.

To assess the value of enforcing consistency during learning, rather than as a post-processing step, we used the MBS-learned Markov blankets as family constraints in DAG search algorithms. We compared the DAG structures constrained using MBS Markov blankets against those constrained by Sparse candidate (SPCAND) and L1 MB regularization (L1MB). L1MB uses either an OR or AND of the Markov blankets to generate consistent Markov blankets.

We used the maximum size of L1MB AND and OR Markov blankets as the neighborhood size, *k*, for MBS and SPCAND. We first compared L1MB with OR post-processing (MBOR) using *k* = 11 for both G50 and G75 (Table 3). We found the DAGs constrained by MBS-learned Markov blankets to significantly outperform the SPCAND- and L1MB-constrained DAGs more often than being outperformed. Using L1MB AND (*k* = 4, 6 for G50 and G75, respectively), MBS outperformed SPCAND and L1MB by a greater margin. This indicates that enforcing consistency during structure learning, rather than as a post-processing step, produces higher-quality Markov blankets.

We compared MBS against ARACNE and SPCAND on real-world biological data. GGLAS did not complete within 48 hrs on this data, so is omitted. Each dataset measures the gene expression response of two different populations of yeast cells, *Quiescent* (Q) and *Non-quiescent* (NQ), to genetic perturbations (Aragon et al., 2008). Each dataset had a biological replicate, resulting in four datasets: Q1, Q2, NQ1 and NQ2. We pre-processed these data to include *n* = 1808 genes with < 80% missing data. We used MBS with *λ* = 10^{−4}, *k _{max}* = 4, ARACNE with dpi = 0.3 and SPCAND with 4 parents. A relatively large value of *λ*, compared to the simulated-data experiments, was used to encourage sparse neighborhoods on these larger, noisier data.

As the true network is not known, we used statistical enrichment of subgraphs generated from the inferred networks, in biological processes from Gene ontology (GO), to assess the quality of the networks. For each inferred network, we generated neighborhood subgraphs of radius *r* = 1. For each subgraph *g* and term *t*, we computed a hyper-geometric *p*-value, assessing the statistical significance of observing *g*'s genes to be annotated with *t*. The lower the *p*-value the better is the statistical enrichment. We first computed the number of GO terms MBS had better (lower *p*-value) or worse enrichment compared to other algorithms (Table 4). We found that MBS had more terms with better enrichment in all except one case (ARACNE, NQ2). This suggests that networks identified via MBS capture more significant biological dependencies.
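The hyper-geometric enrichment *p*-value can be computed exactly with a one-sided tail sum (a sketch; the gene counts below are illustrative, not from the paper):

```python
from math import comb

# Hypergeometric tail p-value for GO-term enrichment: probability of seeing
# at least x annotated genes in a subgraph of n genes, drawn from N genes
# of which K carry the term.
def hypergeom_pvalue(N, K, n, x):
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(x, min(K, n) + 1)) / comb(N, n)

# illustrative counts: 5 of 20 subgraph genes carry a term held by 40 of 1808
p = hypergeom_pvalue(N=1808, K=40, n=20, x=5)
```

The lower the *p*-value, the less likely the overlap arose by chance, and hence the stronger the enrichment.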

We also compared the algorithms using two other measures (Fig 1). *Enrichment sensitivity* is the ratio of the min(no. of subgraphs, no. of enriched terms) to the total number of subgraphs. *Enrichment locality* is the correlation between average *p*-value of a term and the number of subgraphs enriched in that term. A positive correlation suggests that terms with higher *p*-values (less enriched) are associated with many subgraphs, whereas terms with lower *p*-values (more enriched) are associated with fewer subgraphs. Ideally, an algorithm should identify good enrichment for the majority of subgraphs (high sensitivity), and also associate highly enriched terms with a few subgraphs (high locality).

We used different stringency levels of enrichment (*p*-value ∈ {10^{−3}, 10^{−4}, 10^{−5}, 10^{−6}}) to assess the enrichment sensitivity and locality of the algorithms on all four datasets (Fig. 3.2 shows the sensitivity and locality for *p*-value < 10^{−3}). At *p*-value < 10^{−3} and < 10^{−4}, ARACNE and SPCAND had significantly higher sensitivity, but significantly lower locality than MBS (Wilcoxon rank sum, *p* < 0.05). However, there was no statistical difference at higher stringency of enrichment. These results suggest that there is a trade-off between the algorithms on biological data. MBS identifies subgraphs that are locally coherent at the cost of having fewer subgraphs that are enriched in a term. On the other hand, ARACNE and SPCAND identify more subgraphs with enrichment, but may overly fragment coherent gene groups. Finally, there is no significant difference between algorithms at higher stringency, suggesting that the algorithms agree on the GO terms that are the most significant.

We have described a new algorithm for inferring undirected graphs that yields structurally consistent graphs, guaranteeing a joint probability distribution for the RVs. We compared our algorithm to several algorithms for learning undirected and directed models. On simulated data, we show that enforcing consistency during structure learning more accurately captures the graph structure. Our approach also produces higher-quality Markov blankets that, when used to prune DAG searches, yield better structures. On real data, MBS identifies more significant ontology terms associated with functionally coherent gene groups.

This work is supported by grants from NIMH (1R01MH076282-03) and NSF (IIS-0705681) to T.L., from NIH (GM-67593) and NSF (MCB0734918) to M.W.W. and from HHMI-NIH/NIBIB (56005678).

^{1}In MBCP estimation, MBs of variable sets are identified independently. MBCP requires an additional subset consistency check: if **X** ⊆ **Y**, then **M**_{**X**} ⊆ (**Y** ∪ **M**_{**Y**}) \ **X**.

^{2}http://www.cs.ubc.ca/~murphyk/Software/DAGlearn/

^{3}http://www.bioss.ac.uk/staff/.adriano/comparison/comparison.html

- Abbeel P, Koller D, Ng AY. Learning factor graphs in polynomial time and sample complexity. JMLR. 2006;7:1743–1788.
- Aragon AD, Rodriguez AL, Meirelles O, Roy S, Davidson GS, Tapia PH, Allen C, Joe R, Benn D, Werner-Washburne M. Characterization of differentiated quiescent and nonquiescent cells in yeast stationary-phase cultures. Mol Biol Cell. 2008.
- Besag J. Efficiency of pseudolikelihood estimation for simple Gaussian fields. Biometrika. 1977;64:616–618.
- Cover TM, Thomas JA. Elements of information theory. New York, NY, USA: Wiley-Interscience; 1991.
- Friedman N, Nachman I, Pe'er D. Learning Bayesian network structure from massive datasets: the sparse candidate algorithm. Uncertainty in Artificial Intelligence. 1999.
- Heckerman D, Chickering DM, Meek C, Rounthwaite R, Kadie CM. Dependency networks for inference, collaborative filtering, and data visualization. Journal of Machine Learning Research. 2000;1:49–75.
- Hofmann R, Tresp V. Nonlinear Markov networks for continuous variables. NIPS '97: Proceedings of the 1997 conference on Advances in neural information processing systems. Vol. 10. Cambridge, MA, USA: MIT Press; 1998. pp. 521–527.
- Lauritzen SL. Graphical models. New York, USA: Oxford University Press; 1996. (Oxford Statistical Science).
- Lee SI, Ganapathi V, Koller D. Efficient structure learning of Markov networks using L1-regularization. In: Schölkopf B, Platt J, Hoffman T, editors. Advances in neural information processing systems. Vol. 19. Cambridge, MA: MIT Press; 2007. pp. 817–824.
- Li F, Yang Y. Using modified lasso regression to learn large undirected graphs in a probabilistic framework. AAAI. 2005:801–806.
- Margolin A, Nemenman I, Basso K, Wiggins C, Stolovitzky G, Favera RD, Califano A. ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics. 2005;(Suppl 1):S7.
- Paget RD. Nonparametric Markov random field models for natural texture images. Doctoral dissertation, University of Queensland; 1999.
- Roy S, Werner-Washburne M, Lane T. A system for generating transcription regulatory networks with combinatorial control of transcription. Bioinformatics (Oxford, England). 2008.
- Schmidt M, Niculescu-Mizil A, Murphy K. Learning graphical model structure using L1-regularization paths. Twenty-second international conference on AI. 2007.
