PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of nihpaAbout Author manuscriptsSubmit a manuscriptHHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
 
J Artif Intell Res. Author manuscript; available in PMC 2010 August 24.
Published in final edited form as:
J Artif Intell Res. 2010 January 1; 37: 279–328.
doi:  10.1613/jair.2842
PMCID: PMC2926991
NIHMSID: NIHMS220551

Join-Graph Propagation Algorithms

Abstract

The paper investigates parameterized approximate message-passing schemes that are based on bounded inference and are inspired by Pearl's belief propagation algorithm (BP). We start with the bounded inference mini-clustering algorithm and then move to the iterative scheme called Iterative Join-Graph Propagation (IJGP), that combines both iteration and bounded inference. Algorithm IJGP belongs to the class of Generalized Belief Propagation algorithms, a framework that allowed connections with approximate algorithms from statistical physics and is shown empirically to surpass the performance of mini-clustering and belief propagation, as well as a number of other state-of-the-art algorithms on several classes of networks. We also provide insight into the accuracy of iterative BP and IJGP by relating these algorithms to well known classes of constraint propagation schemes.

1. Introduction

Probabilistic inference is the principal task in Bayesian networks and is known to be an NP-hard problem (Cooper, 1990; Roth, 1996). Most of the commonly used exact algorithms such as join-tree clustering (Lauritzen & Spiegelhalter, 1988; Jensen, Lauritzen, & Olesen, 1990) or variable-elimination (Dechter, 1996, 1999; Zhang, Qi, & Poole, 1994), and more recently search schemes (Darwiche, 2001; Bacchus, Dalmao, & Pitassi, 2003; Dechter & Mateescu, 2007) exploit the network structure. While significant advances were made in the last decade in exact algorithms, many real-life problems are too big and too hard, especially when their structure is dense, since they are time and space exponential in the treewidth of the graph. Approximate algorithms are therefore necessary for many practical problems, although approximation within given error bounds is also NP-hard (Dagum & Luby, 1993; Roth, 1996).

The paper focuses on two classes of approximation algorithms for the task of belief updating. Both are inspired by Pearl's belief propagation algorithm (Pearl, 1988), which is known to be exact for trees. As a distributed algorithm, Pearl's belief propagation can also be applied iteratively to networks that contain cycles, yielding Iterative Belief Propagation (IBP), also known as loopy belief propagation. When the networks contain cycles, IBP is no longer guaranteed to be exact, but in many cases it provides very good approximations upon convergence. Some notable success cases are those of IBP for coding networks (McEliece, MacKay, & Cheng, 1998; McEliece & Yildirim, 2002), and a version of IBP called survey propagation for some classes of satisfiability problems (Mézard, Parisi, & Zecchina, 2002; Braunstein, Mézard, & Zecchina, 2005).

Although the performance of belief propagation is far from being well understood in general, one of the more promising avenues towards characterizing its behavior came from analogies with statistical physics. It was shown by Yedidia, Freeman, and Weiss (2000, 2001) that belief propagation can only converge to a stationary point of an approximate free energy of the system, called Bethe free energy. Moreover, the Bethe approximation is computed over pairs of variables as terms, and is therefore the simplest version of the more general Kikuchi (1951) cluster variational method, which is computed over clusters of variables. This observation inspired the class of Generalized Belief Propagation (GBP) algorithms, that work by passing messages between clusters of variables. As mentioned by Yedidia et al. (2000), there are many GBP algorithms that correspond to the same Kikuchi approximation. A version based on region graphs, called “canonical” by the authors, was presented by Yedidia et al. (2000, 2001, 2005). Our algorithm Iterative Join-Graph Propagation is a member of the GBP class, although it will not be described in the language of region graphs. Our approach is very similar to and was independently developed from that of McEliece and Yildirim (2002). For more information on BP state of the art research see the recent survey by Koller (2010).

We will first present the mini-clustering scheme which is an anytime bounded inference scheme that generalizes the mini-bucket idea. It can be viewed as a belief propagation algorithm over a tree obtained by a relaxation of the network's structure (using the technique of variable duplication). We will subsequently present Iterative Join-Graph Propagation (IJGP) that sends messages between clusters that are allowed to form a cyclic structure.

Through these two schemes we investigate: (1) the quality of bounded inference as an anytime scheme (using mini-clustering); (2) the virtues of iterating messages in belief propagation type algorithms, and the result of combining bounded inference with iterative message-passing (in IJGP).

In the background section 2, we overview the Tree-Decomposition scheme that forms the basis for the rest of the paper. By relaxing two requirements of the tree-decomposition, that of connectedness (via mini-clustering) and that of tree structure (by allowing cycles in the underlying graph), we combine bounded inference and iterative message-passing with the basic tree-decomposition scheme, as elaborated in subsequent sections.

In Section 3 we present the partitioning-based anytime algorithm called Mini-Clustering (MC), which is a generalization of the Mini-Buckets algorithm (Dechter & Rish, 2003). It is a message-passing algorithm guided by a user adjustable parameter called i-bound, offering a flexible tradeoff between accuracy and efficiency in anytime style (in general the higher the i-bound, the better the accuracy). MC algorithm operates on a tree-decomposition, and similar to Pearl's belief propagation algorithm (Pearl, 1988) it converges in two passes, up and down the tree. Our contribution beyond other works in this area (Dechter & Rish, 1997; Dechter, Kask, & Larrosa, 2001) is in: (1) Extending the partition-based approximation for belief updating from mini-buckets to general tree-decompositions, thus allowing the computation of the updated beliefs for all the variables at once. This extension is similar to the one proposed by Dechter et al. (2001), but replaces optimization with probabilistic inference. (2) Providing empirical evaluation that demonstrates the effectiveness of the idea of tree-decomposition combined with partition-based approximation for belief updating.

Section 4 introduces the Iterative Join-Graph Propagation (IJGP) algorithm. It operates on a general join-graph decomposition that may contain cycles. It also provides a user adjustable i-bound parameter that defines the maximum cluster size of the graph (and hence bounds the complexity), therefore it is both anytime and iterative. While the algorithm IBP is typically presented as a generalization of Pearl's Belief Propagation algorithm, we show that IBP can be viewed as IJGP with the smallest i-bound.

We also provide insight into IJGP's behavior in Section 4. Zero-beliefs are variable-value pairs that have zero conditional probability given the evidence. We show that: (1) if a value of a variable is assessed as having zero-belief in any iteration of IJGP, it remains a zero-belief in all subsequent iterations; (2) IJGP converges in a finite number of iterations relative to its set of zero-beliefs; and, most importantly (3) that the set of zero-beliefs decided by any of the iterative belief propagation methods is sound. Namely any zero-belief determined by IJGP corresponds to a true zero conditional probability relative to the given probability distribution expressed by the Bayesian network.

Empirical results on various classes of problems are included in Section 5, shedding light on the performance of IJGP(i). We see that it is often superior, or otherwise comparable, to other state-of-the-art algorithms.

The paper is based in part on earlier conference papers by Dechter, Kask, and Mateescu (2002), Mateescu, Dechter, and Kask (2002) and Dechter and Mateescu (2003).

2. Background

In this section we provide background for exact and approximate probabilistic inference algorithms that form the basis of our work. While we present our algorithms in the context of directed probabilistic networks, they are applicable to any graphical model, including Markov networks.

2.1 Preliminaries

Notations: A reasoning problem is defined in terms of a set of variables taking values on finite domains and a set of functions defined over these variables. We denote variables or subsets of variables by uppercase letters (e.g., X, Y, Z, S, R. . .) and values of variables by lower case letters (e.g., x, y, z, s). An assignment (X1 = x1,. . ., Xn = xn) can be abbreviated as x = (x1,. . ., xn). For a subset of variables S, DS denotes the Cartesian product of the domains of variables in S. xS is the projection of x = (x1,. . ., xn) over a subset S. We denote functions by letters f, g, h, etc., and the scope (set of arguments) of the function f by scope(f).

Definition 1 (graphical model) (Kask, Dechter, Larrosa, & Dechter, 2005) A graphical model M is a 3-tuple, M=X,D,F, where: X = {X1,. . ., Xn} is a finite set of variables; D = {D1,. . ., Dn} is the set of their respective finite domains of values; F = {f1,. . ., fr} is a set of positive real-valued discrete functions, each defined over a subset of variables SiX, called its scope, and denoted by scope (fi). A graphical model typically has an associated combination operator 1 [multiply sign in circle], (e.g., {Π,} - product, sum). The graphical model represents the combination of all its functions: i=1rfi. A graphical model has an associated primal graph that captures the structural information of the model:

Definition 2 (primal graph, dual graph) The primal graph of a graphical model is an undirected graph that has variables as its vertices and an edge connects any two vertices whose corresponding variables appear in the scope of the same function. A dual graph of a graphical model has a one-to-one mapping between its vertices and functions of the graphical model. Two vertices in the dual graph are connected if the corresponding functions in the graphical model share a variable. We denote the primal graph by G = (X, E), where X is the set of variables and E is the set of edges.

Definition 3 (belief networks) A belief (or Bayesian) network is a graphical model B=X,D,G,P, where G = (X, E) is a directed acyclic graph over variables X and P = {pi}, where pi = {p(Xi | pa (Xi))} are conditional probability tables (CPTs) associated with each variable Xi and pa(Xi) = scope(pi)−{Xi} is the set of parents of Xi in G. Given a subset of variables S, we will write P(s) as the probability P(S = s), where s [set membership] DS. A belief network represents a probability distribution over X, P(x1,,xn)=Πi=1nP(xixpa(Xi)). An evidence set e is an instantiated subset of variables. The primal graph of a belief network is called a moral graph. It can be obtained by connecting the parents of each vertex in G and removing the directionality of the edges. Equivalently, it connects any two variables appearing in the same family (a variable and its parents in the CPT).

Two common queries in Bayesian networks are Belief Updating (BU) and Most Probable Explanation (MPE).

Definition 4 (belief network queries) The Belief Updating (BU) task is to find the posterior probability of each single variable given some evidence e, that is to compute P(Xi|e). The Most Probable Explanation (MPE) task is to find a complete assignment to all the variables having maximum probability given the evidence, that is to compute argmaxX[product]ipi.

2.2 Tree-Decomposition Schemes

Tree-decomposition is at the heart of most general schemes for solving a wide range of automated reasoning problems, such as constraint satisfaction and probabilistic inference. It is the basis for many well-known algorithms, such as join-tree clustering and bucket elimination. In our presentation we will follow the terminology of Gottlob, Leone, and Scarcello (2000) and Kask et al. (2005).

Definition 5 (tree-decomposition, cluster-tree) Let B=X,D,G,P be a belief network. A tree-decomposition for B is a triple left angle bracketT, χ, ψright angle bracket, where T = (V, E) is a tree, and χ and ψ are labeling functions which associate with each vertex v [set membership] V two sets, χ(v) [subset, dbl equals] X and ψ(v) [subset, dbl equals] P satisfying:

  1. For each function pi [set membership] P, there is exactly one vertex v [set membership] V such that pi [set membership] ψ(v), and scope(pi) [subset, dbl equals]χ(v).
  2. For each variable Xi [set membership] X, the set {v [set membership] V|Xi [set membership] χ(v)} induces a connected subtree of T. This is also called the running intersection (or connectedness) property.

We will often refer to a node and its functions as a cluster and use the term tree-decomposition and cluster-tree interchangeably.

Definition 6 (treewidth, separator, eliminator) Let D = left angle bracketT, χ, ψright angle bracket be a tree-decomposition of a belief network B. The treewidth (Arnborg, 1985) of D is maxv[set membership]V|χ(v) − 1. The treewidth of B is the minimum treewidth over all its tree-decompositions. Given two adjacent vertices u and v of a tree-decomposition, the separator of u and v is defined as sep(u,v)=χ(u)χ(v), and the eliminator of u with respect to v is elim(u, v) = χ(u) − χ(v). The separator-width of D is max(u,v)|sep(u, v)|. The minimum treewidth of a graph G can be shown to be identical to a related parameter called induced-width (Dechter & Pearl, 1987).

Join-tree and cluster-tree elimination (CTE) In both Bayesian network and constraint satisfaction communities, the most used tree-decomposition method is join-tree decomposition (Lauritzen & Spiegelhalter, 1988; Dechter & Pearl, 1989), introduced based on relational database concepts (Maier, 1983). Such decompositions can be generated by embedding the network's moral graph G into a chordal graph, often using a triangulation algorithm and using its maximal cliques as nodes in the join-tree. The triangulation algorithm assembles a join-tree by connecting the maximal cliques in the chordal graph in a tree. Subsequently, every CPT pi is placed in one clique containing its scope. Using the previous terminology, a join-tree decomposition of a belief network B=X,D,G,P is a tree T = (V, E), where V is the set of cliques of a chordal graph G′ that contains G, and E is a set of edges that form a tree between cliques, satisfying the running intersection property (Maier, 1983). Such a join-tree satisfies the properties of tree-decomposition and is therefore a cluster-tree (Kask et al., 2005). In this paper, we will use the terms tree-decomposition and join-tree decomposition interchangeably.

There are a few variants for processing join-trees for belief updating (e.g., Jensen et al., 1990; Shafer & Shenoy, 1990). We adopt here the version from Kask et al. (2005), called cluster-tree-elimination (CTE), that is applicable to tree-decompositions in general and is geared towards space savings. It is a message-passing algorithm; for the task of belief updating, messages are computed by summation over the eliminator between the two clusters of the product of functions in the originating cluster. The algorithm, denoted CTE-BU (see Figure 1), pays a special attention to the processing of observed variables since the presence of evidence is a central component in belief updating. When a cluster sends a message to a neighbor, the algorithm operates on all the functions in the cluster except the message from that particular neighbor. The message contains a single combined function and individual functions that do not share variables with the relevant eliminator. All the non-individual functions are combined in a product and summed over the eliminator.

Figure 1
Algorithm Cluster-Tree-Elimination for Belief Updating (CTE-BU).

Example 1 Figure 2a describes a belief network and Figure 2b a join-tree decomposition for it. Figure 2c shows the trace of running CTE-BU with evidence G = ge, where h(u,v) is a message that cluster u sends to cluster v.

Figure 2
(a) A belief network; (b) A join-tree decomposition; (c) Execution of CTE-BU.

Theorem 1 (complexity of CTE-BU) (Dechter et al., 2001; Kask et al., 2005) Given a Bayesian network B=X,D,G,P and a tree-decomposition left angle bracketT, χ, ψright angle bracket of B, the time complexity of CTEBU is O(deg · (n + N) · dw*+1) and the space complexity is O(N · dsep), where deg is the maximum degree of a node in the tree-decomposition, n is the number of variables, N is the number of nodes in the tree-decomposition, d is the maximum domain size of a variable, w* is the treewidth and sep is the maximum separator size.

3. Partition-Based Mini-Clustering

The time, and especially the space complexity, of CTE-BU renders the algorithm infeasible for problems with large treewidth. We now introduce Mini-Clustering, a partition-based anytime algorithm which computes bounds or approximate values on P(Xi, e) for every variable Xi.

3.1 Mini-Clustering Algorithm

Combining all the functions of a cluster into a product has a complexity exponential in its number of variables, which is upper bounded by the induced width. Similar to the mini-bucket scheme (Dechter, 1999), rather than performing this expensive exact computation, we partition the cluster into p mini-clusters mc(1),. . ., mc(p), each having at most i variables, where i is an accuracy parameter. Instead of computing by CTE-BU h(u,v)=elim(u,v)fψ(u)f, we can divide the functions of ψ(u) into p mini-clusters mc(k), k [set membership] {1,. . ., p}, and rewrite h(u,v)=elim(u,v)fψ(u)f=elim(u,v)k=1pfmc(k)f. By migrating the summation operator into each mini-cluster, yielding k=1pelim(u,v)fmc(k)f, we get an upper bound on h(u,v). The resulting algorithm is called MC-BU(i).

Consequently, the combined functions are approximated via mini-clusters, as follows. Suppose u [set membership] V has received messages from all its neighbors other than v (the message from v is ignored even if received). The functions in clusterv(u) that are to be combined are partitioned into mini-clusters {mc(1),. . ., mc(p)}, each one containing at most i variables. Each mini-cluster is processed by summation over the eliminator, and the resulting combined functions as well as all the individual functions are sent to v. It was shown by Dechter and Rish (2003) that the upper bound can be improved by using the maximization operator max rather than the summation operator sum on some mini-buckets. Similarly, lower bounds can be generated by replacing sum with min (minimization) for some mini-buckets. Alternatively, we can replace sum by a mean operator (taking the sum and dividing by the number of elements in the sum), in this case deriving an approximation of the joint belief instead of a strict upper bound.

Algorithm MC-BU for upper bounds can be obtained from CTE-BU by replacing step 2 of the main loop and the final part of computing the upper bounds on the joint belief by the procedure given in Figure 3. In the implementation we used for the experiments reported here, the partitioning was done in a greedy brute-force manner. We ordered the functions according to their sizes (number of variables), breaking ties arbitrarily. The largest function was placed in a mini-cluster by itself. Then, we picked the largest remaining function and probed the mini-clusters in the order of their creation, trying to find one that together with the new function would have no more than i variables. A new mini-cluster was created whenever the existing ones could not accommodate the new function.

Figure 3
Procedure Mini-Clustering for Belief Updating (MC-BU).

Example 2 Figure 4 shows the trace of running MC-BU(3) on the problem in Figure 2. First, evidence G = ge is assigned in all CPTs. There are no individual functions to be sent from cluster 1 to cluster 2. Cluster 1 contains only 3 variables, χ(1) = {A, B, C}, therefore it is not partitioned. The combined function h(1,2)1(b,c)=ap(a)p(ba)p(ca,b) is computed and the message H(1,2)={h(1,2)1(b,c)} is sent to node 2. Now, node 2 can send its message to node 3. Again, there are no individual functions. Cluster 2 contains 4 variables, χ(2) = {B, C, D, F}, and a partitioning is necessary: MC-BU(3) can choose mc(1) = {p(d|b), h(1,2)(b, c)} and md(2) = {p(f|c, d)}. The combined functions h(2,3)1(b)=c,dp(db)h(1,2)(b,c) and h(2,3)2(f)=maxc,dp(fc,d) are computed and the message H(2,3)={h(2,3)1(b),h(2,3)2(f)} is sent to node 3. The algorithm continues until every node has received messages from all its neighbors. An upper bound on p(a, G = ge) can now be computed by choosing cluster 1, which contains variable A. It doesn't need partitioning, so the algorithm just computes b,cp(a)p(ba)p(ca,b)h(2,1)1(b)h(2,1)2(c). Notice that unlike CTE-BU which processes 4 variables in cluster 2, MC-BU(3) never processes more than 3 variables at a time.

Figure 4
Execution of MC-BU for i = 3.

It was already shown that:

Theorem 2 (Dechter & Rish, 2003) Given a Bayesian network B=X,D,G,P and the evidence e, the algorithm MC-BU(i) computes an upper bound on the joint probability P(Xi, e) of each variable Xi (and each of its values) and the evidence e.

Theorem 3 (complexity of MC-BU(i)) (Dechter et al., 2001) Given a Bayesian network B=X,D,G,P and a tree-decomposition left angle bracketT, χ, ψright angle bracket of B, the time and space complexity of MC-BU(i) is O(n · hw* · di), where n is the number of variables, d is the maximum domain size of a variable and hw=maxuT{fPscope(f)χ(u)ϕ}, which bounds the number of mini-clusters.

Semantics of Mini-Clustering The mini-bucket scheme was shown to have the semantics of relaxation via node duplication (Kask & Dechter, 2001; Choi, Chavira, & Darwiche, 2007). We extend it to mini-clustering by showing how it can apply as is to messages that flow in one direction (inward, from leaves to root), as follows. Given a tree-decomposition D, where CTE-BU computes a function h(u,v) (the message that cluster u sends to cluster v), MC-BU(i) partitions cluster u into p mini-clusters u1,. . ., up, which are processed independently and then the resulting functions h(ui,v) are sent to v. Instead consider a different decomposition D′, which is just like D, with the exception that (a) instead of u, it has clusters u1,. . ., up, all of which are children of v, and each variable appearing in more than a single mini-cluster becomes a new variable, (b) each child w of u (in D) is a child of uk (in D′), such that h(w,u) (in D) is assigned to uk (in D') during the partitioning. Note that D′ is not a legal tree-decomposition relative to the original variables since it violates the connectedness property: the mini-clusters u1,. . ., up contain variables elim(u, v) but the path between the nodes u1,. . ., up (this path goes through v) does not. However, it is a legal tree-decomposition relative to the new variables. It is straightforward to see that H(u,v) computed by MC-BU(i) on D is the same as {h(ui,v)|i = 1,. . ., p} computed by CTE-BU on D′ in the direction from leaves to root.

If we want to capture the semantics of the outward messages from root to leaves, we need to generate a different relaxed decomposition (D″) because MC, as defined, allows a different partitioning in the up and down streams of the same cluster. We could of course stick with the decomposition in D′ and use CTE in both directions which would lead to another variant of mini-clustering.

Example 3 Figure 5(a) shows a trace of the bottom-up phase of MC-BU(3) on the network in Figure 4. Figure 5(b) shows a trace of the bottom-up phase of CTE-BU algorithm on a problem obtained from the problem in Figure 4 by splitting nodes D (into D′ and D″) and F (into F′ and F″).

Figure 5
Node duplication semantics of MC: (a) trace of MC-BU(3); (b) trace of CTE-BU.

The MC-BU algorithm computes an upper bound P(Xi, e) on the joint probability P(Xi, e). However, deriving a bound on the conditional probability P(Xi|e) is not easy when the exact value of P(e) is not available. If we just try to divide (multiply) P(Xi, e) by a constant, the result is not necessarily an upper bound on P(Xi|e). It is easy to show that normalization, P(xi,e)xiDiP(xi,e), with the mean operator is identical to normalization of MC-BU output when applying the summation operator in all the mini-clusters.

MC-BU(i) is an improvement over the Mini-Bucket algorithm MB(i), in that it allows the computation of P(Xi, e) for all variables with a single run, whereas MB(i) computes P(Xi, e) for just one variable, with a single run. When computing P(Xi, e) for each variable, MB(i) has to be run n times, once for each variable, an algorithm we call nMB(i). It was demonstrated by Mateescu et al. (2002) that MC-BU(i) has up to linear speed-up over nMB(i). For a given i, the accuracy of MC-BU(i) can be shown to be not worse than that of nMB(i).

3.2 Experimental Evaluation of Mini-Clustering

The work of Mateescu et al. (2002) and Kask (2001) provides an empirical evaluation of MC-BU that reveals the impact of the accuracy parameter on its quality of approximation and compares with Iterative Belief Propagation and a Gibbs sampling scheme. We will include here only a subset of these experiments which will provide the essence of our results. Additional empirical evaluation of MC-BU will be given when comparing against IJGP later in this paper.

We tested the performance of MC-BU(i) on random Noisy-OR networks, random coding networks, general random networks, grid networks, and three benchmark CPCS files with 54, 360 and 422 variables respectively (these are belief networks for medicine, derived from the Computer based Patient Case Simulation system, known to be hard for belief updating). On each type of network we ran Iterative Belief Propagation (IBP) - set to run at most 30 iterations, Gibbs Sampling (GS) and MC-BU(i), with i from 2 to the treewidth w* to capture the anytime behavior of MC-BU(i).

The random networks were generated using parameters (N,K,C,P), where N is the number of variables, K is their domain size (we used only K=2), C is the number of conditional probability tables and P is the number of parents in each conditional probability table. The parents in each table are picked randomly given a topological ordering, and the conditional probability tables are filled randomly. The grid networks have the structure of a square, with edges directed to form a diagonal flow (all parallel edges have the same direction). They were generated by specifying N (a square integer) and K (we used K=2). We also varied the number of evidence nodes, denoted by |e| in the tables. The parameter values are reported in each table. For all the problems, Gibbs sampling performed consistently poorly so we only include part of its results here.

In our experiments we focused on the approximation power of MC-BU(i). We compared two versions of the algorithm. In the first version, for every cluster, we used the max operator in all its mini-clusters, except for one of them that was processed by summation. In the second version, we used the operator mean in all the mini-clusters. We investigated this second version of the algorithm for two reasons: (1) we compare MC-BU(i) with IBP and Gibbs sampling, both of which are also approximation algorithms, so it would not be possible to compare with a bounding scheme; (2) we observed in our experiments that, although the bounds improve as the i-bound increases, the quality of bounds computed by MC-BU(i) was still poor, with upper bounds being greater than 1 in many cases.2 Notice that we need to maintain the sum operator for at least one of the mini-clusters. The mean operator simply performs summation and divides by the number of elements in the sum. For example, if A, B, C are binary variables (taking values 0 and 1), and f(A, B, C) is the aggregated function of one mini-cluster, and elim = {A, B}, then computing the message h(C) by the mean operator gives: 14A,B{0,1}f(A,B,C).

We computed the exact solution and used three different measures of accuracy: 1) Normalized Hamming Distance (NHD) - we picked the most likely value for each variable for the approximate and for the exact, took the ratio between the number of disagreements and the total number of variables, and averaged over the number of problems that we ran for each class; 2) Absolute Error (Abs. Error) - is the absolute value of the difference between the approximate and the exact, averaged over all values (for each variable), all variables and all problems; 3) Relative Error (Rel. Error) - is the absolute value of the difference between the approximate and the exact, divided by the exact, averaged over all values (for each variable), all variables and all problems. For coding networks, we report only one measure, Bit Error Rate (BER). In terms of the measures defined above, BER is the normalized Hamming distance between the approximate (computed by an algorithm) and the actual input (which in the case of coding networks may be different from the solution given by exact algorithms), so we denote them differently to make this semantic distinction. We also report the time taken by each algorithm. For reported metrics (time, error, etc.) provided in the Tables, we give both mean and max values.

In Figure 6 we show that IBP converges after about 5 iterations. So, while in our experiments we report its time for 30 iterations, its time is even better when sophisticated termination is used. These results are typical of all runs.

Figure 6
Convergence of IBP (50 variables, evidence from 0-30 variables).

Random Noisy-OR networks results

are summarized in Tables 1 and and2,2, and Figure 7. For NHD, both IBP and MC-BU gave perfect results. For the other measures, we noticed that IBP is more accurate when there is no evidence by about an order of magnitude. However, as evidence is added, IBP's accuracy decreases, while MC-BU's increases and they give similar results. We see that MC-BU gets better as the accuracy parameter i increases, which shows its anytime behavior.

Figure 7
Absolute error for Noisy-OR networks.
Table 1
Performance on Noisy-OR networks, w* = 10: Normalized Hamming Distance, absolute error, relative error and time.
Table 2
Performance on Noisy-OR networks, w* = 16: Normalized Hamming Distance, absolute error, relative error and time.

General random networks results

are summarized in Figure 8. They are similar to those for random Noisy-OR networks. Again, IBP has the best result only when the number of evidence variables is small. It is remarkable how quickly MC-BU surpasses the performance of IBP as evidence is added (for more, see the results of Mateescu et al., 2002).

Figure 8
Absolute error for random networks.

Random coding networks results

are given in Table 3 and Figure 9. The instances fall within the class of linear block codes, (σ is the channel noise level). It is known that IBP is very accurate for this class. Indeed, these are the only problems that we experimented with where IBP outperformed MC-BU throughout. The anytime behavior of MC-BU can again be seen in the variation of numbers in each column and more vividly in Figure 9.

Figure 9
Bit Error Rate (BER) for coding networks.
Table 3
Bit Error Rate (BER) for coding networks.

Grid networks results

are given in Figure 10. We notice that IBP is more accurate for no evidence and MC-BU is better as more evidence is added. The same behavior was consistently manifested for smaller grid networks that we experimented with (from 7×7 up to 14×14).

Figure 10
Grid 15×15: absolute error and time.

CPCS networks results

We also tested on three CPCS benchmark files. The results are given in Figure 11. It is interesting to notice that the MC-BU scheme scales up to fairly large networks, like the real life example of CPCS422 (induced width 23). IBP is again more accurate when there is no evidence, but is surpassed by MC-BU when evidence is added. However, whereas MC-BU is competitive with IBP time-wise when i-bound is small, its runtime grows rapidly as i-bound increases. For more details on all these benchmarks see the results of Mateescu et al. (2002).

Figure 11
Absolute error for CPCS422.

Summary

Our results show that, as expected, IBP is superior to all other approximations for coding networks. However, for random Noisy-OR, general random, grid networks and the CPCS networks, in the presence of evidence, the mini-clustering scheme is often superior even in its weakest form. The empirical results are particularly encouraging as we use an un-optimized scheme that exploits a universal principle applicable to many reasoning tasks.

4. Join-Graph Decomposition and Propagation

In this section we introduce algorithm Iterative Join-Graph Propagation (IJGP) which, like mini-clustering, is designed to benefit from bounded inference, but also exploit iterative message-passing as used by IBP. Algorithm IJGP can be viewed as an iterative version of mini-clustering, improving the quality of approximation, especially for low i-bounds. Given a cluster of the decomposition, mini-clustering can potentially create a different partitioning for every message sent to a neighbor. This dynamic partitioning can happen because the incoming message from each neighbor has to be excluded when realizing the partitioning, so a different set of functions are split into mini-clusters for every message to a neighbor. We can define a version of mini-clustering where for every cluster we create a unique static partition into mini-clusters such that every incoming message can be included into one of the mini-clusters. This version of MC can be extended into IJGP by introducing some links between mini-clusters of the same cluster, and carefully limiting the interaction between the resulting nodes in order to eliminate over-counting.

Algorithm IJGP works on a general join-graph that may contain cycles. The cluster size of the graph is user adjustable via the i-bound (providing the anytime nature), and the cycles in the graph allow the iterative application of message-passing. In Subsection 4.1 we introduce join-graphs and discuss their properties. In Subsection 4.2 we describe the IJGP algorithm itself.

4.1 Join-Graphs

Definition 7 (join-graph decomposition) A join-graph decomposition for a belief network B=X,D,G,P is a triple D = left angle bracketJG, χ, ψright angle bracket, where JG = (V, E) is a graph, and χ and ψ are labeling functions which associate with each vertex v [set membership] V two sets, χ(v) [subset, dbl equals] X and ψ(v) [subset, dbl equals] P such that:

  1. For each pi [set membership] P, there is exactly one vertex v [set membership] V such that pi [set membership] ψ(v), and scope(pi) [subset, dbl equals] χ(v).
  2. (connectedness) For each variable Xi [set membership] X, the set {v [set membership] V|Xi [set membership] χ(v)} induces a connected subgraph of JG. The connectedness requirement is also called the running intersection property.

We will often refer to a node in V and its CPT functions as a cluster3 and use the term join-graph decomposition and cluster-graph interchangeably. Clearly, a join-tree decomposition or a cluster-tree is the special case when the join-graph D is a tree.

It is clear that one of the problems of message propagation over cyclic join-graphs is over-counting. To reduce this problem we devise a scheme, which avoids cyclicity with respect to any single variable. The algorithm works on edge-labeled join-graphs.

Definition 8 (minimal edge-labeled join-graph decompositions) An edge-labeled join-graph decomposition for B=X,D,G,P is a four-tuple D = left angle bracketJG, χ, ψ, θright angle bracket, where JG = (V, E) is a graph, χ and ψ associate with each vertex v [set membership] V the sets χ(v) [subset, dbl equals] X and ψ(v) [subset, dbl equals] P and θ associates with each edge (v,u)E the set θ((v, u)) [subset, dbl equals] X such that:

  1. For each function pi [set membership] P, there is exactly one vertex v [set membership] V such that pi [set membership] ψ(v), and scope(pi) [subset, dbl equals] χ(v).
  2. (edge-connectedness) For each edge (u,v),θ((u,v))χ(u)χ(v), such that ∀Xi [set membership] X, any two clusters containing Xi can be connected by a path whose every edge label includes Xi.

Finally, an edge-labeled join-graph is minimal if no variable can be deleted from any label while still satisfying the edge-connectedness property.

Definition 9 (separator, eliminator of edge-labeled join-graphs) Given two adjacent vertices u and v of JG, the separator of u and v is defined as sep(u, v) = θ((u, v)), and the eliminator of u with respect to v is elim(u, v) = χ(u) − θ((u, v)). The separator width is max(u,v)|sep(u, v)|.

Edge-labeled join-graphs can be made label minimal by deleting variables from the labels while maintaining connectedness (if an edge label becomes empty, the edge can be deleted). It is easy to see that,

Proposition 1 A minimal edge-labeled join-graph does not contain any cycle relative to any single variable. That is, any two clusters containing the same variable are connected by exactly one path labeled with that variable.

Notice that every minimal edge-labeled join-graph is edge-minimal (no edge can be deleted), but not vice-versa.

Example 4 The example in Figure 12a shows an edge minimal join-graph which contains a cycle relative to variable 4, with edges labeled with separators. Notice however that if we remove variable 4 from the label of one edge we will have no cycles (relative to single variables) while the connectedness property is still maintained.

Figure 12
An edge-labeled decomposition.

The mini-clustering approximation presented in the previous section works by relaxing the join-tree requirement of exact inference into a collection of join-trees having smaller cluster size. It introduces some independencies in the original problem via node duplication and applies exact inference on the relaxed model requiring only two message passings. For the class of IJGP algorithms we take a different route. We choose to relax the tree-structure requirement and use join-graphs which do not introduce any new independencies, and apply iterative message-passing on the resulting cyclic structure.

Indeed, it can be shown that any join-graph of a belief network is an I-map (independency map, Pearl, 1988) of the underlying probability distribution relative to node-separation. Since we plan to use minimally edge-labeled join-graphs to address over-counting problems, the question is what kind of independencies are captured by such graphs.

Definition 10 (edge-separation in edge-labeled join-graphs) Let D = left angle bracketJG, χ, ψ, θright angle bracket, JG = (V, E) be an edge-labeled decomposition of a Bayesian network B=X,D,G,P. Let NW, NY [subset, dbl equals] V be two sets of nodes, and EZ [subset, dbl equals] E be a set of edges in JG. Let W, Y, Z be their corresponding sets of variables (W=vNWχ(v),Z=eEZθ(e)). We say that EZ edge-separates NW and NY in D if there is no path between NW and NY in the JG graph whose edges in EZ are removed. In this case we also say that W is separated from Y given Z in D, and write left angle bracketW|Z|Yright angle bracketD. Edge-separation in a regular join-graph is defined relative to its separators.

Theorem 4 Any edge-labeled join-graph decomposition D = left angle bracketJG, χ, ψ, θright angle bracket of belief network B=X,D,G,P is an I-map of P relative to edge-separation. Namely, any edge separation in D corresponds to conditional independence in P.

Proof: Let MG be the moral graph of BN. Since MG is an I-map of P, it is enough to prove that JG is an I-map of MG. Let NW and NY be disjoint sets of nodes and NZ be a set of edges in JG, and W, Z, Y be their corresponding sets of variables in MG. We will prove:

NWNZNYJGWZYMG

by contradiction. Since the sets W, Z, Y may not be disjoint, we will actually prove that left angle bracketWZ|Z|YZright angle bracketMG holds, this being equivalent to left angle bracketW|Z|Yright angle bracketMG.

Supposing left angle bracketWZ|Z|YZright angle bracketMG is false, then there exists a path α = γ1, γ2,. . ., γn−1, β = γn in MG that goes from some variable α = γ1 [set membership] WZ to some variable β = γn [set membership] YZ without intersecting variables in Z. Let Nv be the set of all nodes in JG that contain variable v [set membership] X, and let us consider the set of nodes:

S=i=1nNγiNZ

We argue that S forms a connected sub-graph in JG. First, the running intersection property ensures that every i, i = 1,. . ., n, remains connected in JG after removing the nodes in NZ (otherwise, it must be that there was a path between the two disconnected parts in the original JG, which implies that a γi is part of Z, which is a contradiction). Second, the fact that (γi, γi+1), i = 1,. . ., n − 1, is an edge in the moral graph MG implies that there is a conditional probability table (CPT) on both γi and γi+1, i = 1,. . ., n − 1 (and perhaps other variables). From property 1 of the definition of the join-graph, it follows that for all i = 1,. . ., n − 1 there exists a node in JG that contains both γi and γi+1. This proves the existence of a path in the mutilated join-graph (JG with NZ pulled out) from a node in NW containing α = γ1 to the node containing both γ1 and γ2(Nγ1 is connected), then from that node to the one containing both γ2 and γ3(Nγ2 is connected), and so on until we reach a node in NY containing α = γn. This shows that left angle bracketNW|NZ|NYright angle bracketJG is false, concluding the proof by contradiction. □

Interestingly however, deleting variables from edge labels or removing edges from edge-labeled join-graphs whose clusters are fixed will not increase the independencies captured by edge-labeled join-graphs. That is,

Proposition 2 Any two (edge-labeled) join-graphs defined on the same set of clusters, sharing (V , χ, ψ), express exactly the same set of independencies relative to edge-separation, and this set of independencies is identical to the one expressed by node separation in the primal graph of the join-graph.

Proof: This follows by looking at the primal graph of the join-graph (obtained by connecting any two nodes in a cluster by an arc over the original variables as nodes) and observing that any edge-separation in a join-graph corresponds to a node separation in the primal graph and vice-versa. □

Hence, the issue of minimizing computational over-counting due to cycles appears to be unrelated to the problem of maximizing independencies via minimal I-mapness. Nevertheless, to avoid over-counting as much as possible, we still prefer join-graphs that minimize cycles relative to each variable. That is, we prefer minimal edge-labeled join-graphs.

Relationship with region graphs

There is a strong relationship between our join-graphs and the region graphs of Yedidia et al. (2000, 2001, 2005). Their approach was inspired by advances in statistical physics, when it was realized that computing the partition function is essentially the same combinatorial problem that expresses probabilistic reasoning. As a result, variational methods from physics could have counterparts in reasoning algorithms. It was proved by Yedidia et al. (2000, 2001) that belief propagation on loopy networks can only converge (when it does so) to stationary points of the Bethe free energy. The Bethe approximation is only the simplest case of the more general Kikuchi (1951) cluster variational method. The idea is to group the variables together in clusters and perform exact computation in each cluster. One key question is then how to aggregate the results, and how to account for the variables that are shared between clusters. Again, the idea that everything should be counted exactly once is very important. This led to the proposal of region graphs (Yedidia et al., 2001, 2005) and the associated counting numbers for regions. They are given as a possible canonical version of graphs that can support Generalized Belief Propagation (GBP) algorithms. The join-graphs accomplish the same thing. The edge-labeled join-graphs can be described as region graphs where the regions are the clusters and the labels on the edges. The tree-ness condition with respect to every variable ensures that there is no over-counting.

A very similar approach to ours, which is also based on join-graphs appeared independently by McEliece and Yildirim (2002), and it is based on an information theoretic perspective.

4.2 Algorithm IJGP

Applying CTE iteratively to minimal edge-labeled join-graphs yields our algorithm Iterative Join-Graph Propagation (IJGP) described in Figure 13. One iteration of the algorithm applies message-passing in a topological order over the join-graph, forward and back. When node u sends a message (or messages) to a neighbor node v it operates on all the CPTs in its cluster and on all the messages sent from its neighbors excluding the ones received from v. First, all individual functions that share no variables with the eliminator are collected and sent to v. All the rest of the functions are combined in a product and summed over the eliminator between u and v.

Figure 13
Algorithm Iterative Join-Graph Propagation (IJGP).

Based on the results by Lauritzen and Spiegelhalter (1988) and Larrosa, Kask, and Dechter (2001) it can be shown that:

Theorem 5

  1. If IJGP is applied to a join-tree decomposition it reduces to join-tree clustering, and therefore it is guaranteed to compute the exact beliefs in one iteration.
  2. The time complexity of one iteration of IJGP is O(deg · (n + N) · dw*+1) and its space complexity is O(N · dθ), where deg is the maximum degree of a node in the join-graph, n is the number of variables, N is the number of nodes in the graph decomposition, d is the maximum domain size, w* is the maximum cluster size and is the maximum label size.

For proof, see the properties of CTE presented by Kask et al. (2005).

The special case of Iterative Belief Propagation

Iterative belief propagation (IBP) is an iterative application of Pearl's algorithm that was defined for poly-trees (Pearl, 1988), to any Bayesian network. We will describe IBP as an instance of join-graph propagation over a dual join-graph.

Definition 11 (dual graphs, dual join-graphs) Given a set of functions F = {f1,. . ., fl} over scopes S1,. . ., Sl, the dual graph of F is a graph DG = (V, E, L) that associates a node with each function, namely V = F and an edge connects any two nodes whose function's scope share a variable, E={(fi,fj)SiSjϕ}. L is a set of labels for the edges, each edge being labeled by the shared variables of its nodes, L={lij=SiSj(i,j)E}. A dual join-graph is an edge-labeled edge subgraph of DG that satisfies the connectedness property. A minimal dual join-graph is a dual join-graph for which none of the edge labels can be further reduced while maintaining the connectedness property.

Interestingly, there may be many minimal dual join-graphs of the same dual graph. We will define Iterative Belief Propagation on any dual join-graph. Each node sends a message over an edge whose scope is identical to the label on that edge. Since Pearl's algorithm sends messages whose scopes are singleton variables only, we highlight minimal singleton-label dual join-graphs.

Proposition 3 Any Bayesian network has a minimal dual join-graph where each edge is labeled by a single variable.

Proof: Consider a topological ordering of the nodes in the acyclic directed graph of the Bayesian network d = X1,. . ., Xn. We define the following dual join-graph. Every node in the dual graph D, associated with pi is connected to node pj, j < i if Xj [set membership] pa(Xi). We label the edge between pj and pi by variable Xj, namely lij = {Xj}. It is easy to see that the resulting edge-labeled subgraph of the dual graph satisfies connectedness. (Take the original acyclic graph G and add to each node its CPT family, namely all the other parents that precede it in the ordering. Since G already satisfies connectedness so is the minimal graph generated.) The resulting labeled graph is a dual graph with singleton labels. □

Example 5 Consider the belief network on 3 variables A, B, C with CPTs 1.P(C|A, B), 2.P(B|A) and 3.P(A), given in Figure 14a. Figure 14b shows a dual graph with singleton | labels on the edges. Figure 14c shows a dual graph which is a join-tree, on which belief propagation can solve the problem exactly in one iteration (two passes up and down the tree).

Figure 14
a) A belief network; b) A dual join-graph with singleton labels; c) A dual join-graph which is a join-tree.

For completeness, we present algorithm IBP, which is a special case of IJGP, in Figure 15. It is easy to see that one iteration of IBP is time and space linear in the size of the belief network. It can be shown that when IBP is applied to a minimal singleton-labeled dual graph it coincides with Pearl's belief propagation applied directly to the acyclic graph representation. Also, when the dual join-graph is a tree IBP converges after one iteration (two passes, up and down the tree) to the exact beliefs.

Figure 15
Algorithm Iterative Belief Propagation (IBP).

4.3 Bounded Join-Graph Decompositions

Since we want to control the complexity of join-graph algorithms, we will define it on decompositions having bounded cluster size. If the number of variables in a cluster is bounded by i, the time and space complexity of processing one cluster is exponential in i. Given a join-graph decomposition D = left angle bracketJG, χ, ψ, θright angle bracket, the accuracy and complexity of the (iterative) join-graph propagation algorithm depends on two different width parameters, defined next.

Definition 12 (external and internal widths) Given an edge-labeled join-graph decomposition D = left angle bracketJG, χ, ψ, θright angle bracket of a network B=X,D,G,P, the internal width of D is maxv[set membership]V|χ(v)|, while the external width of D is the treewidth of JG as a graph.

Using this terminology we can now state our target decomposition more clearly. Given a graph G, and a bounding parameter i we wish to find a join-graph decomposition D of G whose internal width is bounded by i and whose external width is minimized. The bound i controls the complexity of join-graph processing while the external width provides some measure of its accuracy and speed of convergence, because it measures how close the join-graph is to a join-tree.

We can consider two classes of algorithms. One class is partition-based. It starts from a given tree-decomposition and then partitions the clusters until the decomposition has clusters bounded by i. An alternative approach is grouping-based. It starts from a minimal dual-graph-based join-graph decomposition (where each cluster contains a single CPT) and groups clusters into larger clusters as long as the resulting clusters do not exceed the given bound. In both methods one should attempt to reduce the external width of the generated graph-decomposition. Our partition-based approach inspired by the mini-bucket idea (Dechter & Rish, 1997) is as follows.

Given a bound i, algorithm Join-Graph Structuring(i) applies the procedure Schematic Mini-Bucket(i), described in Figure 17. The procedure only traces the scopes of the functions that would be generated by the full mini-bucket procedure, avoiding actual computation. The procedure ends with a collection of mini-bucket trees, each rooted in the mini-bucket of the first variable. Each of these trees is minimally edge-labeled. Then, in-edges labeled with only one variable are introduced, and they are added only to obtain the running intersection property between branches of these trees.

Figure 17
Procedure Schematic Mini-Bucket(i).

Proposition 4 Algorithm Join-Graph Structuring(i) generates a minimal edge-labeled join-graph decomposition having bound i.

Proof: The construction of the join-graph specifies the vertices and edges of the join-graph, as well as the variable and function labels of each vertex. We need to demonstrate that 1) the connectedness property holds, and 2) that edge-labels are minimal.

Connectedness property specifies that for any 2 vertices u and v, if vertices u and v contain variable X, then there must be a path u, w1,. . ., wm, v between u and v such that every vertex on this path contains variable X. There are two cases here. 1) u and v correspond to 2 mini-buckets in the same bucket, or 2) u and v correspond to mini-buckets in different buckets. In case 1 we have 2 further cases, 1a) variable X is being eliminated in this bucket, or 1b) variable X is not eliminated in this bucket. In case 1a, each mini-bucket must contain X and all mini-buckets of the bucket are connected as a chain, so the connectedness property holds. In case 1b, vertexes u and v connect to their (respectively) parents, who in turn connect to their parents, etc. until a bucket in the scheme where variable X is eliminated. All nodes along this chain connect variable X, so the connectedness property holds. Case 2 resolves like case 1b.

To show that edge labels are minimal, we need to prove that there are no cycles with respect to edge labels. If there is a cycle with respect to variable X, then it must involve at least one in-edge (edge connecting two mini-buckets in the same bucket). This means variable X must be the variable being eliminated in the bucket of this in-edge. That means variable X is not contained in any of the parents of the mini-buckets of this bucket. Therefore, in order for the cycle to exist, another in-edge down the bucket-tree from this bucket must contain X. However, this is impossible as this would imply that variable X is eliminated twice. □

Example 6 Figure 18a shows the trace of procedure schematic mini-bucket(3) applied to the problem described in Figure 2a. The decomposition in Figure 18b is created by the algorithm graph structuring. The only cluster partitioned is that of F into two scopes (FCD) and (BF), connected by an in-edge labeled with F.

Figure 18
Join-graph decompositions.

A range of edge-labeled join-graphs is shown in Figure 19. On the left side we have a graph with smaller clusters, but more cycles. This is the type of graph IBP works on. On the right side we have a tree decomposition, which has no cycles at the expense of bigger clusters. In between, there could be a number of join-graphs where maximum cluster size can be traded for number of cycles. Intuitively, the graphs on the left present less complexity for join-graph algorithms because the cluster size is smaller, but they are also likely to be less accurate. The graphs on the right side are computationally more complex, because of the larger cluster size, but they are likely to be more accurate.

Figure 19
Join-graphs.

4.4 The Inference Power of IJGP

The question we address in this subsection is why propagating the messages iteratively should help. Why is IJGP upon convergence superior to IJGP with one iteration and superior to MC? One clue can be provided when considering deterministic constraint networks which can be viewed as “extreme probabilistic networks”. It is known that constraint propagation algorithms, which are analogous to the messages sent by belief propagation, are guaranteed to converge and are guaranteed to improve with iteration. The propagation scheme of IJGP works similar to constraint propagation relative to the flat network abstraction of the probability distribution (where all non-zero entries are normalized to a positive constant), and propagation is guaranteed to be more accurate for that abstraction at least.

In the following we will shed some light on the IJGP's behavior by making connections with the well-known concept of arc-consistency from constraint networks (Dechter, 2003). We show that: (a) if a variable-value pair is assessed as having a zero-belief, it remains as zero-belief in subsequent iterations; (b) that any variable-value zero-beliefs computed by IJGP are correct; (c) in terms of zero/non-zero beliefs, IJGP converges in finite time. We have also empirically investigated the hypothesis that if a variable-value pair is assessed by IBP or IJGP as having a positive but very close to zero belief, then it is very likely to be correct. Although the experimental results shown in this paper do not contradict this hypothesis, some examples in more recent experiments by Dechter, Bidyuk, Mateescu, and Rollon (2010) invalidate it.

4.4.1 IJGP and Arc-Consistency

For any belief network we can define a constraint network that captures the assignments having strictly positive probability. We will show a correspondence between IJGP applied to the belief network and an arc-consistency algorithm applied to the constraint network. Since arc-consistency algorithms are well understood, this correspondence not only proves the target claims, but may provide additional insight into the behavior of IJGP. It justifies the iterative application of belief propagation, and it also illuminates its “distance” from being complete.

Definition 13 (constraint satisfaction problem) A Constraint Satisfaction Problem (CSP) is a triple left angle bracketX, D, Cright angle bracket, where X = {X1,. . ., Xn} is a set of variables associated with a set of discrete-valued domains D = {D1,. . ., Dn} and a set of constraints C = {C1,. . ., Cm}. Each constraint Ci is a pair left angle bracketSi, Riright angle bracket where Ri is a relation Ri [subset, dbl equals] DSi defined on a subset of variables Si [subset, dbl equals] X and DSi is a Cartesian product of the domains of variables Si. The relation Ri denotes all compatible tuples of DSi allowed by the constraint. A projection operator π creates a new relation, πSj(Ri)={xxDSjandy,yDSi=SjandxyRi}, where Sj [subset, dbl equals] Si. Constraints can be combined with the join operator [bowtie], resulting in a new relation, RiRj={xπSi(x)RiandπSj(x)Rj}. A solution is an assignment of values to all the variables x = (x1,. . . xn), [set membership] Di, such thatCi [set membership] C, xSi [set membership] Ri. The constraint network represents its set of solutions, iCi.

Given a belief network B, we define a flattening of the Bayesian network into a constraint network called flat(B), where all the zero entries in a probability table are removed from the corresponding relation. The network flat(B) is defined over the same set of variables and has the same set of domain values as B.

Definition 14 (flat network) Given a Bayesian network B=X,D,G,P, the flat network flat(B) is a constraint network, where the set of variables is X, and for every Xi [set membership] X and its CPT P(Xipa(Xi))B we define a constraint RFi over the family of Xi, Fi={Xi}pa(Xi) as follows: for every assignment x = (xi, xpa(Xi) to Fi, (xi, xpa(Xi)) [set membership] RFi iff P(xi|xpa(Xi)) > 0.

Theorem 6 Given a belief network B=X,D,G,P, where X = {X1,. . ., Xn}, for any tuple x=(x1,,xn):PB(x)>0xsol(flat(B)) where sol(flat(B)) is the set}of solutions of the flat constraint network.

Proof: PB(x)>0Πi=1nP(xixpa(Xi))>0i{1,,n},P(xixpa(Xi))>0i{1,,n},(xi,xpa(Xi))RFixsol(flat(B)). □

Constraint propagation is a class of polynomial time algorithms that are at the center of constraint processing techniques. They were investigated extensively in the past three decades and the most well known versions are arc-, path-, and i-consistency (Dechter, 1992, 2003).

Definition 15 (arc-consistency) (Mackworth, 1977) Given a binary constraint network (X, D, C), the network is arc-consistent iff for every binary constraint Rij [set membership] C, every value v [set membership] Di has a value u [set membership] Dj s.t. (v, u) [set membership] Rij.

Note that arc-consistency is defined for binary networks, namely the relations involve at most two variables. When a binary constraint network is not arc-consistent, there are algorithms that can process it and enforce arc-consistency. The algorithms remove values from the domains of the variables that violate arc-consistency until an arc-consistent network is generated. There are several versions of improved performance arc-consistency algorithms, however we will consider a non-optimal distributed version, which we call distributed arc-consistency.

Definition 16 (distributed arc-consistency algorithm) The algorithm distributed arc-consistency is a message-passing algorithm over a constraint network. Each node is a variable, and maintains a current set of viable values Di. Let ne(i) be the set of neighbors of Xi in the constraint graph. Every node Xi sends a message to any node Xj [set membership] ne(i), which consists of the values in Xj's domain that are consistent with the current Di, relative to the constraint Rji that they share. Namely, the message that Xi sends to Xj, denoted by Dij, is:

Dijπj(RjiDi)
(1)

and in addition node i computes:

DiDi(kne(i)Dki)
(2)

Clearly the algorithm can be synchronized into iterations, where in each iteration every node computes its current domain based on all the messages received so far from its neighbors (Eq. 2), and sends a new message to each neighbor (Eq. 1). Alternatively, Equations 1 and 2 can be combined. The message Xi sends to Xj is:

Dijπj(RjiDikne(i)Dki)
(3)

Next we will define a join-graph decomposition for the flat constraint network so that we can establish a correspondence between the join-graph decomposition of a Bayesian network B and the join-graph decomposition of its flat network flat(B). Note that for constraint networks, the edge labeling θ can be ignored.

Definition 17 (join-graph decomposition of the flat network) Given a join-graph decomposition D = left angle bracketJG, χ, ψ, θright angle bracket of a Bayesian network B, the join-graph decomposition Dflat = left angle bracketJG, χ, ψflatright angle bracket of the flat constraint network flat(B) has the same underlying graph structure JG = (V, E) as D, the same variable-labeling of the clusters χ, and the mapping ψflat maps each cluster to relations corresponding to CPTs, namely Ri [set membership] ψflat(v) iff CPT pi [set membership] (v).

The distributed arc-consistency algorithm of Definition 16 can be applied to the join-graph decomposition of the flat network. In this case, the nodes that exchange messages are the clusters (namely the elements of the set v of JG). The domain of a cluster of v is the set of tuples of the join of the original relations in the cluster (namely the domain of cluster u is Rψflat(u)R). The constraints are binary, and involve clusters of v that are neighbors. For two clusters u and v, their corresponding values tu and tv (which are tuples representing full assignments to the variables in the cluster) belong to the relation Ruv (i.e., (tu, tv) [set membership] Ru,v) if the projections over the separator (or labeling θ) between u and v are identical, namely πθ((u,v))tu = πθ((u,v))tv.

We define below the algorithm relational distributed arc-consistency (RDAC), that applies distributed arc-consistency to any join-graph decomposition of a constraint network. We call it relational to emphasize that the nodes exchanging messages are in fact relations over the original problem variables, rather than simple variables as is the case for arc-consistency algorithms.

Definition 18 (relational distributed arc-consistency algorithm: RDAC over a join-graph) Given a join-graph decomposition of a constraint network, let Ri and Rj be the relations of two clusters (Ri and Rj are the joins of the respective constraints in each cluster), having the scopes Si and Sj, such that SiSj. The message Ri sends to Rj denoted h(i,j) is defined by:

h(i,j)πSiSj(Ri)
(4)

where ne(i)={jSiSj} is the set of relations (clusters) that share a variable with Ri. Each cluster updates its current relation according to:

RiRi(kne(i)h(k,i))
(5)

Algorithm RDAC iterates until there is no change.

Equations 4 and 5 can be combined, just like in Equation 3. The message that Ri sends to Rj becomes:

h(i,j)πSiSj(Ri(kne(i)h(k,i)))
(6)

To establish the correspondence with IJGP, we define the algorithm IJGP-RDAC that applies RDAC in the same order of computation (schedule of processing) as IJGP.

Definition 19 (IJGP-RDAC algorithm) Given the Bayesian network B=X,D,G,P, let Dflat = left angle bracketJG, χ, ψflat, θright angle bracket be any join-graph decomposition of the flat network flat(B). The algorithm IJGP-RDAC is applied to the decomposition Dflat of flat(B), and can be described as IJGP applied to D, with the following modifications:

  1. Instead of [product], we use [bowtie].
  2. Instead of ∑, we use π.
  3. At end end, we update the domains of variables by:
    DiDiπXi((vne(u)h(v,u))(Rψ(u)R))
    (7)
    where u is the cluster containing Xi.

Note that in algorithm IJGP-RDAC, we could first merge all constraints in each cluster u into a single constraint Ru=Rψ(u) R. From our construction, IJGP-RDAC enforces arc-consistency over the join-graph decomposition of the flat network. When the join-graph Dflat is a join-tree, IJGP-RDAC solves the problem namely it finds all the solutions of the constraint network.

Proposition 5 Given the join-graph decomposition Dflat = left angle bracketJG, χ, ψflat, θright angle bracket, JG = (V, E), of the flat constraint network flat(B), corresponding to a given join-graph decomposition D of a Bayesian network B=X,D,G,P, the algorithm IJGP-RDAC applied to Dflat enforces arc-consistency over the join-graph Dflat.

Proof: IJGP-RDAC applied to the join-graph decomposition Dflat = left angle bracketJG, χ, ψflat, θright angle bracket, JG = (V, E), is equivalent to applying RDAC of Definition 18 to a constraint network that has vertices V as its variables and {Rψ(u)RuV} as its relations. □

Following the properties of convergence of arc-consistency, we can show that:

Proposition 6 Algorithm IJGP-RDAC converges in O(m · r) iterations, where m is the number of edges in the join-graph and r is the maximum size of a separator Dsep(u,v) between two clusters.

Proof: This follows from the fact messages (which are relations) between clusters in IJGP-RDAC change monotonically, as tuples are only successively removed from relations on separators. Since the size of each relation on a separator is bounded by r and there are m edges, no more than O(m · r) iterations will be needed. □

In the following we will establish an equivalence between IJGP and IJGP-RDAC in terms of zero probabilities.

Proposition 7 When IJGP and IJGP-RDAC are applied in the same order of computation, the messages computed by IJGP are identical to those computed by IJGP-RDAC in terms of zero / nonzero probabilities. That is, h(u,v)(x) ≠ 0 in IJGP iff x [set membership] h(u,v) in IJGP-RDAC.

Proof: The proof is by induction. The base case is trivially true since messages h in IJGP are initialized to a uniform distribution and messages h in IJGP-RDAC are initialized to complete relations.

The induction step. Suppose that h(u,v)IJGP is the message sent from u to v by IJGP. We will show that if h(u,v)IJGP(x)0, then xh(u,v)IJGPRDAC where h(u,v)IJGPRDAC is the message sent by IJGP-RDAC from u to v. Assume that the claim holds for all messages received by u from its neighbors. Let f [set membership] clusterv(u) in IJGP and Rf be the corresponding relation in IJGP-RDAC, and t be an assignment of values to variables in elim(u, v). We have h(u,v)IJGP(x)0elim(u,v)ff(x)0t,ff(x,t)0t,f,f(x,t)0t,f,πscope(Rf)(x,t)Rft,πelim(u,v)(Rfπscope(Rf)(x,t))h(u,v)IJGPRDACxh(u,v)IJGPRDAC. □

Next we will show that IJGP computing marginal probability P(Xi = xi) = 0 is equivalent to IJGP-RDAC removing xi from the domain of variable Xi.

Proposition 8 IJGP computes P(Xi = xi) = 0 iff IJGP-RDAC decides that xi [set membership] Di.

Proof: According to Proposition 7 messages computed by IJGP and IJGP-RDAC are identical in terms of zero probabilities. Let f [set membership] cluster(u) in IJGP and Rf be the corresponding relation in IJGP-RDAC, and t be an assignment of values to variables in χ(u)\Xi. We will show that when IJGP computes P(Xi = xi) = 0 (upon convergence), then IJGP-RDAC computes xi [set membership] Di. We have P(Xi=xi)=X=Xiff(xi)=0t,ff(xi,t)=0t,f,f(xi,t)=0t,Rf,πxscope(Rf)(xi,t)Rft,(xi,t)(RfRf(xi,t))xiDiπXi(RfRf(xi,t))xiDi. Since arc-consistency is sound, so is the decision of zero probabilities. □

Next we will show that P(Xi = xi) = 0 computed by IJGP is sound.

Theorem 7 Whenever IJGP finds P(Xi = xi) = 0, then the probability P(Xi) expressed by the Bayesian network conditioned on the evidence is 0 as well.

Proof: According to Proposition 8, whenever IJGP finds P(Xi = xi) = 0, the value xi is removed from the domain Di by IJGP-RDAC, therefore value xi [set membership] Di is a no-good of the network flat(B), and from Theorem 6 it follows that PB(Xi=xi)=0. □

In the following we will show that the time it takes IJGP to find all P(Xi = xi) = 0 is bounded.

Proposition 9 IJGP finds all P(Xi = xi) = 0 in finite time, that is, there exists a number k, such that no P(Xi = xi) = 0 will be found after k iterations.

Proof: This follows from the fact that the number of iterations it takes for IJGP to compute P(Xi = xi) = 0 is exactly the same number of iterations IJGP-RDAC takes to remove xi from the domain Di (Proposition 7 and Proposition 8), and the fact the IJGP-RDAC runtime is bounded (Proposition 6). □

Previous results also imply that IJGP is monotonic with respect to zeros.

Proposition 10 Whenever IJGP finds P(Xi = xi) = 0, it stays 0 during all subsequent iterations.

Proof: Since we know that relations in IJGP-RDAC are monotonically decreasing as the algorithm progresses, it follows from the equivalence of IJGP-RDAC and IJGP (Proposition 7) that IJGP is monotonic with respect to zeros. □

4.4.2 A Finite Precision Problem

On finite precision machines there is the danger that an underflow can be interpreted as a zero value. We provide here a warning that an implementation of belief propagation should not allow the creation of zero values by underflow. We show an example in Figure 20 where IBP's messages converge in the limit (i.e., in an infinite number of iterations), but they do not stabilize in any finite number of iterations. If all the nodes Hk are set to value 1, the belief for any of the Xi variables as a function of iteration is given in the table in Figure 20. After about 300 iterations, the finite precision of our computer is not able to represent the value for Bel(Xi = 3), and this appears to be zero, yielding the final updated belief (5, 5, 0), when in fact the true updated belief should be (0, 0, 1). Notice that (5, 5, 0) cannot be regarded as a legitimate fixed point for IBP. Namely, if we would initialize IBP with the values (5, 5, 0), then the algorithm would maintain them, appearing to have a fixed point, but initializing IBP with zero values cannot be expected to be correct. When we initialize with zeros we forcibly introduce determinism in the model, and IBP will always maintain it afterwards.

Figure 20
Example of a finite precision problem.

However, this example does not contradict our theory because, mathematically, Bel(Xi = 3) never becomes a true zero, and IBP never reaches a quiescent state. The example shows that a close to zero belief network can be arbitrarily inaccurate. In this case the inaccuracy seems to be due to the initial prior belief which are so different from the posterior ones.

4.4.3 Accuracy of IBP Across Belief Distribution

We present an empirical evaluation of the accuracy of IBP's prediction for the range of belief distribution from 0 to 1. These results also extend to IJGP. In the previous section, we proved that zero values inferred by IBP are correct, and we wanted to test the hypothesis that this property extends to small beliefs (namely, that are very close to zero). That is, if IBP infers a posterior belief close to zero, then it is likely to be correct. The results presented in this paper seem to support the hypothesis, however new experiments by Dechter et al. (2010) show that it is not true in general. We do not have yet a good characterization of the cases when the hypothesis is confirmed.

To test this hypothesis, we computed the absolute error of IBP per intervals of [0, 1]. For a given interval [a, b], where 0 ≤ a < b ≤ 1, we use measures inspired from information retrieval: Recall Absolute Error and Precision Absolute Error.

Recall is the absolute error averaged over all the exact posterior beliefs that fall into the interval [a, b]. For Precision, the average is taken over all the approximate posterior belief values computed by IBP to be in the interval [a, b]. Intuitively, Recall([a,b]) indicates how far the belief computed by IBP is from the exact, when the exact is in [a, b]; Precision([a,b]) indicates how far the exact is from IBP's prediction, when the value computed by IBP is in [a, b].

Our experiments show that the two measures are strongly correlated. We also show the histograms of distribution of belief for each interval, for the exact and for IBP, which are also strongly correlated. The results are given in Figures 21 and and22.22. The left Y axis corresponds to the histograms (the bars), the right Y axis corresponds to the absolute error (the lines).

Figure 21
Coding, N=200, 1000 instances, w*=15.
Figure 22
10×10 grids, 100 instances, w*=15.

We present results for two classes of problems: coding networks and grid network. All problems have binary variables, so the graphs are symmetric about 0.5 and we only show the interval [0, 0.5]. The number of variables, number of iterations and induced width w* are reported for each graph.

Coding networks

IBP is famously known to have impressive performance on coding networks. We tested on linear block codes, with 50 nodes per layer and 3 parent nodes. Figure 21 shows the results for three different values of channel noise: 0.2, 0.4 and 0.6. For noise 0.2, all the beliefs computed by IBP are extreme. The Recall and Precision are very small, of the order of 10−11. So, in this case, all the beliefs are very small (ε small) and IBP is able to infer them correctly, resulting in almost perfect accuracy (IBP is indeed perfect in this case for the bit error rate). When the noise is increased, the Recall and Precision tend to get closer to a bell shape, indicating higher error for values close to 0.5 and smaller error for extreme values. The histograms also show that less belief values are extreme as the noise is increased, so all these factors account for an overall decrease in accuracy as the channel noise increases. These networks are examples with a large number of ε-small probabilities and IBP is able to infer them correctly (absolute error is small).

Grid networks

We present results for grid networks in Figure 22. Contrary to the case of coding networks, the histograms show higher concentration of beliefs around 0.5. However, the accuracy is still very good for beliefs close to zero. The absolute error peaks close to 0 and maintains a plateau, as evidence is increased, indicating less accuracy for IBP.

5. Experimental Evaluation

As we anticipated in the summary of Section 3, and as can be clearly seen now by the structuring of a bounded join-graph, there is a close relationship between the mini-clustering algorithm MC(i) and IJGP(i). In particular, one iteration of IJGP(i) is similar to MC(i). MC sends messages up and down along the clusters that form a set of trees. IJGP has additional connections that allow more interaction between the mini-clusters of the same cluster. Since this is a cyclic structure, iterating is facilitated, with its virtues and drawbacks.s

In our evaluation of IJGP(i), we focus on two different aspects: (a) the sensitivity of parametric IJGP(i) to its i-bound and to the number of iterations; (b) a comparison of IJGP(i) with publicly available state-of-the-art approximation schemes.

5.1 Effect of i-bound and Number of Iterations

We tested the performance of IJGP(i) on random networks, on M-by-M grids, on the two benchmark CPCS files with 54 and 360 variables, respectively and on coding networks. On each type of networks, we ran IBP, MC(i) and IJGP(i), while giving IBP and IJGP(i) the same number of iterations.

We use the partitioning method described in Section 4.3 to construct a join-graph. To determine the order of message computation, we recursively pick an edge (u,v), such that node u has the fewest incoming messages missing.

For each network except coding, we compute the exact solution and compare the accuracy using the absolute and relative error, as before, as well as the KL (Kullback-Leibler) distance - Pexact(X = a) · log(Pexact(X = a)Papproximation(X = a)) averaged over all values, all variables and all problems. For coding networks we report the Bit Error Rate (BER) computed as described in Section 3.2. We also report the time taken by each algorithm.

The random networks were generated using parameters (N,K,C,P), where N is the number of variables, K is their domain size, C is the number of conditional probability tables (CPTs) and P is the number of parents in each CPT. Parents in each CPT are picked randomly and each CPT is filled randomly. In grid networks, N is a square number and each CPT is filled randomly. In each problem class, we also tested different numbers of evidence variables. As before, the coding networks are from the class of linear block codes, where σ is the channel noise level. Note that we are limited to relatively small and sparse problem instances because our evaluation measures are based on comparing against exact figures.

Random networks

results for networks having N=50, K=2, C=45 and P=3 are given in Table 4 and in Figures 23 and and24.24. For IJGP(i) and MC(i) we report 3 different values of i-bound: 2, 5, 8. For IBP and IJGP(i) we report results for 3 different numbers of iterations: 1, 5, 10. We report results for 3 different numbers of evidence: 0, 5, 10. From Table 4 and Figure 23a we see that IJGP(i) is always better than IBP (except when i=2 and number of iterations is 1), sometimes by an order of magnitude, in terms of absolute error, relative error and KL distance. IBP rarely changes after 5 iterations, whereas IJGP(i)'s solution can be improved with more iterations (up to 15-20). As theory predicted, the accuracy of IJGP(i) for one iteration is about the same as that of MC(i). But IJGP(i) improves as the number of iterations increases, and is eventually better than MC(i) by as much as an order of magnitude, although it clearly takes more time, especially when the i-bound is large.

Figure 23
Random networks: KL distance.
Figure 24
Random networks: Time.
Table 4
Random networks: N=50, K=2, C=45, P=3, 100 instances, w*=16.

Figure 23a shows a comparison of all algorithms with different numbers of iterations, using the KL distance. Because the network structure changes with different i-bounds, we do not necessarily see monotonic improvement of IJGP with i-bound for a given number of iterations (as is the case with MC). Figure 23b shows how IJGP converges with more iterations to a smaller KL distance than IBP. As expected, the time taken by IJGP (and MC) varies exponentially with the i-bound (see Figure 24).

Grid networks

results with networks of N=81, K=2, 100 instances are very similar to those of random networks. They are reported in Table 5 and in Figure 25, where we can see the impact of having evidence (0 and 5 evidence variables) on the algorithms. IJGP at convergence gives the best performance in both cases, while IBP's performance deteriorates with more evidence and is surpassed by MC with i-bound 5 or larger.

Figure 25
Grid 9×9: KL distance.
Table 5
9×9 grid, K=2, 100 instances, w*=12.

CPCS networks

results with CPCS54 and CPCS360 are given in Table 6 and Figure 26, and are even more pronounced than those of random and grid networks. When evidence is added, IJGP(i) is more accurate than MC(i), which is more accurate than IBP, as can be seen in Figure 26a.

Figure 26
CPCS360: KL distance.
Table 6
CPCS54 50 instances, w*=15; CPCS360 10 instances, w*=20.

Coding networks

results are given in Table 7. We tested on large networks of 400 variables, with treewidth w*=43, with IJGP and IBP set to run 30 iterations (this is more than enough to ensure convergence). IBP is known to be very accurate for this class of problems and it is indeed better than MC. However we notice that IJGP converges to slightly smaller BER than IBP even for small values of the i-bound. Both the coding network and CPCS360 show the scalability of IJGP for large size problems. Notice that here the anytime behavior of IJGP is not clear.

Table 7
Coding networks: N=400, P=4, 500 instances, 30 iterations, w*=43.

In summary, we see that IJGP is almost always superior to both IBP and MC(i) and is sometimes more accurate by several orders of magnitude. One should note that IBP cannot be improved with more time, while MC(i) requires a large i-bound for many hard and large networks to achieve reasonable accuracy. There is no question that the iterative application of IJGP is instrumental to its success. In fact, IJGP(2) in isolation appears to be the most cost-effective variant.

5.2 Comparing IJGP with Other Algorithms

In this section we provide a comparison of IJGP with state-of-the-art publicly available schemes. The comparison is based on a recent evaluation of algorithms performed at the Uncertainty in AI 2008 conference4. We will present results on solving the belief updating task (also called the task of computing posterior node marginals - MAR). We first give a brief overview of the schemes that we experimented and compared with.

1. EDBP - Edge Deletion for Belief Propagation

EDBP (Choi & Darwiche, 2006a, 2006b) is an approximation algorithm for Belief Updating. It solves exactly a simplified version of the original problem, obtained by deleting some of the edges of the problem graph. Edges to be deleted are selected based on two criteria: quality of approximation and complexity of computation (tree-width reduction). Information loss from lost dependencies is compensated for by introducing auxiliary network parameters. This method corresponds to Iterative Belief Propagation (IBP) when enough edges are deleted to yield a poly-tree, and corresponds to generalized BP otherwise.

2. TLSBP - A truncated Loop series Belief propagation algorithm

TLSBP is based on the loop series expansion formula of Chertkov and Chernyak (2006) which specifies a series of terms that need to be added to the solution output by BP so that the exact solution can be recovered. This series is basically a sum over all so-called generalized loops in the graph. Unfortunately, because the number of these generalized loops can be prohibitively large, the series is of little value. The idea in TLSBP is to truncate the series by decomposing all generalized loops into simple and smaller loops, thus limiting the number of loops to be summed. In our evaluation, we used an implementation of TLSBP available from the work of Gomez, Mooji, and Kappen (2007). The implementation can handle binary networks only.

3. EPIS - Evidence Pre-propagation Importance Sampling

EPIS (Yuan & Druzdzel, 2003) is an importance sampling algorithm for Belief Updating. It is well known that sampling algorithms perform poorly when presented with unlikely evidence. However, when samples are weighted by an importance function, good approximation can be obtained. This algorithm computes an approximate importance function using loopy belief propagation and ε-cutoff heuristic. We used an implementation of EPIS available from the authors. The implementation works on Bayesian networks only.

4. IJGP - Iterative Join-Graph Propagation

In the evaluation, IJGP(i) was first run with i=2, until convergence, then with i=3, until convergence, etc. until i= treewidth (when i-bound=treewidth, the join-graph becomes a join-tree and IJGP becomes exact). As preprocessing, the algorithm performed SAT-based variable domain pruning by converting zero probabilities in the problem to a SAT problem and performing singleton-consistency enforcement. Because the problem size may reduce substantially, in some cases, this preprocessing step may have a significant impact on the time-complexity of IJGP, amortized over the increasing i-bound. However, for a given i-bound, this step improves the accuracy of IJGP only marginally.

5. SampleSearch

SampleSearch (Gogate & Dechter, 2007) is a specialized importance sampling scheme for graphical models that contain zero probabilities in their CPTs. On such graphical models, importance sampling suffers from the rejection problem in that it generates a large number of samples which have zero weight. SampleSearch circumvents the rejection problem by sampling from the backtrack-free search space in which every assignment (sample) is guaranteed to have non-zero weight. The backtrack-free search space is constructed on the fly by interleaving sampling with backtracking style search. Namely, when a sample is supposed to be rejected because its weight is zero, the algorithm continues instead with systematic backtracking search, until a non zero weight sample is found. For the evaluation version, the importance distribution of SampleSearch was constructed from the output of IJGP with i-bound of 3. For more information on how the importance distribution is constructed from the output of IJGP, see the work by Gogate (2009).

The evaluation was conducted on the following benchmarks (see footnote 4 for details):

  1. UAI06-MPE - from UAI-06, 57 instances, Bayesian networks (40 instances were used).
  2. UAI06-PE - from UAI-06, 78 instances, Bayesian networks (58 instances were used).
  3. Relational Bayesian networks - constructed from the Primula tool, 251 instances, binary variables, large networks with large tree-width, but with high levels of determinism (30 instances were used).
  4. Linkage networks - 22 instances, tree-width 20-35, Markov networks (5 instances were used).
  5. Grids - from 12×12 to 50×50, 320 instances, treewidth 12-50.
  6. BN2O networks - Two-layer Noisy-OR Bayesian networks, 18 instances, binary variables, up to 55 variables, treewidth 24-27.
  7. WCSPs - Weighted CSPs, 97 instances, Markov networks (18 instances were used).
  8. Promedas - real-world medical diagnosis, 238 instances, tree-width 1-60, Markov networks (46 instances were used).

Table 8 shows the scope of our experimental study. A √ indicates that the solver was able to handle the benchmark type and therefore evaluated on it while a lack of a √ indicates otherwise.

Table 8
Scope of our experimental study.

We measure the performance of the algorithms in terms of a KL-distance based score. Formally, the score of a solver on a problem instance is equal to 10avgkld where avgkld is the average KL distance between the exact marginal (which was computed using the UCLA Ace solver, see Chavira & Darwiche, 2008) and the approximate marginal output by the solver. If a solver does not output a solution, we consider its KL-distance to be ∞. A score lies between 0 and 1, with 1 indicating that the solver outputs exact solution while 0 indicating that the solver either does not output a solution or has infinite average KL distance. Figure 27 shows the score as a function of KL distance.

Figure 27
Score as a function of KL distance.

In Figures 28--3535 we report the results of experiments with each of the problem sets. Each solver has a timeout of 20 minutes on each problem instance; when solving a problem, each solver periodically outputs the best solution found so far. Using this, we can compute, for each solver, at any point in time, the total sum of its scores over all problem instances in a particular set, called SumScore(t). On the horizontal axis, we have the time and on the vertical axis, the SumScore(t). The higher the curve of a solver is, the better (the higher the score).

Figure 28
Results on UAI-MPE networks. TLSBP is not plotted because it cannot handle UAI-MPE benchmarks.
Figure 35
Results on Promedas networks. EPIS is not plotted because it cannot handle Promedas benchmarks, which are Markov networks.

In summary, we see that IJGP shows the best performance on the first four classes of networks (UAI-MPE, UAI-PE, Relational and Linkage), it is tied with other algorithms on two classes (Grid and BN2O), and is surpassed by EDBP on the last two classes (WCSPs and Promedas). EPIS and SampleSearch, which are importance sampling schemes, are often inferior to IJGP and EDBP. In theory, the accuracy of these importance sampling schemes should improve with time. However, the rate of improvement is often unknown in practice. On the hard benchmarks that we evaluated on, we found that this rate is quite small and therefore the improvement cannot be discerned from the Figures. We discuss the results in detail below.

As mentioned earlier, TLSBP works only on binary networks (i.e., two variables per function) and therefore it was not evaluated on WCSPs, Linkage, UAI06-MPE and UAI06-PE benchmarks.

The UAI-MPE and UAI-PE instances were used in the UAI 2006 evaluation of exact solvers (for details see the report by Bilmes & Dechter, 2006). Exact marginals are available on 40 UAI-MPE instances and 58 UAI-PE instances. The results for UAI-MPE and UAI-PE instances are shown in Figures 28 and and2929 respectively. IJGP is the best performing scheme on both benchmark sets reaching a SumScore very close to the maximum possible value in both cases after about 2 minutes of CPU time. EDBP and SampleSearch are second best in both cases.

Figure 29
Results on UAI-PE networks. TLSBP is not plotted because it cannot handle UAI-PE benchmarks.

Relational network instances are generated by grounding the relational Bayesian networks using the Primula tool (Chavira, Darwiche, & Jaeger, 2006). Exact marginals are available only on 30 out of the submitted 251 instances. From Figure 30, we observe that IJGP's SumScore steadily increases with time and reaches a value very close to the maximum possible score of 30 after about 16 minutes of CPU time. SampleSearch is the second best performing scheme. EDBP, TLSBP and EPIS perform quite poorly on these instances reaching the SumScore of 10, 13 and 13 respectively after 20 minutes of CPU time.

Figure 30
Results on relational networks.

The Linkage instances are generated by converting linkage analysis data into a Markov network using the Superlink tool (Fishelson & Geiger, 2003). Exact marginals are available only on 5 out of the 22 instances. The results are shown in Figure 31. After about one minute of CPU time, IJGP's SumScore is close to 5 which remains steady thereafter while EDBP only reaches a SumScore of 2 in 20 minutes. SampleSearch is the second best performing scheme while EDBP is third best.

Figure 31
Results on Linkage networks. EPIS and TLSBP are not plotted because they cannot handle Linkage networks.

The results on Grid networks are shown in Figure 32. The sink node of the grid is the evidence node. The deterministic ratio p is a parameter specifying the fraction of nodes that are deterministic, that is, whose values are determined given the values of their parents. The evaluation benchmark set consists of 30 instances having p = 50%,75% and 90% with exact marginals available on 27 instances only. EPIS, IJGP, SampleSearch and EDBP are in a close tie on this network, while TLSBP has the lowest performance. While hard to see, EPIS is just slightly the best performing scheme, IJGP is the second best followed by SampleSearch and EDBP. On this instances IJGP's SumScore increases steadily with time.

Figure 32
Results on Grid networks.

The results on BN2O instances appear in Figure 33. This is again a very close tie, in this case of all five algorithms. IJGP has a minuscule decrease of SumScore with time from 17.85 to 17.7. Although in general an improvement in accuracy is expected for IJGP with higher i-bound, it is not guaranteed, and this is an example when it does not happen. The other solvers reach the maximum possible SumScore of 18 (or very close to it) after about 6 minutes of CPU time.

Figure 33
Results on BN2O networks. All solvers except IJGP quickly converge to the maximum possible score of 18 and are therefore indistinguishable in the Figure.

The WCSP benchmark set has 97 instances. However we used only the 18 instances for which exact marginals are available. Therefore the maximum SumScore that an algorithm can reach is 18. The results are shown in Figure 34. EDBP reaches a SumScore of 17 after almost 3 minutes of CPU time while IJGP reaches a SumScore of 13 after about 3 minutes. The SumScores of both IJGP and EDBP remain unchanged in the interval from 3 to 20 minutes. After looking at the raw results, we found that IJGP's score was zero on 5 instances out of 18. This was because the singleton consistency component implemented via the SAT solver did not finish in 20 minutes on these instances. Although the singleton consistency step generally helps to reduce the practical time complexity of IJGP on most instances, it adversely affects it on these WCSP instances.

Figure 34
Results on WCSPs networks. EPIS and TLSBP are not plotted because they cannot handle WCSPs.

The Promedas instances are Noisy-OR binary Bayesian networks (Pearl, 1988). These instances are characterized by extreme marginals. Namely, for a given variable, the marginals are of the form (1 − ε, ε) where ε is a very small positive constant. Exact marginals are available only on 46 out of the submitted 238 instances. On these structured problems (see Figure 35), we see that EDBP is the best performing scheme reaching a SumScore very close to 46 after about 7 minutes of CPU time while TLSBP and IJGP are able to reach a SumScore of about 40 in 20 minutes.

6. Related Work

There are numerous lines of research devoted to the study of belief propagation algorithms, or message-passing schemes in general. Throughout the paper we have mentioned and compared with other related work, especially in the experimental evaluation section. We give here a short summary of the developments in belief propagation and present some related schemes that were not mentioned before. For additional information see also the recent review by Koller (2010).

About a decade ago, Iterative Belief Propagation (Pearl, 1988) received a lot of interest from the information theory and coding community. It was realized that two of the best error-correcting decoding algorithms were actually performing belief propagation in networks with cycles. The LDPC code (low-density parity-check) introduced long time ago by Gallager (1963), is now considered one of the most powerful and promising schemes that often performs impressively close to Shannon's limit. Turbo codes (Berrou, Glavieux, & Thitimajshima, 1993) are also very efficient in practice and can be understood as an instance of belief propagation (McEliece et al., 1998).

A considerable progress towards understanding the behavior and performance of BP was made through concepts from statistical physics. Yedidia et al. (2001) showed that IBP is strongly related to the Bethe-Peierls approximation of variational (Gibbs) free energy in factor graphs. The Bethe approximation is a particular case of the more general Kikuchi (1951) approximation. Generalized Belief Propagation (Yedidia et al., 2005) is an application of the Kikuchi approximation that works with clusters of variables, on structures called region graphs. Another algorithm that employs the region-based approach is Cluster Variation Method (CVM) (Pelizzola, 2005). These algorithms focus on selecting a good region-graph structure to account for the over-counting (and over-over-counting, etc.) of evidence. We view generalized belief propagation more broadly as any belief propagation over nodes which are clusters of functions. Within this view IJGP, and GBP as defined by Yedidia et al. (2001), as well as CVM, are special realizations of generalized belief propagation.

Belief Propagation on Partially Ordered Sets (PBP) (McEliece & Yildirim, 2002) is also a generalized form of Belief Propagation that minimizes the Bethe-Kikuchi variational free energy, and that works as a message-passing algorithm on data structures called partially ordered sets, which has junction graphs and factor graphs as examples. There is one-to-one correspondence between fixed points of PBP and stationary points of the free energy. PBP includes as special cases many other variants of belief propagation. As we noted before, IJGP is basically the same as PBP.

Expectation Propagation (EP) (Minka, 2001) is a an iterative approximation algorithm for computing posterior belief in Bayesian networks. It combines assumed-density filtering (ADF), an extension of the Kalman filter (used to approximate belief states using expectations, such as mean and variance), with IBP, and iterates until these expectations are consistent throughout the network. TreeEP (Minka & Qi, 2004) deals with cyclic problem by reducing the problem graph to a tree subgraph and approximating the remaining edges. The relationship between EP and GBP is discussed by Welling, Minka, and Teh (2005).

Survey Propagation (SP) (Braunstein et al., 2005) solves hard satisfiable (SAT) problems using a message-passing algorithm on a factor graph consisting of variable and clause nodes. SP is inspired by an algorithm called Warning Propagation (WP) and by BP. WP can determine if a tree-problem is SAT, and if it is then it can provide a solution. BP can compute the number of satisfying assignments for a tree-problem, as well as the fraction of the assignments where a variable is true. These two algorithms are used as heuristics to define the SP algorithm, that is shown to be more efficient than either of them on arbitrary networks. SP is still a heuristic algorithm with no guarantee of convergence. SP was inspired by the new concept of “cavity method” in statistical physics, and can be interpreted as BP where variables can not only take the values true or false, but also the extra “don’t care” value. For a more detailed treatment see the book by Mézard and Montanari (2009).

7. Conclusion

In this paper we investigated a family of approximation algorithms for Bayesian networks, that could also be extended to general graphical models. We started with bounded inference algorithms and proposed Mini-Clustering (MC) scheme as a generalization of Mini-Buckets to arbitrary tree decompositions. Its power lies in being an anytime algorithm governed by a user adjustable i-bound parameter. MC can start with small i-bound and keep increasing it as long as it is given more time, and its accuracy usually improves with more time. If enough time is given to it, it is guaranteed to become exact. One of its virtues is that it can also produce upper and lower bounds, a route not explored in this paper.

Inspired by the success of iterative belief propagation (IBP), we extended MC into an iterative message-passing algorithm called Iterative Join-Graph Propagation (IJGP). IJGP operates on general join-graphs that can contain cycles, but it is sill governed by an i-bound parameter. Unlike IBP, IJGP is guaranteed to become exact if given enough time.

We also make connections with well understood consistency enforcing algorithms for constraint satisfaction, giving strong support for iterating messages, and giving insight into the performance of IJGP (IBP). We show that: (1) if a value of a variable is assessed as having zero-belief in any iteration of IJGP, then it remains a zero-belief in all subsequent iterations; (2) IJGP converges in a finite number of iterations relative to its set of zero-beliefs; and, most importantly (3) that the set of zero-beliefs decided by any of the iterative belief propagation methods is sound. Namely any zero-belief determined by IJGP corresponds to a true zero conditional probability relative to the given probability distribution expressed by the Bayesian network.

Our experimental evaluation of IJGP, IBP and MC is provided, and IJGP emerges as one of the most powerful approximate algorithms for belief updating in Bayesian networks.

Figure 16
Algorithm Join-Graph Structuring(i).

Acknowledgements

This work was supported in part by the NSF grant IIS-0713118 and NIH grant R01-HG004175-02.

Footnotes

1The combination operator can also be defined axiomatically (Shenoy, 1992).

2Wexler and Meek (2008) compared the upper/lower bounding properties of the mini-bucket on computing probability of evidence. Rollon and Dechter (2010) further investigated heuristic schemes for mini-bucket partitioning.

3Note that a node may be associated with an empty set of CPTs.

4Complete results are available at http://graphmod.ics.uci.edu/uai08/Evaluation/Report.

Contributor Information

Robert Mateescu, Microsoft Research 7 J J Thomson Avenue Cambridge CB3 0FB, UK ; MOC.TFOSORCIM@SEETAMOR.

Kalev Kask, Donald Bren School of Information and Computer Science University of California Irvine Irvine, CA 92697, USA ; UDE.ICU.SCI@KSAKK.

Vibhav Gogate, Computer Science & Engineering University of Washington, Seattle Seattle, WA 98195, USA ; UDE.NOTGNIHSAW.SC@ETAGOGV..

Rina Dechter, Donald Bren School of Information and Computer Science University of California Irvine Irvine, CA 92697, USA ; UDE.ICU.SCI@RETHCED..

References

  • Arnborg SA. Efficient algorithms for combinatorial problems on graphs with bounded decomposability - a survey. BIT. 1985;25:2–23.
  • Bacchus F, Dalmao S, Pitassi T. Value elimination: Bayesian inference via backtracking search.. Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence (UAI'03).2003. pp. 20–28.
  • Berrou C, Glavieux A, Thitimajshima P. Near Shannon limit error-correcting coding: Turbo codes. In Proc. of the 1993 Int. Conference on Communications. 1993:1064–1070.
  • Bilmes J, Dechter R. Evaluation of probabilistic inference systems in UAI'06. 2006. http://ssli.ee.washington.edu/ bilmes/uai06InferenceEvaluation/
  • Braunstein A, Mézard M, Zecchina R. Survey propagation: An algorithm for satisfiability. Random Struct. Algorithms. 2005;27(2):201–226.
  • Chavira M, Darwiche A. On probabilistic inference by weighted model counting. Artificial Intelligence. 2008;172(6–7):772–799.
  • Chavira MD, Darwiche A, Jaeger M. Compiling relational bayesian networks for exact inference. International Journal of Approximate Reasoning. 2006;42(1-2):4–20.
  • Chertkov M, Chernyak VY. Loop series for discrete statistical models on graphs. Journal of Statistical Mechanics: Theory and Experiment. 2006:6009.
  • Choi A, Chavira M, Darwiche A. Node splitting: A scheme for generating upper bounds in bayesian networks.. Proceedings of the Twenty Third Conference on Uncertainty in Artificial Intelligence (UAI'07).2007. pp. 57–66.
  • Choi A, Darwiche A. An edge deletion semantics for belief propagation and its practical impact on approximation quality.. Proceedings of the The Twenty-First National Conference on Artificial Intelligence (AAAI'06).2006a. pp. 1107–1114.
  • Choi A, Darwiche A. A variational approach for approximating bayesian networks by edge deletion.. Proceedings of the Twenty Second Conference on Uncertainty in Artificial Intelligence (UAI'06).2006b. pp. 80–89.
  • Cooper GF. The computational complexity of probabistic inferences. Artificial Intelligence. 1990;42:393–405.
  • Dagum P, Luby M. Approximating probabilistic inference in bayesian belief networks is NP-hard. Artificial Intelligence. 1993;60(1):141–153.
  • Darwiche A. Recursive conditioning. Artificial Intelligence. 2001;125(1-2):5–41.
  • Dechter R. Constraint networks. Encyclopedia of Artificial Intelligence. 1992:276–285.
  • Dechter R. Bucket elimination: A unifying framework for probabilistic inference algorithms.. Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence (UAI'96).1996. pp. 211–219.
  • Dechter R. Bucket elimination: A unifying framework for reasoning. Artificial Intelligence. 1999;113:41–85.
  • Dechter R. Constraint Processing. Morgan Kaufmann Publishers; 2003.
  • Dechter R, Bidyuk B, Mateescu R, Rollon E. Dechter R, Geffner H, Halpern J, editors. The power of belief propagation: A constraint propagation perspective. Heuristics, Probabilities and Causality: A Tribute to Judea Pearl. 2010
  • Dechter R, Kask K, Larrosa J. A general scheme for multiple lower bound computation in constraint optimization.. Proceedings of the Seventh International Conference on Principles and Practice of Constraint Programming (CP'01).2001. pp. 346–360.
  • Dechter R, Kask K, Mateescu R. Iterative join-graph propagation.. Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI'02).2002. pp. 128–136.
  • Dechter R, Mateescu R. A simple insight into iterative belief propagation's success.. Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence (UAI'03).2003. pp. 175–183.
  • Dechter R, Mateescu R. AND/OR search spaces for graphical models. Artificial Intelligence. 2007;171(2-3):73–106.
  • Dechter R, Pearl J. Network-based heuristics for constraint satisfaction problems. Artificial Intelligence. 1987;34:1–38.
  • Dechter R, Pearl J. Tree clustering for constraint networks. Artificial Intelligence. 1989;38:353–366.
  • Dechter R, Rish I. A scheme for approximating probabilistic inference.. Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence (UAI'97).1997. pp. 132–141.
  • Dechter R, Rish I. Mini-buckets: A general scheme for approximating inference. Journal of ACM. 2003;50(2):107–153.
  • Fishelson M, Geiger D. Optimizing exact genetic linkage computations.. Proceedings of the Seventh Annual International Conference on Computational Biology (RECOMB'03).2003. pp. 114–121.
  • Gallager RG. Low-Density Parity-Check Codes. MIT Press; Cambridge, MA: 1963.
  • Gogate V, Dechter R. SampleSearch: A scheme that searches for consistent samples.. Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics (AISTATS'07).2007. pp. 147–154.
  • Gogate V. Ph.D. thesis. School of Information and Computer Sciences, University of California; Irvine: 2009. Sampling Algorithms for Probabilistic Graphical models with Determinism.
  • Gomez V, Mooji JM, Kappen HJ. Truncating the loop series expansion for belief propagation. Journal of Machine Learning. 2007;8:1987–2016.
  • Gottlob G, Leone N, Scarcello F. A comparison of structural CSP decomposition methods. Artificial Intelligence. 2000:243–282.
  • Jensen FV, Lauritzen SL, Olesen KG. Bayesian updating in causal probabilistic networks by local computation. Computational Statistics Quarterly. 1990;4:269–282.
  • Kask K. Ph.D. thesis. Information and Computer Science, University of California; Irvine: 2001. Approximation algorithms for graphical models.
  • Kask K, Dechter R. A general scheme for automatic search heuristics from specification dependencies. Artificial Intelligence. 2001;129(1-2):91–131.
  • Kask K, Dechter R, Larrosa J, Dechter A. Unifying cluster-tree decompositions for reasoning in graphical models. Artificial Intelligence. 2005;166(1-2):165–193.
  • Kikuchi R. A theory of cooperative phenomena. Phys. Rev. 1951;81(6):988–1003.
  • Koller D. Dechter R, Geffner H, Halpern J, editors. Belief propagation in loopy graphs. Heuristics, Probabilities and Causality: A Tribute to Judea Pearl. 2010
  • Larrosa J, Kask K, Dechter R. Up and down mini-bucket: a scheme for approximating combinatorial optimization tasks. Tech. rep. University of California Irvine.; 2001.
  • Lauritzen SL, Spiegelhalter DJ. Local computation with probabilities on graphical structures and their application to expert systems. (Series B).Journal of the Royal Statistical Society. 1988;50(2):157–224.
  • Mackworth AK. Consistency in networks of relations. Artificial Intelligence. 1977;8(1):99–118.
  • Mézard M, Montanari A. Information, Physics and Computation. Oxford University Press; 2009.
  • Mézard M, Parisi G, Zecchina R. Analytic and algorithmic solution of random satisfiability problems. Science. 2002;297:812–815. [PubMed]
  • Maier D. The Theory of Relational Databases. Computer Science Press; Rockville, MD: 1983.
  • Mateescu R, Dechter R, Kask K. Tree approximation for belief updating.. Proceedings of the Eighteenth National Conference on Artificial Intelligence (AAAI'02).2002. pp. 553–559.
  • McEliece RJ, MacKay DJC, Cheng JF. Turbo decoding as an instance of Pearl's belief propagation algorithm. IEEE J. Selected Areas in Communication. 1998;16(2):140–152.
  • McEliece RJ, Yildirim M. Belief propagation on partially ordered sets. Mathematical Systems Theory in Biology, Communications, Computation, and Finance. 2002:275–300.
  • Minka T. Expectation propagation for approximate bayesian inference.. Proceedings of the Seventeenth Annual Conference on Uncertainty in Artificial Intelligence (UAI'01).2001. pp. 362–369.
  • Minka T, Qi Y. Tree-structured approximations by expectation propagation. Advances in Neural Information Processing Systems 16 (NIPS'03) 2004
  • Pearl J. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann; 1988.
  • Pelizzola A. Cluster variation method in statistical physics and probabilistic graphical models. Journal of Physics A: Mathematical and General. 2005;38(33):R309–R339.
  • Rollon E, Dechter R. Evaluating partition strategies for mini-bucket elimination. The Eleventh International Symposium on Artificial Intelligence and Mathematics (ISAIM'10) 2010
  • Roth D. On the hardness of approximate reasoning. Artificial Intelligence. 1996;82(1-2):273–302.
  • Shafer GR, Shenoy PP. Probability propagation. Annals of Mathematics and Artificial Intelligence. 1990;2:327–352.
  • Shenoy PP. Valuation-based systems for bayesian decision analysis. Operations Research. 1992;40:463–484.
  • Welling M, Minka TP, Teh YW. Structured region graphs: Morphing EP into GBP.. Proceedings of the Twenty First Conference on Uncertainty in Artificial Intelligence (UAI'05).2005. pp. 609–614.
  • Wexler Y, Meek C. MAS: A multiplicative approximation scheme for probabilistic inference.. Proceedings of Advances in Neural Information Processing Systems 21 (NIPS'08).2008. pp. 1761–1768.
  • Yedidia JS, Freeman WT, Weiss Y. Generalized belief propagation. Tech. rep. TR2000-26, Mitsubishi Electric Research Laboratories. 2000
  • Yedidia JS, Freeman WT, Weiss Y. Generalized belief propagation. Advances in Neural Information Processing Systems 13 (NIPS'00) 2001:689–695.
  • Yedidia JS, Freeman WT, Weiss Y. Constructing free energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory. 2005;51:2282–2312.
  • Yuan C, Druzdzel MJ. An importance sampling algorithm based on evidence pre-propagation.. Proceedings of the Nineteenth Conference in Uncertainty in Artificial Intelligence (UAI'03).2003. pp. 624–631.
  • Zhang NL, Qi R, Poole D. A computational theory of decision networks. International Journal of Approximate Reasoning. 1994;11:83–158.