Evolutionary Intelligence
 
Evol Intell. 2010 August; 3(2): 79–101.
Published online 2010 July 21. doi: 10.1007/s12065-010-0040-1
PMCID: PMC2987561

Recombination operators and selection strategies for evolutionary Markov Chain Monte Carlo algorithms

Abstract

Markov Chain Monte Carlo (MCMC) methods are often used to sample from intractable target distributions. Some MCMC variants aim to improve the performance by running a population of MCMC chains. In this paper, we investigate the use of techniques from Evolutionary Computation (EC) to design population-based MCMC algorithms that exchange useful information between the individual chains. We investigate how one can ensure that the resulting class of algorithms, called Evolutionary MCMC (EMCMC), samples from the target distribution as expected from any MCMC algorithm. We analytically and experimentally show—using examples from discrete search spaces—that the proposed EMCMCs can outperform standard MCMCs by exploiting common partial structures between the more likely individual states. The MCMC chains in the population interact through recombination and selection. We analyze the required properties of recombination operators and acceptance (or selection) rules in EMCMCs. An important issue is how to preserve the detailed balance property which is a sufficient condition for an irreducible and aperiodic EMCMC to converge to a given target distribution. Transferring EC techniques to population-based MCMCs should be done with care. For instance, we prove that EMCMC algorithms with an elitist acceptance rule do not sample the target distribution correctly.

Keywords: Evolutionary Markov chain Monte Carlo, Detailed balance, Recombination, Acceptance rules

Introduction

Markov Chain Monte Carlo (MCMC) is a framework of algorithms for sampling from complicated distributions. The use of MCMC in Machine Learning has recently been advocated by [1]. Usually, a single MCMC is run until it converges to the stationary distribution. To improve their efficiency, some MCMC variants consist of a population of chains that interact by exchanging useful information and at the same time preserve the MCMC convergence characteristics at the population level. In this paper, we are particularly interested in techniques that use multiple interacting chains in parallel as opposed to a single chain.

The stochastic processes underlying Evolutionary Computation (EC) and MCMC algorithms are fundamentally similar: both are Markov chains with fixed transition matrices between individual states, for instance transition matrices given by mutation and recombination operators in EC and by perturbation operators in MCMC. Furthermore, both Metropolis-Hastings algorithms—a subclass of MCMCs—and EC algorithms have a selection step, the acceptance rule, to propagate good individuals to the next generation. There are also many differences, induced by the different scopes of these algorithms: EC is used for optimization whereas MCMC is used for sampling. Additionally, MCMC uses a single chain whereas EC algorithms use a population of interacting individuals. Motivated by the common points of these two classes of algorithms, we have previously discussed the Evolutionary MCMC (EMCMC) framework, which aims to improve the efficiency of standard MCMC algorithms [7, 8]. An EMCMC is a population-based MCMC that exchanges information between the individual chains such that, at the population level, it is still an MCMC.

In general, it is not straightforward to integrate interaction between chains, like recombination or selection, into population-based MCMCs while preserving convergence to the target distribution. To ease proving that EMCMCs converge to the stationary distribution, the individuals generated with recombinative operators (an alternation between mutation and recombination operators) should be all accepted or all rejected [8, 16] with a so-called coupled acceptance rule. Note the difference between this coupled acceptance and the popular selection strategies in EC: the coupled acceptance rule is selective at the family level (i.e., the set of children generated by a set of parents), whereas the selection strategies are selective at the individual level, that is, one of the children competes against one of its parents. Using the standard MH acceptance rule, where only one of the multiple children generated from multiple parents is accepted or rejected, is a straightforward alternative [27]. It is interesting to note that Mahfoud and Goldberg [17] also obtained good results for Simulated Annealing (SA) [14] algorithms where one child competes against one of the parents. However, such a recombinative EMCMC does not fit in the standard framework of Metropolis-Hastings algorithms. Some alternative solutions proposed previously restrict the proposal distributions that generate new individuals by generating only one child at a time from a family of parents [3, 5, 15, 23, 24]. For example, [15] proposed an EMCMC algorithm that uses a population-based univariate distribution to sample from likely Bayesian network structures. Other algorithms, for example some population-based adaptive MCMCs [9] and sequential Monte Carlo [6], relax the Markov property at the price of convergence properties that are harder to establish and algorithms that are harder for practitioners to use.

In this paper, we theoretically and experimentally study various recombination operators and their interaction with acceptance rules, resulting in EMCMCs with a required target distribution. We investigate the properties of several popular recombination operators from GAs (e.g., uniform recombination) when integrated in the EMCMC framework. We show that the individuals that interact in generating candidate individuals should also interact in the acceptance rule in order to sample from the target distribution. Acceptance rules that are directly derived from EC's selection strategies are more useful for optimization than for sampling: the sampled distribution is skewed compared with the target distribution, with the fit states of the search space amplified and the less fit states diminished.

We propose a general method that corrects the target distribution of a recombinative EMCMC that does not sample from the intended distribution. This technique simply treats the recombinative EMCMC as the proposal distribution, and the generated children are all accepted or all rejected with a coupled acceptance rule. In this way we postpone the acceptance or rejection of all children, with the hope that the recombinative EMCMC generates fit individuals, which increases the chance that children are accepted and, consequently, that the algorithm converges faster to the target distribution. The method also has theoretical value: it constructs a correction term by which the sampled distribution should be multiplied to transform it into the target distribution.

We compare in practice the performance of various recombinative and non-recombinative EMCMCs with standard and population-based MCMCs. When comparing (E)MCMCs we address three questions: (1) how useful are EMCMCs when compared with MCMCs, (2) how useful are the recombinative operators, and (3) what is the difference in performance between EMCMCs using the standard MH acceptance rule, which is selective at the individual level, and EMCMCs using the coupled acceptance rules. The recombinative operators chosen are the most popular operators in EC: discrete-space uniform recombination and uniform mutation. As a consequence, the theory and the practical examples are formulated for discrete-space (E)MCMCs; it is straightforward to extend these results to continuous-space (E)MCMCs.

In our first experiment we analytically compare the algorithms on a toy example, such that the exact performance of each algorithm is calculated from the transitions between all states of the (E)MCMCs. In the second experiment we calculate the Kullback–Leibler distance between the target distribution and the distribution output by an algorithm after a finite number of steps, on a binary quadratic programming (BQP) problem of relatively small size for which the target distribution can be computed exactly. The next experiment is on a larger BQP instance, where we can compare the performance of (E)MCMCs using only graphical (and less precise) tests. Note that BQP is related to a popular mathematical problem in statistical mechanics known as the Ising model [10]. The obtained results show that recombination improves the mixing of the EMCMC, especially when the standard MH acceptance rule is used with recombination.

Outline of the paper

Section 2 presents some basic knowledge of MCMC algorithms and introduces the notation used in the rest of the paper. For an in-depth study of MCMCs we refer the reader to [12]. In Sect. 3 the EMCMC framework is presented. In Sect. 4 we investigate several recombination operators and their desired properties for EMCMCs. Section 5 proposes and analyzes various MH acceptance rules and the properties of the resulting EMCMCs when combined with the recombinative operators. We also establish rules to design recombinative EMCMCs for sampling and optimization. In Sect. 6 we analytically investigate the discussed EMCMCs on a toy problem and experimentally test them on two BQP problem instances. Section 7 concludes and discusses the results of the paper.

Background: MCMC framework

MCMC is a general framework to generate samples X(t) from a probability distribution P(·) while exploring its so-called countable ℓ-dimensional state (or search) space E using a Markov chain. We assume the state space is compact. MCMC does not sample directly from P(·), but only requires that it can be evaluated within a multiplicative constant, $P(X) = \tilde{P}(X)/Z$, where Z is a normalization constant and $\tilde{P}(\cdot)$ is the unnormalized target distribution. A discrete-time Markov chain is a stochastic process $(X^{(0)}, X^{(1)}, \ldots)$ with the property that the probability distribution for the state $X^{(t)}$ given all previous values $(X^{(0)}, X^{(1)}, \ldots, X^{(t-1)})$ only depends on $X^{(t-1)}$. Mathematically, we can write

$$P(X^{(t)} \mid X^{(0)}, X^{(1)}, \ldots, X^{(t-1)}) = P(X^{(t)} \mid X^{(t-1)})$$

We call $K(X^{(t-1)}, X^{(t)})$ the transition matrix of the Markov chain. A homogeneous Markov chain, in addition, has a time-independent transition matrix. In the following we only consider homogeneous Markov chains, unless specified otherwise. For any starting point, a Markov chain with a finite state space converges to a unique invariant distribution if it is irreducible and aperiodic. A Markov chain is called irreducible if, and only if, every state can be reached from every other state in a finite number of steps. Aperiodicity excludes, for instance, that certain points can only be reached at even times.

A sufficient, but not necessary, condition to ensure that the given distribution P(·) is the stationary distribution is that it satisfies the detailed balance condition [1]. An MCMC satisfies the detailed balance condition if, and only if, the probability to move from X to Y multiplied by the probability to be in X is equal to the probability to move from Y to X multiplied by the probability to be in Y:

$$P(X)\,K(X, Y) = P(Y)\,K(Y, X)$$

Metropolis-Hastings algorithms

Many MCMC algorithms are Metropolis-Hastings (MH) algorithms [13, 18]. Since we cannot sample directly from the distribution P(·), MH algorithms use a simpler distribution $Q(\cdot \mid X)$, called the proposal distribution, to generate the next state of an MCMC chain. $Q(\cdot \mid X^{(t)})$ generates the candidate state Y from the current state $X^{(t)}$, and the new state Y is accepted with probability:

$$\alpha(X^{(t)}, Y) = \min\left\{1, \frac{P(Y)\,Q(X^{(t)} \mid Y)}{P(X^{(t)})\,Q(Y \mid X^{(t)})}\right\}$$

If the candidate state is accepted, the next state becomes $X^{(t+1)} = Y$. Otherwise, $X^{(t+1)} = X^{(t)}$. For finite search spaces, the transition probability $K(X^{(t)}, Y)$ for arriving in Y when the current state is $X^{(t)}$, where $X^{(t)} \neq Y$, is

$$K(X^{(t)}, Y) = Q(Y \mid X^{(t)})\,\alpha(X^{(t)}, Y)$$

The rejection probability is

$$K(X^{(t)}, X^{(t)}) = 1 - \sum_{Y \neq X^{(t)}} Q(Y \mid X^{(t)})\,\alpha(X^{(t)}, Y)$$

An MH algorithm is aperiodic, since the chain can remain in the same state with a probability greater than 0, and by construction it satisfies the detailed balance condition,

$$P(X)\,Q(Y \mid X)\,\alpha(X, Y) = P(Y)\,Q(X \mid Y)\,\alpha(Y, X)$$

If, in addition, the chain is irreducible, then it converges to the stationary distribution P(·). The rate of convergence depends on the relationship between the proposal distribution and the target distribution: the closer the proposal distribution is to the stationary distribution, the faster the chain converges. A popular Metropolis-Hastings algorithm is the Metropolis algorithm, where the proposal distribution is symmetrical, $Q(Y \mid X) = Q(X \mid Y)$, and the acceptance rule becomes

$$\alpha(X^{(t)}, Y) = \min\left\{1, \frac{P(Y)}{P(X^{(t)})}\right\}$$
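To make the acceptance rule concrete, here is a minimal sketch of the Metropolis algorithm on a toy discrete space; the target `p` and the cyclic random-walk proposal are our own illustrative assumptions, not taken from the paper.

```python
import random

def metropolis(p_tilde, propose, x0, steps, rng):
    """Metropolis sampler: `p_tilde` is the unnormalized target P~(.),
    `propose` a symmetric proposal that returns a candidate from x."""
    x, samples = x0, []
    for _ in range(steps):
        y = propose(x, rng)
        # Metropolis acceptance: alpha = min{1, P~(y)/P~(x)}
        if rng.random() < min(1.0, p_tilde(y) / p_tilde(x)):
            x = y
        samples.append(x)
    return samples

# toy target on {0,1,2,3}: P(x) proportional to x + 1
p = lambda x: x + 1.0
# symmetric random walk on the cycle 0-1-2-3-0
prop = lambda x, rng: (x + rng.choice([-1, 1])) % 4
chain = metropolis(p, prop, 0, 20000, random.Random(0))
freq = [chain.count(v) / len(chain) for v in range(4)]  # approximates [0.1, 0.2, 0.3, 0.4]
```

Because the proposal is symmetric, the ratio of proposal probabilities cancels and only the target ratio remains in the acceptance probability.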

Mutation

A popular and often-used set of irreducible proposal distributions for MH algorithms can be described by a mutation operator. We generically denote the proposal distributions resulting from mutation operators with Qm. We consider a state in the discrete space as a string of ℓ characters, $X = (X_1, X_2, \ldots, X_\ell)$. The h-th position in X is called the locus of $X_h$, where $1 \le h \le \ell$, and the value of X at locus h is called an allele. Each position (or locus) h in an individual X is instantiated with an allele $X_h \in E(X_h)$, where $E(X_h)$ is the multi-set of all possible values at that locus.

The uniform mutation operator randomly changes every value of each variable of the current state with a non-zero probability, called the mutation rate [8, 16, 17, 23]. The bigger the uniform mutation rate, the bigger the jump in the search space of the child state from the parent state. Qm denotes the uniform mutation proposal distribution. When the context is not ambiguous, we simply refer to it as mutation. The uniform mutation operator defines an irreducible, symmetric and stationary proposal distribution [8].

In the sequel, Km denotes the uniform mutation transition matrix, which proposes candidate individuals with Qm and accepts them with the MH acceptance rule. Km defines an irreducible MH algorithm which converges to its stationary distribution [8].
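As an illustration (our own sketch, not code from the paper), uniform mutation over binary strings and its proposal probability, which depends only on the Hamming distance and is therefore symmetric:

```python
import random

def uniform_mutation(x, pm, rng):
    """Qm: flip every bit of x independently with mutation rate pm."""
    return tuple(b ^ 1 if rng.random() < pm else b for b in x)

def qm_prob(x, y, pm):
    """Proposal probability Qm(y|x) = pm^d (1-pm)^(l-d), d = Hamming distance."""
    d = sum(a != b for a, b in zip(x, y))
    return pm ** d * (1 - pm) ** (len(x) - d)

x = (0, 1, 1, 0)
y = uniform_mutation(x, 0.2, random.Random(3))
# symmetry: Qm depends only on the Hamming distance, so Qm(y|x) == Qm(x|y)
assert qm_prob(x, y, 0.2) == qm_prob(y, x, 0.2)
```

A larger mutation rate pm shifts probability mass toward candidates far (in Hamming distance) from the current state, i.e., bigger jumps in the search space.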

Multiple independent chains (MICs)

When we talk about the performance of an MCMC, we refer to how well an MCMC is mixing or how “fast” it converges to the target distribution. We say that an MCMC is mixing “well” if it rapidly traverses the search space and, at the same time, accurately samples the target distribution. Note that the mixing concept in MCMC is not related to the mixing of building blocks in the EC literature.

In an attempt to improve the mixing behavior of MCMCs one could make use of multiple chains that run independently (MICs). The chains are started at different initial states and their output is observed at the same time. It is hoped that this way a more reliable sampling of the target distribution P(·) is obtained. It is important to note that no information exchange between the chains takes place.

Recommendations in the literature are conflicting regarding the efficiency of multiple independent chains. Yet there are at least theoretical advantages of multiple-independent-chains MCMC for establishing its convergence to P(·) [12]. Consider a high-dimensional distribution where an MCMC takes a long time to find a relevant region of the search space and to escape from it in order to search for other relevant regions. Then the time required by a single long MCMC run can exceed that of starting multiple MCMCs spread over the search space, each sampling a different region. However, MIC converges only after all the component MCMC chains have converged.

Since the chains do not interact, MIC is, at the population level, an MCMC with transition probabilities equal to the product of the component chains' transition probabilities, or

$$K(\mathbf{x}, \mathbf{y}) = \prod_{i=1}^{N} K_i(x_i, y_i)$$

where $\mathbf{x} = (x_1, \ldots, x_N)$ and $\mathbf{y} = (y_1, \ldots, y_N)$ are the current and the proposed populations. If the MCMCs have detailed balance and are irreducible and aperiodic, then MIC inherits these properties and converges, at the population level, to the product of their target distributions, $P_1(\cdot) \times \cdots \times P_N(\cdot)$, where $P_i(\cdot)$ is the target distribution of the i-th chain.
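A minimal MIC sketch (our illustration, with an assumed toy target on {0, 1, 2, 3}): N Metropolis chains evolve side by side with no interaction, and their pooled output approximates the common target.

```python
import random

def metropolis_step(x, p_tilde, rng):
    """One Metropolis step with a symmetric random walk on the cycle {0,1,2,3}."""
    y = (x + rng.choice([-1, 1])) % 4
    return y if rng.random() < min(1.0, p_tilde(y) / p_tilde(x)) else x

def mic(p_tilde, n_chains, steps, rng):
    """Multiple independent chains: no information is exchanged, so the
    population transition matrix is the product of the chains' matrices."""
    states = [rng.randrange(4) for _ in range(n_chains)]
    history = []
    for _ in range(steps):
        states = [metropolis_step(x, p_tilde, rng) for x in states]
        history.append(tuple(states))
    return history

p = lambda x: x + 1.0  # unnormalized target, P(x) proportional to x + 1
hist = mic(p, 8, 5000, random.Random(0))
pooled = [x for pop in hist[1000:] for x in pop]  # discard burn-in, pool chains
freq = [pooled.count(v) / len(pooled) for v in range(4)]
```

Pooling only makes sense here because every chain targets the same distribution; with distinct per-chain targets the population samples the product distribution instead.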

EMCMC framework

EMCMCs use a population of chains that allow interactions between the individuals, under the assumption that individuals in the current population exchange information that helps the EMCMC to sample the desired distribution. Note that, in EMCMCs, the population is a multi-set of individual states rather than a collection of MCMCs: the current individual states depend on several states from the previous population. Now the sample at time t is the population $\mathbf{X}^{(t)}$ of N states (or individuals) $\{X_1^{(t)}, \ldots, X_N^{(t)}\}$.

Definition 1

An evolutionary Markov chain Monte Carlo (EMCMC) algorithm is a population MCMC that exchanges information between individual states such that, at the population level, the EMCMC is an MCMC.

Similarly to an MCMC, the main goal of an EMCMC is to sample from a given distribution, P(·). Ideally, an MCMC algorithm generates individuals directly from the target distribution. Unfortunately, we do not know where the most likely—or equivalently, the most fit—individual states can be found in the search space. Furthermore, MCMCs can mix poorly when individual states are proposed with probabilities disproportionate to their target probabilities. A standard MCMC, for example, generates individuals with some mutation proposal distribution (e.g., the uniform mutation proposal distribution Qm) that does not have any knowledge of the sampled distribution. A method to speed up the mixing is to propose individuals using proposal distributions that are “close” to the target distribution. For that, we can use recombination operators that exploit the common structure of the parents. Sampling from a distribution implies that the more fit individuals are generated more often than less fit ones. As a consequence, the commonalities of more likely individuals are used by recombination to create other likely individuals. Intuitively, such a proposal distribution approximates the target distribution better than a proposal distribution that does not make any assumption about the generated individuals, like uniform mutation. In this perspective, the recombination operators adapt the proposal probabilities that generate an individual from the current population. Note that the allowed types of proposal distributions are the ones that preserve the Markov chain property at the population level: we can only use the information in the current population for generating new individuals.

Recombination operators in EMCMCs

We call EMCMCs that use recombination to exchange information between individuals recombinative EMCMCs.

Definition 2

A recombination operator used as the proposal distribution of an EMCMC generates one or more children from two or more parents using some function that is independent of the EMCMC's sampled distribution. Each generation, the population is uniformly randomly grouped in disjoint families of a few (i.e., two or three) individuals such that each individual belongs to exactly one family. All the chains of an EMCMC eventually interact through population recombinations. We call the distribution defined by the recombination probabilities at the population level the recombination proposal distribution, Qr.

It is important to note that at the family level, the proposal probabilities of recombination are not stationary, since they depend on the family members with which the individuals are grouped. At the population level, however, the recombination proposal distribution generating the next population from the current one is stationary.

We only consider recombination operators that are respectful—that is, the common substructures of the parents are inherited by the offspring [20]. With respectful recombination, the common parts of strings are protected against disruption.

An important aspect of any recombination operator is to establish whether it is symmetrical or not: for non-symmetrical recombinations, we have to compute the proposal probabilities, whereas for symmetrical operators we can simply use the Metropolis algorithm. In Sect. 4.1 we design and investigate several recombination operators that generate symmetrical proposal distributions, and in Sect. 4.2 we give examples of recombination operators that generate non-symmetrical distributions. We focus on the most popular type of recombination operators in GAs, which swap alleles between two or more parents with some probability to generate one or more children. Since respectful recombination is by definition reducible [8], in Sect. 4.3 we combine recombination with mutation to obtain irreducible proposal distributions following the simple mathematical rules of mixtures and cycles [8].

The MH acceptance rules

The recombination operators usually have no information about how fit the individuals in the current and proposed populations are. Then, as for standard MCMCs, we need acceptance rules to sample from the target distribution. Detailed balance is a sufficient, but not a necessary, condition for an irreducible aperiodic EMCMC to converge to a desired target distribution P(·). By definition, MH algorithms are aperiodic and have detailed balance. Most EMCMCs are irreducible MH algorithms—by use of mutation—that apply recombination in the proposal distribution.

In Sect. 5.1 we propose an EMCMC where individuals are generated with recombinative proposal distributions and the parents and children compete in a Metropolis-Hastings acceptance rule. Such an EMCMC has detailed balance if and only if the individuals that interact through recombination also interact in the acceptance rule. We further call such acceptance rules, where two or more chains interact, coupled acceptance rules. We prove that such an algorithm is ergodic—that is, irreducible and aperiodic—with the stationary distribution $P_1(\cdot) \times \cdots \times P_N(\cdot)$, where $P_i(\cdot)$ is the target distribution of the i-th chain. However, such a coupled acceptance rule has a negative effect on the performance of an EMCMC: if some children are fit individuals but the others are not, this acceptance rule can reject “good” individuals that the standard MH acceptance rule would always accept.
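The coupled acceptance rule can be sketched as follows (our illustration, assuming a symmetric recombinative proposal so that the proposal probabilities cancel): the whole family of children is accepted or rejected as a block.

```python
import random

def coupled_accept(p_tilde, parents, children, rng):
    """Coupled MH acceptance for a symmetric recombinative proposal:
    all children are accepted together with probability
    min{1, prod_i P~(y_i) / prod_i P~(x_i)}, otherwise all are rejected."""
    num = den = 1.0
    for y in children:
        num *= p_tilde(y)
    for x in parents:
        den *= p_tilde(x)
    alpha = min(1.0, num / den)
    return list(children) if rng.random() < alpha else list(parents)

p = lambda s: sum(s) + 1.0  # toy unnormalized target on binary strings
parents = [(0, 0, 0), (0, 0, 1)]
children = [(1, 1, 1), (1, 1, 0)]  # fitter family: alpha = 1, always accepted
new_family = coupled_accept(p, parents, children, random.Random(0))
```

When only some children are fit, the product in alpha drags the whole family down, which is exactly the drawback discussed above.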

We investigate the convergence properties of recombinative EMCMCs using variations of the Metropolis-Hastings acceptance rule. In Sect. 5.2 we prove that a recombinative population-based MCMC that accepts/rejects each candidate state using the standard Metropolis acceptance rule does not have detailed balance. Its advantage is that the probability of accepting at least one individual is larger than the acceptance probability of an EMCMC using the coupled acceptance rule. In Sect. 5.3 an example of an MH acceptance rule derived from the elitist replacement selection strategy [25] is designed. The sampled distribution is even more skewed towards probable states, and the acceptance probability of one individual is even larger. In Sect. 5.4 we propose and analyze a methodology we call nested EMCMC. It corrects the sampled distributions of skewed EMCMCs by accepting/rejecting all the individuals generated with those EMCMCs using the coupled acceptance rule. The nested EMCMC has detailed balance even though the initial EMCMC does not.

Recombinative proposal distributions for EMCMCs

In this section we propose and analyze various recombinative proposal distributions and their properties for EMCMCs that sample from the desired target distribution.

Symmetrical recombinations

In EMCMCs, symmetry is obtained by preserving the distance between the parents and their children. For example, the distance between the N children is equal to the distance between the N parents that generated them, or the distance between a parent and its child is kept constant relative to the distance between two other individuals in the population.

N parents generate N children

When the distance (e.g., the Hamming distance) between the generated children is the same as the distance between their parents, the recombination operator is symmetrical.

Proposition 1

Consider N parents uniformly randomly chosen without replacement from the current population, $\{x_i, \ldots, x_{i+N-1}\}$, and an associated distance metric $d: E \times E \to \mathbb{R}$ with

$$D(x_i, \ldots, x_{i+N-1}) = \sum_{j < k} d(x_{i+j}, x_{i+k})$$

where $0 \le j < k \le N-1$. Let $S_r$ be the recombination operator where N candidate individuals, $\{y_i, \ldots, y_{i+N-1}\}$, are generated by swapping alleles of the parents, such that the corresponding proposal probability $S_r(y_i, \ldots, y_{i+N-1} \mid x_i, \ldots, x_{i+N-1})$ is a function of the distance between the parents and the distance between the parents is equal to the distance between their children,

$$D(x_i, \ldots, x_{i+N-1}) = D(y_i, \ldots, y_{i+N-1});$$

then $S_r$ is symmetrical.

Proof

The probability to generate the children from their parents with $S_r$ is a function $f$ of the distance between the parents:

$$S_r(y_i, \ldots, y_{i+N-1} \mid x_i, \ldots, x_{i+N-1}) = f(D(x_i, \ldots, x_{i+N-1})) = f(D(y_i, \ldots, y_{i+N-1})) = S_r(x_i, \ldots, x_{i+N-1} \mid y_i, \ldots, y_{i+N-1})$$

Thus, this recombination is symmetrical. □

Note that if the number of children is different from N, in general, the symmetry condition does not hold. We discuss such examples in the next section.

The swapping recombinations often used in EMCMCs and standard GAs are particular cases of the above proposition, where the distance between individuals is kept constant by swapping alleles.

Proposition 2

Recombination proposal distributions that swap parts of individuals between chains using a uniform distribution are symmetrical, respectful and stationary.

Proof

Since there are equal probabilities to swap alleles (or parts) between the parents and between the children, this recombination is symmetrical, and the distance between them remains equal. If the parents have the same allele at a locus, so do the children, since swapping does not change the values of alleles. □

There are recombinations that exchange non-common alleles, e.g., uniform crossover, or parts of individuals, e.g., 1- and 2-point crossover [16, 17]. These recombinations are often used with only two parents.

In binary spaces, an example of a swapping recombination is parameterized uniform crossover, Qunif, which generates two candidate individuals by swapping alleles between two parents with a uniform probability, px. Thus, it is impossible to generate children that have common alleles other than those of their parents. Where the two parents differ, an allele is swapped with probability px and is not swapped with probability 1 − px. It is interesting to observe that the time complexity to generate two children from two parents with Qunif, as for uniform mutation, is linear in the dimensionality, $O(\ell)$.

For px = 0.5, the operator is called uniform crossover and is used with all codings: for strings of bits [16] and for strings of real numbers [12].
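A sketch of parameterized uniform crossover (our own illustration): the Hamming distance between the two children equals that between the parents, which is what makes the proposal symmetrical, and common alleles are never disrupted (respectfulness).

```python
import random

def uniform_crossover(x1, x2, px, rng):
    """Qunif: where the parents differ, swap the pair of alleles with
    probability px; common alleles are passed on unchanged."""
    c1, c2 = list(x1), list(x2)
    for h in range(len(x1)):
        if x1[h] != x2[h] and rng.random() < px:
            c1[h], c2[h] = c2[h], c1[h]
    return tuple(c1), tuple(c2)

hamming = lambda a, b: sum(u != v for u, v in zip(a, b))
p1, p2 = (0, 1, 1, 0, 1), (0, 0, 1, 1, 1)
c1, c2 = uniform_crossover(p1, p2, 0.5, random.Random(1))
assert hamming(c1, c2) == hamming(p1, p2)                       # distance preserved
assert all(c1[h] == p1[h] for h in range(5) if p1[h] == p2[h])  # respectful
```

Setting px = 0.5 gives the standard uniform crossover discussed above.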

Three parents generate one child

In the following, we introduce a general condition to design symmetrical recombinations using three parents which generate one child.

Proposition 3

Consider three parents uniformly randomly chosen without replacement from the current population, $\{x_{i_1}, x_{i_2}, x_{i_3}\}$. The recombination operator where a candidate individual, $y_i$, is generated from the three parents such that the total distance between the parents is equal to the total distance between the candidate individual and $x_{i_2}, x_{i_3}$,

$$d(x_{i_1}, x_{i_2}) + d(x_{i_1}, x_{i_3}) + d(x_{i_2}, x_{i_3}) = d(y_i, x_{i_2}) + d(y_i, x_{i_3}) + d(x_{i_2}, x_{i_3}),$$

is symmetrical, where $d: E \times E \to \mathbb{R}$ is a distance metric.

Proof

The parent $x_{i_1}$ and the child $y_i$ are interchangeable; they have the same total distance to the other two parents. Thus, this recombination is symmetrical. □

As an example in the binary space, we propose the total difference crossover, Qdif. This type of recombination is imported from real-coded EAs [22] and EMCMCs [24]. The new individual $y_i$ has the same alleles as $x_{i_1}$ at the positions where the two other parents coincide. At the other positions, we flip the alleles of $x_{i_1}$ with probability px.

Corollary 1

Qdif is symmetric, respectful and stationary. The time complexity of Qdif, as for Qunif, is linear in the dimensionality, $O(\ell)$.

The xor crossover [23] is a special case of Qdif where the probability to flip a bit is 1 at the positions of $x_{i_1}$ where $x_{i_2}$ and $x_{i_3}$ disagree.
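A sketch of Qdif over binary strings (our own illustration): the child copies $x_{i_1}$ and, where the other two parents disagree, flips that allele with probability px; the total distance is preserved when the child replaces $x_{i_1}$, which gives the symmetry of Proposition 3.

```python
import random

def total_difference_crossover(x1, x2, x3, px, rng):
    """Qdif: the child copies x1 where x2 and x3 coincide; where x2 and x3
    disagree, x1's allele is flipped with probability px."""
    return tuple(
        a ^ 1 if (b != c and rng.random() < px) else a
        for a, b, c in zip(x1, x2, x3)
    )

d = lambda a, b: sum(u != v for u, v in zip(a, b))
x1, x2, x3 = (0, 1, 0, 1), (0, 0, 1, 1), (0, 1, 1, 0)
y = total_difference_crossover(x1, x2, x3, 0.5, random.Random(2))
# the total distance to the other two parents is unchanged (Proposition 3)
assert d(x1, x2) + d(x1, x3) == d(y, x2) + d(y, x3)
assert all(y[h] == x1[h] for h in range(4) if x2[h] == x3[h])  # respectful
```

With px = 1 this reduces to the xor crossover mentioned above.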

The main difference between the two symmetrical types of recombination is that one preserves the sum of distances among the three parents when generating one child, while the other preserves the distance between two parents when generating two children.

Family versus population recombination operators

Given the number of chains that interact, we distinguish between family and population recombinations. Recombining a few chains (e.g., two or three) is an example of the first approach, while in the latter all chains of the population exchange information. The above recombination proposal distributions are all family recombinations.

Recall that, for family recombination, each generation the population is uniformly randomly grouped in disjoint families such that each individual belongs to exactly one family, whereas in population recombinations all the chains of an EMCMC eventually interact. At the family level, the proposal probabilities of recombination are not stationary, since they depend on the family members with which an individual is grouped; at the population level, the probability to generate one population from another with recombination, Qr, is stationary.

For the above family recombinations (e.g., Qunif and Qdif), the time complexity at the population level is linear in the number of individuals in the population: each generation, each individual is randomly assigned to exactly one family. The complexity of these recombination proposal distributions at the population level therefore is $O(N\ell)$. Note that, at the population level, the complexity of the mutation proposal distribution also depends linearly on the number of individuals in the population, $O(N\ell)$.
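The random family grouping can be sketched as follows (our own illustration): a uniform random partition of the population into disjoint families, so that each individual is recombined exactly once per generation.

```python
import random

def group_families(pop, family_size, rng):
    """Uniformly at random partition the population (a list of individuals)
    into disjoint families of `family_size`; each individual appears in
    exactly one family."""
    idx = list(range(len(pop)))
    rng.shuffle(idx)
    return [[pop[i] for i in idx[k:k + family_size]]
            for k in range(0, len(idx), family_size)]

pop = [(i, i % 2) for i in range(8)]  # a toy population of 8 individuals
families = group_families(pop, 2, random.Random(4))
assert len(families) == 4 and all(len(f) == 2 for f in families)
# every individual belongs to exactly one family
assert sorted(x for f in families for x in f) == sorted(pop)
```

The sketch assumes the population size is divisible by the family size; otherwise the last family is smaller.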

Non-symmetrical proposal distributions

We investigate two non-symmetrical recombinations where the alleles are exchanged between parents but, this time, the distance between parents and children is not preserved.

The masked recombination

This recombination swaps the alleles between two parents like the parameterized recombination, but it generates one child instead of two. Then the distance between the parents, in general, is not the same as the distance between the child and one of the parents. Thus, the recombination is not symmetrical. We call this recombination the masked recombination, Qmask.

A child $y_i$ is generated from a parent, $x_{i_1}$, and a mask, $x_{i_2}$. The common alleles of $x_{i_1}$ and $x_{i_2}$ are passed to $y_i$, but the non-common alleles of $x_{i_1}$ are flipped with rate px. Note that this crossover and the parameterized uniform crossover have the same probabilities to generate one child; but Qunif is symmetrical and Qmask is not, because Qmask generates only one child. Consequently, we have to compute the probability to generate a candidate individual with Qmask in the acceptance rule of the MH algorithm. Qmask also resembles Qdif with two identical parents. However, by replacing the identical parents with the child in the candidate generation, the symmetry condition no longer holds.

Proposition 4

Qmask is reducible and stationary. Consider that from a parent $x_{i_1}$ and a mask $x_{i_2}$ we generate a child $y_i$ with Qmask. Then Qmask is non-symmetrical. The time complexity to generate a child with Qmask is linear in the string size, $O(\ell)$.

Proof

Consider $y_i \neq x_{i_1}$ because bits were flipped at some positions. At those positions, the mask $x_{i_2}$ and the child $y_i$ have the same values, whereas $x_{i_1}$ and $x_{i_2}$ do not. Then it is impossible to generate $x_{i_1}$ from $y_i$ and $x_{i_2}$. The rest of the properties follow directly. □
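Because Qmask is non-symmetrical, its proposal probability must appear explicitly in the MH acceptance ratio. A sketch (our own illustration) of that probability, including the impossible reverse move from the proof above:

```python
def qmask_prob(parent, mask, child, px):
    """Qmask(child | parent, mask): common alleles of parent and mask must be
    inherited; where they differ, parent's allele is flipped with rate px."""
    prob = 1.0
    for a, m, y in zip(parent, mask, child):
        if a == m:
            if y != a:          # a common allele was not inherited
                return 0.0
        else:
            prob *= px if y != a else (1.0 - px)
    return prob

x1, x2 = (0, 1, 0), (0, 0, 1)   # parent and mask
y = (0, 0, 0)                   # locus 1 flipped, locus 2 kept
assert qmask_prob(x1, x2, y, 0.3) == 0.3 * 0.7
# reverse move: y and the mask share allele 0 at locus 1, so any child must
# inherit it, but x1 has allele 1 there -- probability 0, hence non-symmetry
assert qmask_prob(y, x2, x1, 0.3) == 0.0
```

In an MH acceptance rule both directed probabilities, Qmask(y | x, mask) and Qmask(x | y, mask), would enter the ratio; the zero reverse probability is what breaks symmetry.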

Recombination using probabilistic models

This recombination builds a probabilistic model of the parents to generate the children. It is analogous to the operator that generates individuals in estimation of distribution algorithms (EDAs), applied in Evolutionary Computation for solving optimization problems [19].

We propose the tree frequencies probabilistic recombination, Qtree, closely related to the probabilistic model of Baluja [2]. Unlike the previous recombination operators, where an allele is generated given only the alleles at the same position, Qtree considers the dependencies between pairs of positions in the population using the Chow and Liu [4] algorithm.

In the following, we describe the algorithm we use to generate individuals with Qtree. This algorithm constructs, from the population of current individuals, a tree of maximum entropy using a mutual information function. The entropy describes the level of uncertainty in a statistical variable; here, the frequencies of the alleles at a position define a statistical variable for that position. Mutual information captures the extent to which two statistical variables are dependent. The algorithm keeps adding dependencies between variables based on their mutual information, under the constraint of building a tree (i.e., there are no cyclic paths between variables). The higher the mutual information, the sooner the algorithm tries to add the dependency to the tree.

A root for this tree is chosen at random from the set of positions. The allele for the root position is chosen based on its frequencies in the current population. We iteratively generate the other alleles based on their dependency on an allele (called the parent) that was already instantiated in the tree. If h is the root of the tree, then the allele yih is generated using the distribution N(yih)/N, where N(yih) is the number of alleles yih in the current population. Otherwise, if h has parent h1 in the tree, then the allele yih is generated with probability equation M58, where equation M59 is the allele already generated at position h1, and equation M60 is the number of individuals in the current population that have allele yih at position h and allele equation M61 at position h1.
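The generation procedure above can be sketched as follows. This is a simplified illustration with our own function names: empirical frequencies are taken from a small population, and a greedy Prim-style construction stands in for the Chow–Liu algorithm:

```python
import math
import random
from collections import Counter

def mutual_information(pop, h, h1):
    """Empirical mutual information between positions h and h1."""
    n = len(pop)
    joint = Counter((x[h], x[h1]) for x in pop)
    ph = Counter(x[h] for x in pop)
    ph1 = Counter(x[h1] for x in pop)
    return sum((c / n) * math.log(c * n / (ph[a] * ph1[b]))
               for (a, b), c in joint.items())

def chow_liu_tree(pop, rng=random):
    """Greedy (Prim-style) construction of the maximum mutual-information
    tree. Returns (position, parent) pairs, with the root first, in an
    order where every parent precedes its children."""
    l = len(pop[0])
    root = rng.randrange(l)
    edges = [(root, None)]
    in_tree = {root}
    while len(in_tree) < l:
        _, h, h1 = max((mutual_information(pop, h, h1), h, h1)
                       for h in range(l) if h not in in_tree
                       for h1 in in_tree)
        edges.append((h, h1))
        in_tree.add(h)
    return edges

def sample_from_tree(pop, edges, rng=random):
    """Generate one child: the root allele from N(y_h)/N, the others from
    the conditional frequencies N(y_h, y_h1)/N(y_h1)."""
    y = [None] * len(pop[0])
    for h, h1 in edges:
        if h1 is None:
            counts = Counter(x[h] for x in pop)
        else:
            counts = Counter(x[h] for x in pop if x[h1] == y[h1])
        vals = list(counts)
        y[h] = rng.choices(vals, weights=[counts[v] for v in vals])[0]
    return y
```

On a population where two positions are perfectly correlated, the sampled child reproduces the correlation, which is exactly the dependency exploitation described above.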

We observe that Qtree is the most expensive recombinative proposal distribution we investigate for EMCMC. Unlike the other discrete-space recombinations, Qtree exploits relationships between dimensions: it computes the dependencies between pairs of positions in order to construct the maximum entropy tree and to assign a value to an allele. Thus, the generation of an allele at one position also depends on the allele at another position.

Proposition 5

Qtree is respectful, non-symmetrical and stationary, and it biases the exploration according to the non-linear correlations between dimensions. The computational complexity to generate a child with Qtree is equation M62, where l is the dimensionality and N the size of the population.

Proof

When an individual generated with Qtree replaces a parent, some allele frequencies can increase at the cost of others. The computational complexity of this operator is given by building the maximum log-likelihood tree; Chow and Liu [4] show that this is equation M63. equation M64

Qtree is a generalization of the recombination proposal distribution of Laskey and Myers [15]: when generating an allele, they consider only the frequencies of the alleles at the same position, and not those at the other positions as Qtree does. Therefore, their recombination, unlike Qtree, does not exploit the relationships between dimensions.

Irreducible recombinative proposal distributions

Since respectful recombination is by definition reducible, in the following we study how to combine it with mutation to obtain irreducible proposal distributions. We combine proposal distributions following the same simple mathematical rules as for transition distributions, and we study the properties (such as symmetry and irreducibility) of the resulting proposal distributions. We show examples where the properties of the component proposal distributions are inherited by the composite proposal distribution. In general, however, the properties have to be checked for each distribution.

We combine mutation and recombination in mixtures and cycles.

Definition 3

A mixture of proposal distributions is a probabilistic sum of proposal distributions where, at each step, one distribution is selected according to some constant positive probability. A cycle of proposal distributions is the product of proposal distributions where, at each step, each distribution is applied in turn in a specific order.

Mixtures

Proposition 6

In a mixture of proposal distributions, if one distribution is irreducible, then the mixture is irreducible. A mixture is symmetrical if the component distributions are symmetrical. A mixture is stationary if all component distributions are stationary.

Proof

If one distribution is irreducible, then there is a non-zero probability to generate any population from any other population. The rest of the properties follow directly. equation M65

For example, the following mixture

equation M66

is irreducible when pr < 1, and symmetrical when the recombination is symmetrical. Note that the above operator is equivalent to recombination, Qm+r = Qr, for pr = 1; then, like recombination, Qm+r is reducible. The computational cost of a mixture of proposal distributions is driven by its most expensive component. Furthermore, a mixture exploits relationships between dimensions if one of its component proposal distributions does.
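A mixture like Qm+r can be sketched generically; `recombine` and `mutate` below are placeholders for any of the operators above, and the function name is our own:

```python
import random

def mixture_proposal(x1, x2, recombine, mutate, pr=0.5, rng=random):
    """Q_{m+r}: with probability pr propose via recombination, otherwise
    mutate each parent independently. Any pr < 1 keeps the mixture
    irreducible as long as the mutation operator is irreducible."""
    if rng.random() < pr:
        return recombine(x1, x2)
    return mutate(x1), mutate(x2)
```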

Cycles

Unlike for mixtures, there are no general rules for the irreducibility or symmetry of cycles; they have to be checked for each cycle. Cycles of proposal distributions are common in standard GAs, where one applies first mutation and then recombination, Qm×r, or first recombination and then mutation, Qr×m.

equation M67

In general, since two matrices usually do not commute, Qm×r and Qr×m are non-symmetrical.

Proposition 7

Qm×r and Qr×m are symmetrical for any recombination that swaps alleles [17]. Qm×dif and Qdif×m are non-symmetrical. Qm×r and Qr×m are irreducible and stationary. If the recombination Qr is symmetrical, we have

equation M68

To ease the reading of the paper, we give the proof for the above proposition in Appendix 1.

Parallel Recombinative Simulated Annealing (PRSA) [17] uses recombination that swaps alleles followed by mutation. Note that it is impractical to compute the probabilities of a cycle: we have to sum over all possible intermediate populations. Therefore, in general, it is impractical to use non-symmetrical cycles. In the following, we show two cycles where the above non-symmetrical recombinations are efficiently combined with uniform mutation directly on each position of an individual.

Consider a parent equation M69 and a mask equation M70 chosen at random from the population. As for Qmask, at the non-common positions of the two parents, equation M71 is flipped with probability px to generate the child yi. Unlike for Qmask, at the common positions of these parents, equation M72 is flipped with the low probability 1/ℓ to generate the child yi. From the mask equation M73 we generate a second child yi+1 with uniform mutation at rate pm. We denote this proposal distribution by Qm×mask, where

equation M74

In the next proposition we show that Qm×mask, unlike Qmask, can be used with an MH algorithm. Furthermore, although it is a cycle, its computational time is similar to that of uniform mutation.
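The Qm×mask cycle above can be sketched as follows; for illustration we expose the common-position flip rate as a parameter `pc` (1/ℓ in the text) and the mutation rate as `pm`, with the function name and encoding our own:

```python
import random

def q_m_mask(parent, mask, px, pc=None, pm=None, rng=random):
    """Cycle Q_{m x mask} (a sketch): the first child is built from the
    parent, flipping non-common positions with rate px and common
    positions with the low rate pc (1/l in the text); the second child
    is the mask after uniform mutation with rate pm."""
    l = len(parent)
    if pc is None:
        pc = 1.0 / l
    if pm is None:
        pm = 1.0 / l
    # child 1: flip rate depends on whether parent and mask agree
    y1 = [1 - a if rng.random() < (px if a != m else pc) else a
          for a, m in zip(parent, mask)]
    # child 2: uniform mutation of the mask
    y2 = [1 - m if rng.random() < pm else m for m in mask]
    return y1, y2
```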

Proposition 8

Qm×mask is irreducible. Qm×mask is symmetrical if pm = 1/2 or px = 1/ℓ. If pm ≠ 1/2 and px ≠ 1/ℓ, then Qm×mask is non-symmetrical. The time complexity of Qm×mask is linear in the string size equation M75.

The proof is given in Appendix 2 to ease the reading of the paper.

Similarly, we combine the tree frequencies probabilistic recombination, Qtree, with uniform mutation in a cycle so that it can be used with the MH algorithm. We first construct the maximum entropy tree. We choose a position h at random, which we consider the root, and propose an allele yih with probability equation M76. Iteratively, we propose an allele yih with probability

equation M77

where the allele at position h1, yih1, is already instantiated. We denote this operator by equation M78. Like Qtree and unlike the other proposal distributions, Qm×tree exploits relationships between different dimensions.

Proposition 9

Qm×tree is irreducible and non-symmetrical. The time complexity to generate an individual with Qm×tree is equation M79, where ℓ is the string size and N the population size.

Proof

The proof is immediate. equation M80

In Table 1 we present the operators composed from mutation and/or recombination, their irreducibility, their symmetry, and their number of parents compared with the number of children.

Table 1
Properties of several mutation/recombination operators: whether they are irreducible, whether they are symmetrical, and how many children are generated from how many parents

MH acceptance rules for recombinative EMCMC

In this section we propose various MH acceptance rules and we discuss the properties of EMCMC algorithms resulting from the interaction between the recombinative operators and these acceptance rules.

Detailed balance: all children accepted or all rejected

We establish that an EMCMC that generates individuals with an irreducible recombinative proposal distribution and accepts or rejects them all together has detailed balance and converges to the target distribution.

Theorem 1

Consider the EMCMC algorithm that proposes N ≥ 2 children, Y = (y1,…, yN), from N parents, equation M81, using an irreducible proposal distribution Q that is independent of the target distribution. All children are accepted or all children are rejected with the probability

equation M82

This EMCMC is ergodic with unique stationary distribution P1(·) × ⋯ × PN(·), where Pi(·) is the unique stationary marginal distribution of the ith chain, ∀i = 1,…, N.

The proof is given in Appendix 3 to ease the reading of the paper.

Note that the EMCMC resulting from the interaction between the proposal distribution Q and the MH acceptance rule αC is an MCMC over the N dimensional search space EN. We denote the transition matrix for this EMCMC algorithm with KC. The transition probability between a candidate state Y and the current state X(t) is equation M83 and the rejection probability is equation M84.

Two examples

The coupled acceptance rule. The coupled acceptance rule αC [11, 16] considers two chains for acceptance. Two children, y1 and y2, generated from two parents, equation M85 and equation M86, are both accepted or both rejected with the coupled acceptance rule equation M87.
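For a symmetrical proposal distribution, the coupled accept/reject step can be sketched as follows; the function name and the unnormalized target `p` are our own:

```python
import random

def coupled_accept(p, x1, x2, y1, y2, rng=random):
    """alpha_C for a symmetrical proposal: both children are accepted, or
    both rejected, with probability min(1, P(y1)P(y2) / (P(x1)P(x2)))."""
    ratio = (p(y1) * p(y2)) / (p(x1) * p(x2))
    if rng.random() < min(1.0, ratio):
        return y1, y2, True
    return x1, x2, False
```

Because the two children share one acceptance probability, a fit child can be dragged down by an unfit sibling, which is exactly the drawback discussed below.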

When αC is associated with one of the irreducible recombinative proposal distributions that generate two children from two parents (for instance, Qm×unif and Qm+unif), then, according to Theorem 1, the EMCMC algorithm has detailed balance and samples from the target distribution P1(·)×P2(·).

Corollary 2

Consider the EMCMC algorithm that proposes two children from two parents using an irreducible proposal distribution Q and accepts/rejects the children using the coupled acceptance rule αC. We denote the corresponding transition matrix by KC. This EMCMC converges to P1(·)×P2(·), where Pi(·) is the marginal distribution of the i-th chain, i = 1, 2.

However, in practice, such an acceptance rule is not always desirable, since it is not selective at the individual level. Usually, individuals with both higher and lower probabilities are proposed; with αC the fit individuals can be rejected, and the acceptance of less fit individuals depends on the fit individuals in the same family.

Note that the target distribution of this EMCMC is given by the product of distributions in the MH acceptance rule. By replacing this product with other mathematical functions (e.g., the maximum of two values, as in the next example), the corresponding EMCMC converges to a different distribution.

The order two statistics acceptance rule

To sample from the order two statistics distribution

equation M88

we create a variant of the coupled acceptance rule

equation M89

where max returns the maximum of the values of the two individuals, and equation M90 is any proposal distribution.

According to Lemma 1, an EMCMC that proposes two candidate individuals from two parents and accepts/rejects them both with α2:1 has detailed balance. If the proposal distribution is also irreducible, this EMCMC converges to the stationary distribution P2:1(·,·).

Detailed balance at population level

Most EMCMCs use family recombination where, each generation, all individuals are randomly grouped such that each individual belongs to exactly one group. If the children generated with recombination are all accepted or all rejected with an acceptance rule as in Theorem 1, we obtain family transition probabilities with detailed balance. At the individual or family level, these transitions are not MCMCs, since their proposal probabilities are not stationary: they depend on how the individuals are grouped. At the population level, over all possible groupings of the current population, the transition distribution is stationary. Then the population transition probabilities obtained by combining the family transitions have detailed balance and define an MCMC.

The standard MH acceptance rule in recombinative EMCMCs

In the following, we investigate the properties of EMCMCs that use irreducible recombinative proposal distributions and the standard MH acceptance rule. Such an EMCMC does not fit in the standard MH framework, where the individuals that interact in the proposal distribution also interact in the acceptance rule. Here, individuals interact in the proposal distribution, but children are accepted/rejected individually given only one parent.

To ease the reading, we consider that two children, y1 and y2, are generated with a symmetrical proposal distribution Q from two parents equation M91 and equation M92. Each child is accepted/rejected against one of the parents, randomly chosen without replacement, with the standard Metropolis acceptance rule, equation M93. We denote the resulting transition matrix by K1.1. The transition probability to accept both children is

equation M94

The transition probability to accept only one child (say, y1) and to reject the other child is

equation M95

The rejection probability of both candidate states is

equation M96

To analyze the behavior of this EMCMC, we compare its transition distribution with KC, which, as shown in Theorem 1, converges to the target distribution. We show that even though the acceptance and rejection transition probabilities are similar, KC samples from the target distribution and K1.1 does not.

Proposition 10

Consider the two transition distributions KC and K1.1, the coupled acceptance rule αC, and the standard Metropolis acceptance rule α as before. Further consider two parents equation M97 and their two children (y1, y2) generated with an irreducible symmetrical proposal distribution Q.

The probability to accept a child that is fitter than one of its parents, equation M98, is higher for K1.1 than for KC

equation M99

The probability to reject a child less fit than one of its parents, equation M100, is higher for K1.1 than for KC when the second child is fitter than the second parent, equation M101

equation M102

The probability to reject a child less fit than one of its parents, equation M103, is lower for K1.1 than for KC when the second child is less fit than the second parent, equation M104.

The EMCMC algorithm K1.1 has detailed balance if and only if the probability to generate the two children from the two parents equals the probability to generate one child and one parent from the other parent and the other child

equation M105
1

If Eq. 1 holds, the algorithm converges to the target distribution P1(·) · P2(·).

Again, to ease the reading, we prove this proposition in Appendix 4.

According to the above proposition, an MH algorithm that accepts/rejects, with the standard MH acceptance rule, some but not all of the individuals generated with a recombinative proposal distribution exhibits detailed balance only for particular types of recombination.

Equation 1 holds, for example, for the uniform mutation distribution Qm and for symmetrical recombination distributions that generate one child [8, 23]. It does not hold for other symmetrical recombinations that generate two or more children, such as uniform recombination, for which, for any four individuals, we have

equation M106

Unfortunately, if the detailed balance condition does not hold, there is no standard method to determine the target distribution.

It is interesting to notice that the MH algorithms generated with K1.1 have a higher probability of accepting at least one candidate state than the algorithms generated with KC, which accept or reject all individuals at once. As a consequence, for the same proposal distribution, the algorithm determined by K1.1 samples faster than an algorithm that uses KC.

The elitist coupled acceptance rule

In this section we investigate an acceptance rule inspired by the elitist replacement strategy [25], which does not have detailed balance regardless of the proposal distribution used. Furthermore, we show that the stationary distribution of the generated EMCMC differs from the target distribution: it is amplified for the fit individuals and diminished for the less fit individuals.

The elitist coupled acceptance (ECA) rule is a family-competitive acceptance rule where the best two solutions from the family of four are kept if at least one of them is a child. Otherwise, when both children have lower fitness than both parents, the children can probabilistically replace the parents.
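A sketch of the ECA rule for a family of two parents and two children, assuming an unnormalized target `p`; the function name, the fall-back coupled step, and the encoding are our own:

```python
import random

def eca_accept(p, x1, x2, y1, y2, rng=random):
    """Elitist coupled acceptance (ECA): keep the two fittest of the
    family of four if at least one of them is a child; otherwise (both
    children worse than both parents) fall back on the coupled
    accept/reject of alpha_C."""
    best_two = sorted((x1, x2, y1, y2), key=p, reverse=True)[:2]
    if any(c is y1 or c is y2 for c in best_two):
        return tuple(best_two)          # elitist branch: always accepted
    ratio = (p(y1) * p(y2)) / (p(x1) * p(x2))
    if rng.random() < min(1.0, ratio):
        return y1, y2
    return x1, x2
```

The elitist branch never rejects, which is why an ECA-driven chain behaves more like a GA than like a sampler.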

ECA can be viewed as a combination of the elitist replacement rule from regular GAs and the coupled acceptance rule αC. Compared with elitist replacement, ECA is more exploratory but less elitist, since it still accepts less fit individuals with some probability. Compared with the αC and α acceptance rules, ECA is more elitist but less exploratory. With ECA, if a child and a parent are the two fittest individual states among the two parents and their children, they are always accepted, whereas with α the other child would be accepted with some probability.

To establish the properties of ECA's target distribution, we compare it with KC. As we show in the next paragraphs, the probability to escape from the basin of attraction of a peak is rather poor compared with the transition distribution KC generated with the same proposal distribution and the coupled acceptance rule αC. We denote with KECA the transition distribution generated by accepting with ECA the individuals proposed with the irreducible proposal distribution Q. We call max2 the function returning the two fittest solutions.

We distinguish three cases.

  1. Both children are better than, or both worse than, their parents. Then
    equation M107
    or
    equation M108
    where equation M109 and equation M110. The transition probability to accept or reject both children, {y1, y2}, proposed with the proposal distribution Q is non-zero only in this case. Then
    equation M111
    equation M112
    Note that in this case, the transition probability of ECA equals the transition probability of an EMCMC using the coupled acceptance rule,
    equation M113
  2. One of the children and one of the parents are the two fittest. Then, for example,
    equation M114
    The transition probability to go from equation M115 to {y1,y2} is 0.
    equation M116
    Now, KC is larger than 0 and KECA is 0.
  3. Only one parent is replaced by its child. The proposal probability that only one parent is replaced in the next generation, equation M117, is amplified by the sum over all proposal probabilities that generate a state y2 such that
    equation M118
    Then
    equation M119

For irreducible proposal distributions Q, this EMCMC algorithm is irreducible because any two individuals can be generated from any other two individuals with non-zero probability in two iterations of the algorithm, equation M120. Let us assume again that a child y1 and one of the parents equation M121 have the largest probabilities. In one iteration

equation M122

and, for the second iteration, we also have equation M123.

Following the above observation, we prove that this EMCMC converges to a stationary distribution and that it does not have detailed balance regardless of the proposal distribution. The proof is given in Appendix 5.

Proposition 11

Consider the EMCMC algorithm that generates candidate individuals using an irreducible proposal distribution Q and then accepts or rejects them with the ECA acceptance rule. This EMCMC algorithm does not have detailed balance for any non-uniform distribution Q, and it converges to a stationary distribution equation M124.

This algorithm climbs towards a local optimum, since it is very probable that a good solution remains in the population for a long time to generate better solutions. Only when both children are worse than their parents does this algorithm reject the two candidate individuals with some probability. Otherwise, ECA always accepts at least one child. As a consequence, the probability to accept at least one proposed child is the largest of all the previous acceptance rules. Thus, an algorithm that uses ECA behaves more like a standard GA than like a sampling algorithm. Consequently, the target distribution of ECA is biased towards the high regions of P(·): the highest fitness states are sampled more often at the expense of the lower fitness states.

Nested transition distributions: repairing the detailed balance

In the following, we propose a method to integrate transition distributions without detailed balance into MH algorithms with detailed balance. To achieve this, we accept or reject all the individuals generated with the MH algorithm without detailed balance.

Definition 4

A nested EMCMC algorithm is an EMCMC algorithm where individuals are proposed using a transition distribution and are then all accepted or all rejected by a coupled MH acceptance rule. A nested transition distribution is the transition distribution used as the proposal distribution by the nested EMCMC algorithm.

Furthermore, the nested transition distribution that generates individuals with a recombination distribution is itself a recombinative proposal distribution: from two or more parents we propose two or more children.

Proposition 12

The nested EMCMC algorithm has detailed balance. The nested transition distribution composed of a respectful recombination proposal distribution and an acceptance rule is itself a respectful recombination proposal distribution.

Proof

The proof is immediate if we consider the nested transition distribution as a proposal distribution and apply Lemma 1. If the parents have identical values at certain positions, then the individuals generated by respectful recombination have, by definition, the same values at those positions. An acceptance rule simply selects from parents and children; therefore, the accepted individuals have the same values at those positions. equation M125

Nested transitions are usually non-symmetrical; thus, we need to compute their probabilities. In Fig. 1, we graphically depict the nested EMCMC framework.

Fig. 1
Nested EMCMC framework: a candidate population equation M126 is proposed with some proposal distribution Q from the current population Xt and some children are accepted with some MH acceptance rule. These accepted children and the parents that are not replaced form ...

Examples of nested EMCMCs

Correcting K1.1. Consider the nested EMCMC that uses as its proposal distribution the nested transition distribution K1.1, where two candidate individuals are proposed from two parents with some recombinative proposal distribution Q, and each child competes against one of the parents, randomly chosen from the population, with a standard MH acceptance rule. The candidate individuals proposed with K1.1 are, in turn, accepted with the coupled acceptance rule αC. The nested EMCMC's transition distribution is

equation M127

where the coupled acceptance rule is

equation M128

We observe that, with the coupled acceptance rule αC, the nested EMCMC eliminates the influence of the proposal distribution on K1.1's target distribution.

In the following proposition, we express KnEMCMC as a function of K1.1 and the proposal distribution Q. The proof of this proposition is in Appendix 6.

Proposition 13

Consider that the symmetrical proposal distribution Q generates y1 and y2 from equation M129 and equation M130. If both children are different from their parents, equation M131, the nested transition distribution is

equation M132

If only one child is different from its parent, equation M133, then

equation M134

where

equation M135

Otherwise, if both children are rejected,

equation M136

From the above proposition, we note that the difference between the two transition distributions, KnEMCMC, which samples from the target distribution, and K1.1, which does not, is given by the correction term from Eq. 2. In other words, K1.1 has to be multiplied by the above correction term to sample from the target distribution. If the irreducible proposal distribution Q has the property that

equation M137

for any equation M138 and y2, the correction term is 1. In this case, according to Propositions 13 and 10, K1.1 has detailed balance and converges to the target distribution. Note that the probability of accepting at least one candidate individual with KnEMCMC is smaller than with K1.1 and larger than with KC. Furthermore, K1.1, as a proposal distribution, is not symmetrical, and to use it in KnEMCMC we have to compute the impractical correction term.

KC as nested proposal distribution. The coupled transition distribution KC is invariant under the nested method. The proof of this proposition is given in Appendix 7.

Proposition 14

Consider that the symmetrical proposal distribution Q generates y1 and y2 from equation M139 and equation M140. The nested transition distribution is

equation M141

The coupled transition distribution KC does not need a correction term to converge to the target distribution.

Three experimental tests

We compare the performance of recombinative and non-recombinative population-based (E)MCMCs on two functions: a toy problem, the hyper-geometric distribution, which we use to analytically compare the performance of the algorithms, and a larger problem, the binary quadratic programming problem (BQP). We show that recombinative EMCMCs can outperform standard MCMCs. Furthermore, we show that the algorithms that use the coupled acceptance rule αC are less efficient than the algorithms that use the standard acceptance rule α.

We compare the performance of five MCMCs: a single-chain MCMC, two non-recombinative MCMCs, and two recombinative EMCMCs. We take the population size N = 2.

  1. MCMC: a single-chain MCMC that proposes new states with Qm with the mutation rate pm = 1/ℓ and accepts (rejects) them using the Metropolis acceptance rule α.
  2. MIC: two independent MCMCs that propose new states with Qm with the mutation rate pm = 1/ℓ and accept (reject) them using the Metropolis acceptance rule α.
  3. mutC: a non-recombinative population-based MCMC that, each generation, proposes 2 new states with the same Qm and accepts (rejects) both of them using the coupled acceptance rule αC.
  4. rEMCMC: generates two individuals with a cycle of Qm and parameterized uniform recombination, Qunif, with pr = 50%, and then accepts them with the standard Metropolis acceptance rule α.
  5. rEMCMCC: generates two individuals with a cycle of Qm and Qunif and then accepts them with the coupled acceptance rule αC.
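One generation of rEMCMC (algorithm 4 above) can be sketched as follows; the binary-list encoding, function names, and unnormalized target `p` are our own, and per the analysis of K1.1 this chain only approximates the target:

```python
import random

def remcmc_step(p, x1, x2, pm=None, pr=0.5, rng=random):
    """One generation of rEMCMC (a sketch): mutate both parents with
    rate pm (default 1/l), swap alleles with probability pr
    (parameterized uniform recombination), then accept each child
    against its parent with the standard Metropolis rule."""
    l = len(x1)
    if pm is None:
        pm = 1.0 / l
    mutate = lambda x: [1 - v if rng.random() < pm else v for v in x]
    m1, m2 = mutate(x1), mutate(x2)
    y1, y2 = [], []
    for a, b in zip(m1, m2):
        if rng.random() < pr:          # swap the alleles at this position
            a, b = b, a
        y1.append(a)
        y2.append(b)
    if rng.random() < min(1.0, p(y1) / p(x1)):
        x1 = y1
    if rng.random() < min(1.0, p(y2) / p(x2)):
        x2 = y2
    return x1, x2
```

Replacing the two per-individual Metropolis steps at the end with a single call to a coupled rule turns this sketch into rEMCMCC.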

As shown in the previous sections, the target distribution of the three population-based (E)MCMCs (MIC, mutC and rEMCMCC) is equation M142, and the target distribution of the single-chain MCMC is P(·). The sampled distribution of rEMCMC is not the target distribution but, as the experimental results show, it approximates P(·) quite well for large search spaces and a small number of samples.

Sampling from the hyper-geometrical function

To compare MH algorithms analytically, we compute the second largest eigenvalue of the transition matrices of the corresponding (E)MCMCs. Note that the second eigenvalue should be small for the chain to mix well.
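To illustrate why the second eigenvalue governs mixing, consider a toy two-state Metropolis chain, for which the eigenvalues have a closed form; this toy construction is our own and is not one of the (E)MCMCs compared below:

```python
def metropolis_2state(p0, p1):
    """Transition matrix of a Metropolis chain on {0, 1} with a
    deterministic flip proposal and unnormalized target (p0, p1)."""
    a = min(1.0, p1 / p0)              # acceptance of the move 0 -> 1
    b = min(1.0, p0 / p1)              # acceptance of the move 1 -> 0
    return [[1 - a, a], [b, 1 - b]]

def second_eigenvalue(k):
    """For a 2x2 stochastic matrix the eigenvalues are 1 and trace - 1;
    the closer |trace - 1| is to 0, the faster the chain mixes."""
    return k[0][0] + k[1][1] - 1.0
```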

The tested distribution

A hyper-geometric distribution (Hyper) is

equation M143

with ℓ the string size and w the number of 1-bits in the individuals with the lowest value, 0.01; the individual x0 with all bits equal to 0 is the second largest peak, h2, and the individual with all bits equal to 1 is the largest peak, h1. We set ℓ = 8 and h1 = 1.

Results

In the first experiment, see Fig. 2a and b, we vary the distance of the lowest-valued states to the optimum, w = {1, 2, 3}, and we set the value of the second largest state to h2 = 0.75. In this case, the local and global optima have close values and we vary their basins of attraction: the greater the distance from the local optimum, the smaller the basin of attraction of the global optimum. Second, we vary h2 from 0.25 to 0.75 with a step size of 0.25 and set w = 3. In this case, the optimum is isolated and its importance decreases with the height of the second largest peak. In Fig. 2a and c we show results for high mutation and swapping rates, 0.5; in Fig. 2b and d we have low mutation and swapping rates, 0.125. We set the low mutation rate for the cycle Qm×Qunif to 0.125. Again, we reduce the computational cost by grouping individuals with the same number of ones and zeros into one individual, because these individuals have the same fitness value and, therefore, the same acceptance probability. The eigenvalues for MIC and MCMC are the same because the two MCMCs have the same acceptance rule and proposal distribution; thus, we show results for only one of the two algorithms.

Fig. 2
Second largest eigenvalues for Hyper-geometrical function on 8 bits, that is two blocks each of 4 bits, where a,bw = {1, 2, 3} and the peak heights are set to h1 = 1 and h2 = 0.75, and c ...

In Fig. 2a and b, for w = 2, the basins of attraction of the two peaks are equal. Then we obtain the highest eigenvalues, and thus the worst performance, for all four algorithms. Here we have the largest number of low-fitness states separating two narrow regions of fit individuals; a random sampler (see MCMC with mutation rate 0.5 in Fig. 2b) is the best algorithm, since it covers a large area of equal low values in a short time.

For the other values of w, the basin of attraction of one of the peaks is wider than that of the other peak; the narrower a region is, the harder it is to find and sample. For w = {1, 3} we have the lowest eigenvalues and, furthermore, the largest differences between the algorithms. The non-recombinative (E)MCMCs do well because the narrow peak is now reduced to one point. The recombinative EMCMCs do better than the non-recombinative (E)MCMCs with the same acceptance rule because recombination generates fitter individuals with higher probability by combining the two building blocks of this function. In Fig. 2c and d, we observe that the performance of all the (E)MCMC algorithms varies very little with the height of the second largest peak h2. Thus, these eigenvalues are (approximately) the same as the eigenvalues for w = {1, 3} from Fig. 2a and b.

To conclude this example, we observe that, due to the structure of the problem, recombinative EMCMCs provably outperform the non-recombinative EMCMCs. The performance of MCMCs is diminished by the coupled acceptance rule αC; MCMC samples more efficiently than mutC, and rEMCMC is better than rEMCMCC. The mutation rate greatly influences the performance of non-recombinative MCMCs; a high mutation rate decreases the performance of the algorithm. The swapping probability influences the efficiency of the recombinative EMCMCs less: rEMCMC and rEMCMCC perform best for high swapping probabilities, whereas MCMC and mutC perform best for low mutation rates.

Sampling from BQP

In the following, we perform experiments with the binary quadratic programming problem (BQP) to show, on a more elaborate example, that recombination is useful for sampling. The fitness function of an individual x is f(x) = ∑_{j=1..ℓ} ∑_{k=1..ℓ} F[j][k] · x[j] · x[k], where F[j][k] is the element on the j-th row and k-th column of an ℓ × ℓ matrix F of integers, both positive and negative, and x is a binary string (i.e., each x[j] is 0 or 1).

The interaction between two or more positions of the BQP problem depends on the density of the matrix F, defined as the number of non-zero elements divided by the total number of elements in the matrix. The density thus lies between 0 and 1, where 0 means no interaction between positions and 1 means maximum interaction, that is, every position depends on every other position. For our experiments, we generate random matrices with density 0.1.
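As an illustrative sketch (ours, not the authors' implementation), the BQP fitness function and a random matrix of a given density can be written as follows; the function names are our own:

```python
import random

def random_bqp_matrix(ell, density, low=-100, high=100, rng=random):
    """Generate an ell x ell integer matrix in which roughly a fraction
    `density` of the entries are non-zero, drawn uniformly from [low, high]."""
    F = [[0] * ell for _ in range(ell)]
    for j in range(ell):
        for k in range(ell):
            if rng.random() < density:
                F[j][k] = rng.randint(low, high)
    return F

def bqp_fitness(F, x):
    """f(x) = sum_j sum_k F[j][k] * x[j] * x[k] for a binary string x."""
    ell = len(x)
    return sum(F[j][k] * x[j] * x[k]
               for j in range(ell) for k in range(ell))
```

Only the terms where both x[j] and x[k] are 1 contribute, so the fitness sums exactly the active entries of F.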

These fitness values are both positive and negative. However, the (unnormalized) probabilities of a distribution can only be positive. Therefore, we add a fixed positive integer transl to all the fitness values; every translated value that is still at most 0 is assigned the value 0.01. The unnormalized probabilities are equation M144 when f(x) + transl > 0 and equation M145 otherwise.
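The translation step can be sketched as follows (the function name unnorm_prob is our own; the clamping value 0.01 follows the text):

```python
def unnorm_prob(f_x, transl):
    """Translate a (possibly negative) fitness value into a strictly
    positive unnormalized probability: add transl, and assign 0.01
    to any value that is still at most 0."""
    v = f_x + transl
    return v if v > 0 else 0.01
```

For example, with transl = 50 a fitness of 146 maps to the unnormalized probability 196, while any fitness of −50 or below maps to 0.01.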

In this section, we show that recombination can improve sampling. We first discuss the experimental methodology available to measure and compare the performance of EMCMCs. Second, we show experimental results on a BQP problem on 20 bits. By expanding the target distribution, we are able to compute the distance between the sampled distribution and the true distribution. Finally, we show results on a larger search space, for ℓ = 100 bits. Unlike in the previous example, we cannot expand this distribution, and we are therefore constrained to use less precise methods to assess the performance of the (E)MCMCs. In both experiments, we compare the five (E)MCMC algorithms described above: three non-recombinative (E)MCMCs (one long-chain MCMC, MIC and mutC) and two recombinative EMCMCs (rEMCMC and rEMCMCC).

Experimental methodology

To assess the efficiency of the various EMCMCs, we focus on monitoring how fast an MCMC mixes and how well the samples spread over the entire target distribution after a fixed and rather small number of generated individuals. There is no generally acknowledged methodology for measuring how “close” a set of samples generated with a real-coded MCMC is to the true target distribution. Wolpert and Lee [26] argue that a good approach is to use the Kullback-Leibler (KL) distance between an approximation of the sampled distribution and a discrete approximation of the true distribution.
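As a sketch of this performance measure, the KL distance between a sampled histogram and the true distribution can be computed as follows (a minimal implementation of ours, assuming both distributions are given over the same support):

```python
import math

def kl_distance(p, q):
    """Kullback-Leibler divergence D(p || q) between two discrete
    distributions given as aligned lists of probabilities. Terms with
    p_i = 0 contribute 0; q_i must be > 0 wherever p_i > 0."""
    return sum(p_i * math.log(p_i / q_i)
               for p_i, q_i in zip(p, q) if p_i > 0)
```

The distance is 0 exactly when the sampled distribution matches the true one, and grows as the two diverge.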

To measure the speed with which an algorithm samples the search space, Roberts et al. [21] recommend monitoring the acceptance probability of the algorithm. They analytically and experimentally study the behavior of a standard MCMC using normally distributed mutation with fixed and equal variances in all dimensions. The target distribution is a multivariate normal distribution with standard deviation 1.0 in all dimensions and no correlations. They conclude that a very high or very low acceptance rate of the MCMC indicates slow mixing, and that a good acceptance rate lies between 0.2 and 0.5. A high acceptance rate combined with high performance (e.g., a KL measure close to 0) indicates a well-performing algorithm that mixes fast. Analytically computing the optimal acceptance probability is only feasible for very simple target and proposal distributions and when using the Metropolis acceptance rule. Here, we restrict ourselves to experimentally monitoring the acceptance probability.

For the tested recombinative EMCMCs, we obtain good performance (i.e., a low KL distance) even for very high acceptance rates, which shows that recombination can improve the mixing of MCMCs. Furthermore, we show that algorithms with similar acceptance probabilities can have rather different performance.

A 20-bit BQP

For the first experiment, we set the string length to ℓ = 20 and transl = 50. Since F’s density is 0.1, only 40 elements of F have non-zero values. The non-zero integers are generated at random from the interval [−100, 100]. For the generated matrix F, we found the maximum fitness value 146; after translation, the maximum unnormalized probability value is 196. To generate the histogram, we group the individuals by fitness value and store the number of individuals in each group.

We set the population size to N = 20. Each generation, all individuals are randomly coupled into N/2 pairs such that each individual belongs to exactly one pair. We performed experiments for various mutation rates (from 0.05 to 0.5) and swapping probabilities for the uniform recombination (from 0.05 to 0.5). With each algorithm, we generate 20,000 individuals; our measurements are averaged over 50 runs. We throw away the first 10,000 generated individuals to diminish the impact of the starting points on the performance of the algorithms. This is called the burn-in period. Thus, in total, we sample 10,000 “useful” individuals, from which we generate Table 2 and the graphs in Fig. 3.
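The sampling protocol above (generate a fixed number of individuals, discard the first half as burn-in) can be sketched as follows, treating a chain abstractly as an iterator of states (a simplification of ours):

```python
def collect_samples(chain, total, burn_in):
    """Draw `total` states from a chain (any iterator of states)
    and discard the first `burn_in` of them as the burn-in period."""
    samples = [next(chain) for _ in range(total)]
    return samples[burn_in:]
```

In the experiment described here this would be called with total = 20000 and burn_in = 10000, leaving 10,000 "useful" samples.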

Table 2
Efficiency of (E)MCMCs for a BQP on 20 bits: the KL distances and acceptance probabilities
Fig. 3
a The frequencies and b the percentage of solutions found for each fitness value for MIC and rEMCMC on BQP on 20 bits; c how many times the sampled frequencies differ from the true distribution

The search space has 2²⁰ states. By expanding the target distribution, we are able to compare the frequencies of the samples generated with the tested (E)MCMCs with their values in the true distribution. In Table 2, we report the KL distance and acceptance ratio for the five (E)MCMCs. A two-sided nonparametric Mann-Whitney test with significance level p < 0.05 is used to verify whether the KL distances of the five tested algorithms are sampled from different distributions. The algorithms have statistically significantly different output, except for mutC and MCMC.
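The paper does not state which implementation of the Mann-Whitney test was used; as a sketch, the U statistic underlying such a test can be computed by direct pair counting (in practice one would use a statistical package, which also supplies the p-value):

```python
def mann_whitney_u(a, b):
    """Mann-Whitney U statistic for sample `a` versus sample `b`:
    the number of pairs (x, y), x from a and y from b, with x > y,
    counting ties as 1/2."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in a for y in b)
```

Here the two samples would be the KL distances of two algorithms over the 50 runs; a U value far from len(a)·len(b)/2 suggests the two sets of distances come from different distributions.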

The best algorithm, with the significantly lowest KL distance and the highest acceptance ratio, is the recombinative EMCMC, rEMCMC. The only difference between MIC, the algorithm with the second best KL distance, and rEMCMC is that rEMCMC uses recombination and MIC does not. Further, we observe that the two recombinative EMCMCs, rEMCMC and rEMCMCC, have a higher acceptance probability than the three non-recombinative (E)MCMCs, MCMC, MIC and mutC. This indicates that the recombinative proposal distribution Qm×Qunif, by exploiting the commonalities of the search space, is a “better” proposal distribution than Qm.

Furthermore, as already observed in the analytical experiments, the coupled acceptance rule αC has a negative influence on both recombinative and non-recombinative EMCMCs. Even though it uses αC, the recombinative rEMCMCC has the third best KL distance and the second best acceptance ratio, whereas mutC is the worst of all the algorithms. We explain the good behavior of rEMCMCC by the way uniform recombination synchronizes the individuals in the family: children that inherit the common parts of their parents have fitness similar to their parents’ fitness, so the algorithm accepts more individuals. In contrast, uniform mutation independently proposes two individuals in random directions; if one of the candidates has very low fitness, there is a high probability that both children are rejected. As a consequence, mutC has a low acceptance rate and, thus, low performance.

In accordance with the analytical results from the previous section, we observe that MIC, by using a population of MCMC chains, has a lower KL distance than the standard MCMC. Note that the acceptance ratios of these two algorithms are the same, but their KL distances are quite different.

In Fig. 3, we show experimental results for the two best performing (E)MCMCs from the previous section: MIC and rEMCMC. By using recombination, rEMCMC is a better sampler than MIC: in frequencies, rEMCMC (see Fig. 3a) is closer to the true target distribution than MIC is. Whereas rEMCMC preferentially samples the high-value regions of the target distribution, MIC typically samples the low-fitness individuals. Furthermore, rEMCMC finds more of the highly probable solutions than MIC; see Fig. 3b. Overall, in Fig. 3c, we notice that the distribution sampled with rEMCMC is closer to the true distribution than the one sampled with MIC. These results agree with those in Table 2, from which we conclude that rEMCMC, by proposing individuals with recombination, is the best performing algorithm for this particular problem.

A 100-bit BQP

We now show that recombination can improve mixing on a BQP with string size ℓ = 100, for which it is impractical to generate the target distribution. In this experiment, we therefore cannot compute the KL distance. Furthermore, we do not know the maximum value of this function, nor whether the values are uniformly distributed in some interval.

For this experiment, we compare all five (E)MCMCs as before, but we show results only for the two best algorithms, MIC and rEMCMC. Note that these two algorithms are the best performing algorithms in all three experiments.

Assuming that the unnormalized values of the distribution span a very large range, we group our samples by each individual’s number of ones in order to compare the (E)MCMC algorithms. Given this grouping, we compute the frequencies (Fig. 4a) and the mean fitness value (Fig. 4b) for each group.
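The grouping by number of ones can be sketched as follows (a helper of ours; fitness stands for any function mapping a binary string to its fitness value):

```python
from collections import defaultdict

def group_by_ones(samples, fitness):
    """Group binary-string samples by their number of ones; return,
    per group, the frequency and the mean fitness (cf. Fig. 4a and 4b)."""
    groups = defaultdict(list)
    for x in samples:
        groups[sum(x)].append(fitness(x))
    freq = {k: len(v) for k, v in groups.items()}
    mean = {k: sum(v) / len(v) for k, v in groups.items()}
    return freq, mean
```

The frequency dictionary corresponds to Fig. 4a and the mean-fitness dictionary to Fig. 4b.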

Fig. 4
a The frequencies and b the average fitness of the found solutions for each number of ones for MIC and rEMCMC on BQP on 100 bits

Again, we set the density of F to 0.1; thus, approximately 1,000 elements of F are non-zero. We generate these non-zero integers uniformly at random from the interval [−100, 100]. To obtain a distribution with positive values, we set transl = 1,250. We set the population size to N = 100 and run each algorithm 50 times. With each algorithm, we generate 100,000 individuals for the burn-in, which we throw away, and use the next 100,000 generated individuals. Again, we vary the mutation rate and the swapping rate between 0.05 and 0.5. In Fig. 4, we show results for MIC with mutation rate 0.2, and rEMCMC with swapping probability 0.5 and mutation rate 0.01. With these settings, MIC has an acceptance rate of 30%, whereas rEMCMC has an acceptance rate of 78%. We performed experiments with various other mutation and swapping probabilities, but the results do not differ much from the ones shown here.

In Fig. 4a, we notice that rEMCMC samples slightly more individuals with a higher number of ones than MIC does. Apart from that, the distributions sampled by MIC and rEMCMC are similar, and both sample mostly individuals with an equal number of zeros and ones, indicating that the target distribution is symmetric around these individuals. Despite that, the mean values of the sampled individuals are remarkably larger for rEMCMC than for MIC. It seems that 100,000 individuals are not enough for MIC’s burn-in, whereas for rEMCMC they are. We also performed experiments with a single-chain MCMC; its mean values are worse than those of MIC. We explain this by the shape of this BQP: many peaks separated by many low-fitness individuals. We therefore consider that MIC mixes more slowly than rEMCMC: N = 100 chains are not enough to cover the number of these peaks, and thus MIC always has trouble escaping from these peaks to find the other useful ones. To sample the same number of individuals, an increase in population size must be combined with a decrease in the time each MCMC runs. The less time we allow an MCMC to run, the worse it samples the search space; eventually, when the population size goes to infinity, MIC is just a random sampler.

Conclusions and discussion

We discussed aspects of the Evolutionary MCMC framework, a class of population-based MCMC algorithms that exchange useful information by using recombination and selection. The main issue for EMCMC algorithms is to improve the performance of the sampling process, that is, the convergence time to a desired distribution. Detailed balance is a straightforward and sufficient, but not necessary, condition for an irreducible and aperiodic EMCMC to converge to a given distribution.

We aim to increase the efficiency of MCMCs by the use of recombination. Recombination operators can generate “good” proposal distributions that exploit the structure of the search space, such that EMCMCs using them converge faster to the target distribution. Of course, when the search space has no structure, or its structure is not correctly matched with the recombination operator, the recombinative proposal distribution offers no advantage: it will at best be as efficient as a uniformly random proposal distribution, and in the worst case even less efficient.

We proposed various recombinative proposal distributions on discrete spaces and we studied how to integrate them into EMCMCs with detailed balance. Since we consider only respectful recombinations, which are reducible, we have to combine recombination with mutation in order to obtain irreducible EMCMCs. We focus on discrete space recombinations and study the properties of discrete space EMCMCs resulting from the interaction of recombinative proposal distributions and MH acceptance rules. The analytical and experimental results show that EMCMCs can outperform the standard MCMC sampling algorithms by using recombination operators.

We have proposed and investigated various MH acceptance rules derived from EC’s selection strategies. In order to obtain a recombinative EMCMC with detailed balance, the children proposed by the recombination operator need to be all accepted or all rejected with the coupled acceptance rule.

Both analytical investigations and experimental tests show that the recombinative EMCMC in which a child individually competes against a parent in the standard MH acceptance rule is the best sampler. In the experimental section, for very large search spaces and a small number of samples, this EMCMC, rEMCMC, samples the high-probability regions of the search space faster than an EMCMC using the coupled acceptance rule. Thus, in a short time, rEMCMC approximates the desired distribution better. However, there is no theoretical guarantee that rEMCMC converges to the target distribution. We also proposed the nested EMCMCs, which individually accept or reject fit states with an EMCMC without detailed balance. Even though the nested EMCMCs built on the proposed unbalanced EMCMC have theoretical value, by indicating a correction term for the sampled distribution, its computation is impractical.

Finally, we also discussed a recombinative EMCMC without detailed balance that can nevertheless be useful for optimization purposes. It is a straightforward extension of elitist replacement to an MH acceptance rule: two parents compete against two children, and the best two of the four are selected for the next generation. This EMCMC can be used only for optimization, since it mainly samples the fittest regions of the sampled distribution at the expense of the less fit regions. Its disadvantage is that it can get stuck for a long time in good, but isolated, modes of the sampled distribution.

We conclude that one should be careful when adopting recombination and selection operators from EC into the population-based MCMC framework. Population-based techniques that are suited for optimization can be less suitable for sampling, and vice versa. To have a positive impact on the sampling performance of interacting MCMC chains, recombination and selection techniques need to follow the design principles outlined in this paper.

Open Access

This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

Appendix 1: Proof for Proposition 7

Qm×r and Qr×m are irreducible because they have non-zero probabilities to go from any population to any other population, Qm×r > 0 and Qr×m > 0.

When Qr is symmetric, we have

equation M146

Qr×m and Qm×r are symmetrical for recombinations that swap alleles, because mutation generates the alleles in which the two populations differ and recombination swaps them, or vice versa.

By means of an example, we prove that Qdif×m is non-symmetrical. Consider the current population of bits X(t) = {0, 1, 0} and the candidate population Y = {1, 1, 1}, the mutation rate of 1/3, and, for simplicity, the xor operator. We compute the probability to generate Y from X(t) with uniform mutation and then with xor recombination and the inverse probability to generate X(t) from Y.

Let’s consider all possible parent choices for xor. With the xor recombination, given the distance equation M147 between the first two bits, we generate 1 from the third bit of the current population 0; the intermediate population is now equation M148. The distance between the second and the third bit is also equation M149, and thus the intermediate population is again equation M150. Since the distance between first and second bits of the current population is equation M151, we generate 1 from 1 and the intermediate population is equation M152. When we mutate the intermediate populations, we have equation M153= (1/3)2 ·2/3 and equation M154 = (2/3)2 ·1/3. Computing in a similar manner the inverse probability, for all possible intermediate populations, we have equation M155.

To generate X(t) from Y with Qdif×m, we mutate Y to equation M156 and then swap with the xor operator the last bit given the difference between the first two bits resulting in X(t). Similarly, we mutate Y to equation M157 and swap the first bit of equation M158 or we mutate into equation M159 and do not swap the middle bit with xor since the difference between the first and the last bit is 0. We then have equation M160 = 1/3 · (2 · (2/3)2 · 1/3+1/3 ·(2/3)2) = 4/81.

We conclude that Qdif×m is not symmetrical since equation M161equation M162

Appendix 2: Proof for Proposition 8

Qm×mask is irreducible, since it has equation M163. If px = 1/ℓ, then Qm×mask is equivalent to the mutation operator, since all alleles in the parents can be flipped with probability 1/ℓ. Then Qm×mask is symmetric.

For pm = 1/2, we generate the child yi+1 uniformly at random from the mask equation M164. Then Qm×mask is symmetric, since the common and uncommon parts of the parents and the children are randomly generated.

By means of an example, we show that Qm×mask is non-symmetrical for other values of pm and px. Consider equation M165 and equation M166. When yi = 1 and yi+1 = 0, the probability to generate yi is 1/ℓ, and the probability to generate yi+1 is 1 − pm. The inverse probability is equation M167 = (1 − px) · (1 − pm). When yi = 0 and yi+1 = 1, the probability to generate yi is 1 − 1/ℓ and the probability to generate yi+1 is pm. The reverse probability is now px · pm. Then equation M168 and equation M169 = (1 − px) · (1 − pm) + px · pm. Hence, if px ≠ 1/ℓ and pm ≠ 1/2, then Qm×mask is non-symmetrical. equation M170

Appendix 3: Proof for Theorem 1

We consider the EMCMC resulting from the interaction between the transition matrix Q and the (generalized) Metropolis acceptance rule as an EMCMC over the N-dimensional search space EN. For ease of exposition and without loss of generality, consider populations of two individuals, N = 2. Two children {y1, y2} are generated with some irreducible and symmetrical proposal distribution Q from two parents equation M171. Then equation M172.

The Metropolis acceptance rule in this case is equation M173. We denote the transition matrix by KC. The transition probability that two children y1 and y2 are generated and both are accepted is

equation M174

The transition probability that both children are rejected is

equation M175

Let’s assume without loss of generality that equation M176. Then,

equation M177

and equation M178.

We now show that the detailed balance condition holds

equation M179

The marginal transition probability to generate equation M180 from y1 when summing over the variables of the second chain is

equation M181

The stationary marginal distribution of the first chain is equation M182

From the above equations we infer

equation M183

where we have used

equation M184

We conclude that the marginal target distribution for the i-th chain is Pi(·) and that this EMCMC algorithm has the stationary distribution P1(·)×P2(·).

The EMCMC algorithm is irreducible since the proposal distribution Q is irreducible. The algorithm is aperiodic since the Metropolis algorithm is, by construction, aperiodic. We conclude that this EMCMC is ergodic with stationary distribution P1(·)×P2(·), where Pi(·) is the marginal target distribution of the i-th chain. equation M185

Appendix 4: Proof of Proposition 10

Consider two parents equation M186 and equation M187, their two generated children y1 and y2, the coupled acceptance rule αC that accepts/rejects both states together, and the acceptance rule α1.1 that accepts/rejects one or both children individually. First, we prove the first inequality from Proposition 10

equation M188

The right side of this inequality can be rewritten as

equation M189

because equation M190 and, thus equation M191. The left side of the inequality 3 can be rewritten as

equation M192

The inequality 3 holds since

equation M193

We now prove that the rejection probability for a less fit child, equation M194, is larger for K1.1 than for KC when the second child is more fit than the second parent, equation M195. Then

equation M196

The right side of the inequality 4 is

equation M197

Rewriting the left side of the inequality 4

equation M198

The inequality 4 holds since

equation M199

Following the same line of reasoning, it follows directly that the rejection probability of a less fit child, equation M194, is lower for K1.1 than for KC when the second child is less fit than the second parent, equation M201.

We now show that the EMCMC defined by K1.1 has detailed balance if and only if

equation M202

The detailed balance should hold only in the case that two different children are proposed but only one of them is accepted and the other is rejected

equation M203

Or the above equation holds if and only if

equation M204

If the detailed balance condition holds and the proposal distribution is irreducible and symmetrical, the EMCMC is ergodic and converges to the target distribution P1(·)×P2(·). equation M205

Appendix 5: Proof of Proposition 11

We show that ECA does not have detailed balance for any non-uniform distribution. Consider three states equation M206 such that equation M207. In the discussion in Sect. 5.3 we showed that

equation M208

If ECA has detailed balance, then

equation M209

Because Q is symmetrical we further have

equation M210

KC is also symmetrical, so further we have that

equation M211

The above equation does not hold, since the first sum requires that

equation M212

and for the second sum that

equation M213

We conclude that KECA does not have detailed balance for any Q.

When Q is irreducible, this algorithm is irreducible, since it can generate any state from any other state with non-zero probability. The algorithm is also aperiodic, since KC is aperiodic, and thus the target distribution of ECA exists. equation M214

Appendix 6: Proof of Proposition 13

The proof is split into three parts, corresponding to the three equations in the proposition.

  1. Both children are different from their parents. Then
    equation M215
    The reverse transition probability is
    equation M216
    We now have
    equation M217
    where Q is symmetrical and thus equation M218. By replacing the definition of acceptance rule α, we have
    equation M219
    and
    equation M220
    The coupled acceptance probability for the nested acceptance probability is now
    equation M221
    The nested transition probability now is
    equation M222
  2. One child is different from its parent. For example y2 is rejected and y1 is accepted. Then
    equation M223
    The reverse transition probability is
    equation M224
    We now have
    equation M225
    where equation M226. Now, the coupled acceptance is
    equation M227
    where equation M228. The second equation from the proposition now follows directly.
  3. Both children are rejected. Then
    equation M229

equation M230

Appendix 7: Proof of Proposition 14

The coupled acceptance probability for the nested acceptance probability is now

equation M231

where Q is symmetrical

equation M232

and

equation M233

The equation in the proposition follows directly. equation M234

Contributor Information

Madalina M. Drugan, madalina@cs.uu.nl.

Dirk Thierens, dirk@cs.uu.nl.

References

1. Andrieu C, de Freitas N, Doucet A, Jordan M (2003) An introduction to MCMC for machine learning. Mach Learn 50:5–43
2. Baluja S, Davies S (1997) Using optimal dependency-trees for combinational optimization. In: Proceedings of the international conference on machine learning (ICML’97). Morgan Kaufmann, San Francisco, pp 30–38
3. Campillo F, Rakotozafy R, Rossi V (2009) Parallel and interacting Markov chain Monte Carlo algorithm. Math Comput Simul 79(12):3424–3433
4. Chow CK, Liu CN (1968) Approximating discrete probability distributions with dependence trees. IEEE Trans Inf Theory 14:462–467
5. Corander J, Ekdahl M, Koski T (2008) Parallel interacting MCMC for learning of topologies of graphical models. Data Min Knowl Discov 17(3):431–456
6. Doucet A, de Freitas N, Gordon N (eds) (2001) Sequential Monte Carlo methods in practice. Springer, Berlin
7. Drugan MM, Thierens D (2004) Evolutionary Markov chain Monte Carlo. In: Artificial evolution, LNCS 2936. Springer, Berlin, pp 63–76
8. Drugan MM, Thierens D (2005) Recombinative EMCMC algorithms. In: Proceedings of the IEEE congress on evolutionary computation, CEC’05. IEEE Press, Piscataway, pp 2024–2031
9. Fearnhead P (2008) Special issue: adaptive Monte Carlo methods. Stat Comput 18(4):341–480
10. Geraci J (2008) A new connection between quantum circuits, graphs and the Ising partition function. Quantum Inf Process 7(5):227–242
11. Geyer CJ (1991) Markov chain Monte Carlo maximum likelihood. In: Computing science and statistics: proceedings of the 23rd symposium on the interface. pp 156–163
12. Gilks WR, Richardson S, Spiegelhalter DJ (eds) (1996) Markov chain Monte Carlo in practice. Chapman & Hall, London
13. Hastings WK (1970) Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57(1):97–109
14. Kirkpatrick S, Gelatt CD, Vecchi MP (1983) Optimization by simulated annealing. Science 220(4598):671–680 [PubMed]
15. Laskey KB, Myers JW (2003) Population Markov chain Monte Carlo. Mach Learn 50:175–196
16. Liang F, Wong WH (2000) Evolutionary Monte Carlo: applications to Cp model sampling and change point problem. Statistica Sinica 10:317–342
17. Mahfoud SW, Goldberg DE (1995) Parallel recombinative simulated annealing: a genetic algorithm. Parallel Comput 21:1–28
18. Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E (1953) Equation of state calculations by fast computing machines. J Chem Phys 21:1087–1092
19. Pelikan M, Goldberg DE, Lobo F (2002) A survey of optimization by building and using probabilistic models. Comput Optim Appl 21(1):5–20
20. Radcliffe NJ (1991) Forma analysis and random respectful recombination. In: Proceedings of the fourth international conference on genetic algorithms. Morgan Kaufmann, pp 222–229
21. Roberts GO, Gelman A, Gilks WR (1997) Weak convergence and optimal scaling of random walk Metropolis algorithms. Ann Appl Probab 7(1):110–120
22. Storn R, Price K (1997) Differential evolution—a simple and efficient heuristic for global optimization over continuous spaces. J Glob Optim 11(4):341–359
23. Strens MJA (2003) Evolutionary MCMC sampling and optimization in discrete spaces. In: Proceedings of the international conference on machine learning (ICML’03). pp 736–743
24. ter Braak CJF (2006) Genetic algorithms and Markov chain Monte Carlo: differential evolution Markov chain makes Bayesian computing easy. Stat Comput 16(3):239–249
25. Thierens D, Goldberg DE (1994) Elitist recombination: an integrated selection-recombination GA. In: Proceedings of the congress on computational intelligence. pp 508–512
26. Wolpert DH, Lee CF (2005) Adaptive Metropolis sampling and optimization with product distributions. Technical report, NASA Ames Research Center
27. Zhang BT, Cho DY (2001) System identification using evolutionary Markov chain Monte Carlo. J Syst Archit 47:587–599
