Stochastic models for finite binary vectors are widely used in sociology, with examples ranging from social influence models on dichotomous behaviors or attitudes to models for random graphs. Exact sampling for such models is difficult in the presence of dependence, leading to the use of Markov chain Monte Carlo (MCMC) as an approximation technique. While often effective, MCMC methods have variable execution time, and the quality of the resulting draws can be difficult to assess. Here, we present a novel alternative method for approximate sampling from binary discrete exponential families having fixed execution time and well-defined quality guarantees. We demonstrate the use of this sampling procedure in the context of random graph generation, with an application to the simulation of a large-scale social network using both geographical covariates and dyadic dependence mechanisms.
Stochastic models for finite binary vectors are widely used in sociology, with examples ranging from social influence models on dichotomous behaviors or attitudes (Coleman, 1964; Latané, 1996; Holyst et al., 2001; Robins et al., 2001) to models for random graphs (Holland and Leinhardt, 1981; Skvoretz et al., 2004; Robins et al., 2005; Wasserman and Robins, 2005). Exact sampling for such models is often quite difficult in the presence of dependence, generally leading to the use of Markov chain Monte Carlo (MCMC) as an approximation technique (Snijders, 2002; Hunter et al., 2008). While often effective, MCMC methods have variable execution time (i.e., time to convergence), can be slow to converge (Bhamidi et al., 2011), and produce draws whose quality can be difficult to assess (Gelman, 1996). Here, we present an alternative method for approximate sampling from binary discrete exponential families having fixed execution time and well-defined quality guarantees. In many cases, worst-case complexity of the sampling process is no worse than the worst-case complexity of an exact sampler for independent, inhomogeneous binary variables (i.e., order N, where N is the number of variables). We demonstrate the use of this sampling procedure in the context of random graph generation, with application to simulation of exponential family random graph models (ERGMs) fit to network data, and to the exploratory simulation of large-scale social networks using both geographical covariates and dyadic dependence mechanisms.
Let Y = Y1, . . . , YN be a finite vector of binary-valued random variables with joint support 𝒴 ⊆ {0, 1}^N. In a social network context, each Yi represents an indicator for the presence of an edge between two vertices (i.e., Y is a vectorized adjacency matrix); in other settings, Yi might indicate, e.g., an individual's agreement or disagreement with an item on a survey instrument, the presence or absence of an individual or organization at an event, or an individual's decision to engage in an observable behavior at some point in time. Without loss of generality, we may write the joint distribution of Y in discrete exponential family form as

    Pr(Y = y | θ) = I_𝒴(y) exp(θᵀt(y)) / Σ_{y′ ∈ 𝒴} exp(θᵀt(y′)),   (1)

where t : 𝒴 → ℝ^k is a vector of sufficient statistics, θ ∈ ℝ^k is a parameter vector, and I_𝒴 is an indicator function for membership in the support. Note that we may also write Y's joint distribution in a very different, hierarchical form:

    Pr(Y = y | θ) = Pr(Y1 = y1 | θ) Π_{i=2}^{N} Pr(Yi = yi | Y1 = y1, . . . , Yi−1 = yi−1, θ).   (2)
With the exception of relatively trivial cases, neither Equation 1 nor Equation 2 is of much direct use: direct calculation of the normalizing factor of the former is generally prohibitive, and the partially marginalized distributions of the latter (which involve similar normalizing factors) are generally unknown. Standard approaches to simulating Y in the general case typically employ Markov chain Monte Carlo methods, exploiting the fact that

    Pr(Y = y | θ) / Pr(Y = y′ | θ) = exp(θᵀ(t(y) − t(y′)))   (3)

for y, y′ ∈ 𝒴, and hence that calculations depending only on probability ratios for alternative realizations of Y (as required, e.g., for Metropolis-Hastings algorithms) are often computationally inexpensive. (See, e.g., Morris et al. (2008) in the ERGM case.) Our approach in the present paper utilizes a different (but closely related) property of Equation 1, namely that

    Pr(Yi = 1 | Y−i = y−i, θ) = logit⁻¹(θᵀ(t(y⁺ᵢ) − t(y⁻ᵢ))),   (4)

where Y−i refers to all elements of Y other than the ith, y⁺ᵢ refers to the vector y such that the ith entry is set to 1, and y⁻ᵢ refers to the same vector with the ith entry set to 0. (We have here suppressed the support term for clarity of notation: the above probability is obviously equal to 0 if y⁺ᵢ ∉ 𝒴, and to 1 if y⁻ᵢ ∉ 𝒴.)
To see how we may leverage this property, let us consider Equation 2. While we cannot in general compute the marginal elements of this expression (the ith of which is a sum over a set of states that grows exponentially as 2^(N−i)), we can bound them using the full conditionals of Equation 4. Specifically, let Y<i refer to the elements Y1, . . . , Yi−1 of Y. By definition,

    Pr(Yi = 1 | Y<i = y<i, θ) = Σ_{y>i} Pr(Yi = 1 | Y<i = y<i, Y>i = y>i, θ) Pr(Y>i = y>i | Y<i = y<i, θ).   (5)

The marginal probability that Yi will take state 1 (versus 0) given the states of the previous elements is thus a convex combination of the full conditionals for Yi = 1 over all combinations of subsequent states. Let us define two vectors of extremal full conditionals, α and γ, such that αi = min_{y−i} Pr(Yi = 1 | Y−i = y−i, θ) and γi = max_{y−i} Pr(Yi = 1 | Y−i = y−i, θ). For simplicity in notation, let us similarly define βi = Pr(Yi = 1 | Y<i = y<i, θ). From the convexity of the sum in Equation 5, it follows immediately that αi ≤ βi ≤ γi, and that the vectors of extrema thus bound the vector of marginal state-1 probabilities.¹
This last observation has been used in prior work by Butts (2011) to provide bounds on the behavior of random graph models. It can also be used, however, as the basis for a novel approximate sampling scheme for any finite binary vector Y that differs considerably from the standard MCMC approach. We now turn to a description of this algorithm.
We begin by describing the approach in intuitive form, subsequently turning to the issue of efficient implementation. Let R = R1, . . . , RN be a vector of iid uniform random variables on the unit interval. Equation 2 suggests an obvious (exact) sampling scheme for Y: starting with Y1 and progressing in order Y2, Y3, . . . , Yi, set yi = 1 if ri < Pr(Yi = 1 | Y<i = y<i, θ), and 0 otherwise. This, of course, requires that we know the corresponding marginal distributions, which as noted are incomputable in the general case. In some circumstances, however, such knowledge may be unnecessary. Consider, for instance, the case in which (for some given realization r of R) ri < αi. Since the corresponding marginal probability that Yi = 1 is greater than or equal to αi, it follows that yi for this realization must be equal to 1. Likewise, if ri ≥ γi, yi must be equal to 0 (since we know that the probability that Yi = 1 is at most γi). Even where we cannot pin down β exactly, it may therefore be possible to "fix" many values of y in any given realization simply through the α, γ bounds. Furthermore, the fixation of this first collection of y values may, by allowing refinement of the bounds associated with the remaining variables, fix yet more values. In this way, the initial fixation process can spawn a "fixation cascade" that may result in many (or occasionally all) of the elements of y being exactly determinable. Of course, it may also be the case that this process terminates with some elements remaining unfixed; at this point, some intervention is required. The expedient we propose in this case is to approximate the first unfixed βi by a uniform draw between the (updated) values of αi and γi, fixing yi to 1 if ri is less than this approximation, and to 0 otherwise. Any fixation cascade created by this operation is then resolved, and the process is repeated until no variables remain unfixed.
While the result of this process is not (in general) an exact draw from Y, it may prove to be a close approximation (especially if the initial fixation cascade covers most of the elements of Y). Unlike MCMC-based methods, the worst-case execution time for this algorithm is fixed and frequently linear in N, with additional performance enhancements available in some circumstances (as discussed below). Also unlike MCMC algorithms, this sampling method leads to a well-defined bound on the quality of the resulting draw. We return to this point in Section 3.
Our base algorithm is fairly straightforward: we advance through the elements of y, fixing each element as we proceed. (The order in which elements of y are visited is arbitrary, a fact that we exploit in Section 2.2.1.) This process is shown in detail in Algorithm 1. Effectively, the operations performed when setting each yi consist of three steps: initialization, fixation, and perturbation. In the initialization step (lines 2-4), we compute an updated pair of bounds (α′i and γ′i) for βi, and draw the random input ri. The bound pair (α′i, γ′i) is a refinement of the unconditional bounds (αi, γi), reflecting the fact that the elements y<i of y are known by the ith iteration. This is the mechanism by which the "fixation cascade" described above is implicitly employed: previously encountered elements of y fixed by their r values have the potential to narrow the interval between α′i and γ′i, thereby increasing the chance that the current element will itself be fixable. (Of course, fixation may also occur due to the impact of perturbed elements; we return to this issue in Section 3.)
After initialization, we proceed to the fixation step (lines 5-8). Here, we exploit the fact that α′i ≤ βi ≤ γ′i to immediately recognize that yi|y<i is necessarily equal to 1 if ri < α′i, and equal to 0 if ri ≥ γ′i, setting yi accordingly. In some cases, of course, fixation fails: if α′i ≤ ri < γ′i, we cannot say where βi lies relative to ri, and thus cannot set yi. We then proceed to the perturbation step (lines 10-14), drawing an estimated threshold τ ~ U(α′i, γ′i) and setting yi = 1 if ri < τ (else 0). Having set the value of yi (either immediately or following perturbation), we then proceed to the next element of y, repeating this process until all elements have been set.
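The initialization/fixation/perturbation loop can be sketched concretely for a simple homogeneous reciprocity model, in which Pr(Yij = 1 | rest) = logit⁻¹(θe + θm yji). This is a minimal illustrative sketch (the function names, edge ordering, and model choice are ours, not part of Algorithm 1 itself); note that once the reciprocal edge has been set, the full conditional is exact and the bounds collapse to a point:

```python
import math
import random

def inv_logit(x):
    """Inverse logit (logistic) function."""
    return 1.0 / (1.0 + math.exp(-x))

def bound_sample_recip(n, theta_e, theta_m, rng):
    """Approximate draw from a homogeneous reciprocity model on n nodes:
    Pr(Y_ij = 1 | rest) = inv_logit(theta_e + theta_m * y_ji).
    Returns a dict mapping directed pairs (i, j) to 0/1."""
    # unconditional extremal full conditionals (same for every edge)
    alpha = inv_logit(theta_e + min(0.0, theta_m))
    gamma = inv_logit(theta_e + max(0.0, theta_m))
    y = {}
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # initialization: refine the bounds given earlier elements;
            # once the reciprocal edge is set, the conditional is exact
            if (j, i) in y:
                p = inv_logit(theta_e + theta_m * y[(j, i)])
                a_pr, g_pr = p, p
            else:
                a_pr, g_pr = alpha, gamma
            r = rng.random()
            # fixation: r below alpha' forces a 1, r at/above gamma' a 0
            if r < a_pr:
                y[(i, j)] = 1
            elif r >= g_pr:
                y[(i, j)] = 0
            else:
                # perturbation: approximate beta_i by tau ~ U(alpha', gamma')
                tau = rng.uniform(a_pr, g_pr)
                y[(i, j)] = 1 if r < tau else 0
    return y
```

With θm = 0 the bounds coincide and every element is fixed exactly (the procedure reduces to exact inhomogeneous Bernoulli sampling); as |θm| grows, the (α, γ) gap widens and perturbation becomes more common.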
As we visit each element of y once, the complexity of Algorithm 1 is clearly O(N f(N)), where f(N) is the worst-case complexity of the initialization step (i.e., determining the bounds). In many cases (including network models based on degree, reciprocity, and mixing terms), this step can be performed in constant time, leading to a worst-case complexity that is linear in the number of variables. As this is also the worst-case complexity for simulation of N inhomogeneous Bernoulli trials, Algorithm 1 can clearly achieve optimal performance in such settings. Also, unlike MCMC algorithms (which must be run for an unknown period to obtain convergence), the execution time for Algorithm 1 is fixed and independent of the distributional properties of Y. These are both attractive features for applications requiring user-unsupervised sampling (e.g., normalizing factor computation), wherein convergence assessment is difficult and predictable execution time is highly desirable.
Various modifications and/or refinements of Algorithm 1 are possible, either to enhance speed or to improve the quality of the resulting draws. Here, we discuss several such refinements, before turning to the problem of quality assessment.
One observation that can prove useful when constructing refinements to the sampling process of Algorithm 1 is the fact that the order in which the elements of Y are given in Equation 2 (and hence the sampling algorithm) is arbitrary; we can thus re-order the sequence in which we visit the elements of Y within the sampling algorithm, so long as our choice does not depend on the realizations of R or previous elements of Y. A straightforward application of this idea can be used for quality improvement when the elements of Y are heterogeneous, with some close ex ante to fixation. Specifically, we modify Algorithm 1 by adding an ordering step prior to the main loop, in which we sort the elements of Y in ascending order by γi − αi. This results in an algorithm that begins by visiting the elements of Y most likely to be fixed, increasing the probability that the state of later elements will be decided by Y states assigned through a fixation cascade involving only R-fixed elements, rather than by a cascade "contaminated" by perturbation. In the general case, this results in a total complexity of O(N(log N + f(N))) for the sampling algorithm; where f(N) grows more slowly than log N, the pre-sorting operation will dominate the worst-case complexity of the sampler. Where f(N) grows more quickly, by contrast, the complexity of the pre-sort is at most of the same order as the sampler itself, and its cost is not a substantial consideration.
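The ordering step itself is trivial; a minimal sketch (our own naming), given precomputed bound vectors:

```python
def fixation_sort_order(alpha, gamma):
    """Return a visit order for the elements of Y, ascending in the
    bound gap gamma_i - alpha_i, so that the elements most likely to
    be fixed without perturbation are handled first."""
    return sorted(range(len(alpha)), key=lambda i: gamma[i] - alpha[i])
```

The sort is O(N log N) and is performed once, before the main loop of Algorithm 1.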
Another family of refinements that can be employed when most elements of α are large (close to 1) or most elements of γ are small (close to 0) is to take the elements of y equal to 1 (respectively 0) by default, and explicitly evaluate only elements at high risk of taking the alternate state. (This is a technique used in the simulation of sparse Bernoulli graphs, where the expected number of Yi equal to 1 is often far smaller than N; see, e.g., Batagelj and Brandes (2005).) Consider, for example, the "sparse" case in which all γi ≤ ρ for some ρ ∈ (0, 1). We may then modify Algorithm 1 as follows. Rather than evaluating each yi explicitly, we begin by drawing k ~ Geom(ρ) (where Geom is the geometric distribution, here counting failures before the first success), setting i = k + 1, and treating all y<i as 0. We then explicitly draw ri ~ U(0, ρ) and evaluate yi as usual. Once yi has been set, we draw a new k from the same distribution, set i = i + k + 1, and update the new yi (again treating all intervening elements of y as 0, and taking ri ~ U(0, ρ)). This process is repeated until all elements of y have been set. Since updating y cannot increase any element of γ, it is not necessary to change ρ during execution; although probabilistically identical to Algorithm 1, this modified procedure will reduce the number of element update executions by an expected factor of 1 − ρ. (A directly analogous procedure can be used for the large-α case.) While the efficiency gain from this modification is constant in ρ, it can be consequential where ρ itself varies with N. The above-cited case of sparse graphs provides one example of such a circumstance, but other cases with similar properties (e.g., grid-based point process approximations) occur in a number of fields.
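The geometric skipping scheme can be sketched as follows (a hypothetical wrapper under our own naming; `update_element` stands in for the per-element fixation/perturbation step applied to a visited position, and must set yi from an input r on [0, ρ)):

```python
import math
import random

def skip_sample(N, rho, update_element, rng):
    """Set y_0..y_{N-1} when every gamma_i <= rho: positions whose
    uniform input r_i lands at or above rho are necessarily fixed to 0,
    so we jump over them with geometric skips and evaluate only the rest."""
    y = [0] * N
    i = -1
    while True:
        # skip length: number of consecutive positions with r >= rho;
        # inverse-cdf draw of a Geom(rho) "failures before success" count
        u = rng.random()
        i += 1 + int(math.log(1.0 - u) / math.log(1.0 - rho))
        if i >= N:
            return y
        # visited position: conditional on being visited, r is U(0, rho)
        r = rng.uniform(0.0, rho)
        y[i] = update_element(i, r)
```

Each position is independently visited with probability ρ, so the procedure is probabilistically identical to drawing all N uniforms, while performing only about ρN explicit updates.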
As noted in Section 2, when fixation fails we must approximate the true probability that Yi = 1, βi, in order to carry out the perturbation process. Intuitively, setting τ (in the notation of Algorithm 1) equal to βi will produce an exact draw from the sampler; although we do not know the values of β, we may still hope for a more intelligent approximation to the uniform draw of Algorithm 1. Here, we describe two approaches to this problem, both of which can improve performance in practice.
Our default method shown in Algorithm 1 is equivalent to treating βi as entirely unknown, save for the fact that it lies within the interval (α′i, γ′i). In practice, however, we are not entirely ignorant regarding β. For instance, consider the case in which the "1-density" of Y, δ = E Σi Yi / N, is exactly or approximately known. Clearly, for a randomly selected element Yi of Y, Eβi = δ; thus, we have in this case some "a priori" knowledge about β that could in principle be used to aid simulation.
To formalize this idea, let the pdf f represent our "prior" beliefs about an arbitrary βi. (We will here assume that our "priors" are exchangeable; the term is used in quotation marks to emphasize that our use here is heuristic, and not assumed to satisfy the criteria for a proper Bayesian analysis.) Imagine that we subsequently learn that βi ∈ [α′i, γ′i]. The corresponding "posterior" density for βi is then

    f(βi | α′i ≤ βi ≤ γ′i) = f(βi) I_[α′i, γ′i](βi) / (F(γ′i) − F(α′i)),
where I is defined as before as a set indicator function, and F is the prior cdf of βi. For random input ri (perturbation occurring only when α′i ≤ ri < γ′i), the chance of setting Yi = 1 then corresponds to

    Pr(τ > ri) = ∫_{ri}^{γ′i} f(βi | α′i ≤ βi ≤ γ′i) dβi = (F(γ′i) − F(ri)) / (F(γ′i) − F(α′i)),

with the middle term being an integral of the "posterior" density, as derived above.
Taking f to be the uniform distribution on (0, 1) gives the default perturbation method. However, there is no reason that we must employ this distribution. For instance, a natural alternative choice would be Beta(δk, (1 – δ)k), where δ is the "1-density" (or an approximation thereto) and k is a user-selected dispersion parameter. In a conventional Bayesian setting, k = 2 provides a prior weight equal to that of a uniform distribution, with k = 1 being the weight of the Jeffreys prior for a binomial likelihood. (Specifically, both of these distributions can be realized by the appropriate choice of k for δ = 0.5.) Here, our selection is heuristic, and should be chosen to produce threshold distributions that have reasonable behavior when the bounding method fails. In the absence of additional information, small k (e.g., ≈ 1) with δ chosen as above leads to a distribution that is relatively flat in its middle range, but otherwise weighted heavily towards extreme thresholds (in the direction of δ). For models that combine low/high baseline probabilities of Yi = 1 with (respectively) strong excitatory/inhibitory dependence effects, the resulting threshold probabilities are likely to be much better approximations of the true β elements than would be obtained by the default method.
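A simple way to draw a threshold from the truncated Beta "pseudo-prior" is by rejection, falling back to the default uniform draw if the truncation window is rarely hit; the text does not prescribe a sampling mechanism, so this is purely our own expedient (names and fallback policy are illustrative):

```python
import random

def beta_threshold(lo, hi, delta, k, rng, max_tries=1000):
    """Draw a threshold tau from a Beta(delta*k, (1-delta)*k)
    'pseudo-prior' truncated to (lo, hi) = (alpha'_i, gamma'_i),
    by simple rejection; fall back to a uniform draw on (lo, hi)
    if the window is not hit within max_tries attempts."""
    a, b = delta * k, (1.0 - delta) * k
    for _ in range(max_tries):
        t = rng.betavariate(a, b)
        if lo < t < hi:
            return t
    return rng.uniform(lo, hi)
```

For narrow (α′i, γ′i) windows an inverse-cdf draw would be more efficient, but rejection keeps the sketch dependency-free.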
As a final comment on this technique, it should be noted that additional information on the distribution of β can be gleaned in settings for which some observation from Y (or an approximation thereto) is available. A typical case is that in which Y|θ is chosen as a data model for observation y, which (to the extent that the model is adequate) can then be viewed as an approximate draw from the random process. In this latter case, Equation 4 gives a direct method for evaluating the full conditionals for a single Y element, given the rest of the realized process. Although the full conditionals for a single realization are not equal to the true β values for that realization, they nevertheless provide a rough approximation thereto. (This is, indeed, the basis for the pseudo-likelihood approximation of Besag (1974).) Inspection of the properties of the full conditional distribution may thus prove useful in choosing f, e.g., by suggesting good approximate δ values and/or choices for k in the Beta case, and/or alternative distributional forms. Since f is only used when the fixation process fails, its selection will have little impact for models requiring little perturbation. For less constrained models, however, intelligent approximation of β may greatly aid performance. Fortunately, one virtue of the sampling strategy employed here is that we may predict this degree of constraint on an ex ante basis. We turn to a discussion of this problem in Section 3.
As shown in Equation 5, the marginal probability βi is a convex combination of the full conditionals of Yi, mixed over the possible realizations of Yi+1, . . . , YN. Direct computation of this quantity is infeasible, but in some cases it may be possible to approximate it via a process of pseudo-marginalization. The general principle is as follows. Imagine that, via an exact simulation, we were to obtain a set of draws y(1), y(2), . . . , y(m) from Y, each agreeing with the realized y<i in its first i − 1 elements. Plainly, a consistent and unbiased estimator of βi is

    β̂i = (1/m) Σ_{j=1}^{m} Pr(Yi = 1 | Y<i = y<i, Y>i = y(j)>i, θ),   (6)

with convergence occurring in the limit as m → ∞. While we do not have an exact sample from Y, we may instead consider a set of draws from some random variable Y′ whose distribution is close to that of Y; we refer to the calculation of β̂i under the resulting sample as "pseudo-marginalization." Intuitively, if Y′ is a reasonable approximation to Y (and/or if the full conditional of Yi is relatively insensitive to y>i), then the pseudo-marginalized estimate of βi may be fairly close to the true value. Using the estimated βi values during perturbation can thus improve simulation performance, without losing quality guarantees.
The procedure for pseudo-marginalization is as follows. We begin by positing a distribution Y′, and drawing an initial marginalization sample M = (y′(1), . . . , y′(m)), y′(i) ~ Y′. Ideally, Y′ should be as close to Y as possible, while still being easily simulated. In practice, we suggest starting with an independent Bernoulli model, calibrated to preserve overall average features of Y (e.g., frequency of 1s, marginal covariate relationships, etc.). We then proceed with Algorithm 1, with the following simple modifications. Where fixation occurs (via the bounds), pseudo-marginalization is not employed. In the event that perturbation is necessary, the τ assignment of line 10 is replaced by β̂i per Equation 6, with M being used as the set of sample draws. Finally, once an assignment is made to yi (either by perturbation or fixation), we set (y′(j))i = yi for each y′(j) ∈ M. Thus, at the ith iteration, each element of M agrees with y in its first i − 1 entries, and the elements of M are identical to y at the conclusion of the algorithm. Otherwise, the sampling algorithm proceeds identically to Algorithm 1, and the quality assessment results of Section 3 are unaffected.
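The two modifications above can be sketched as follows; `full_conditional(i, m)` is a hypothetical stand-in for the model's Equation-4 full conditional of element i evaluated at realization m, and the clipping to the current bounds preserves the fixation guarantees:

```python
def pseudo_marginal_tau(i, M, full_conditional, lo, hi):
    """Estimate beta_i by averaging the full conditional of element i
    over the marginalization sample M, clipping the estimate to the
    current bounds (lo, hi) = (alpha'_i, gamma'_i)."""
    est = sum(full_conditional(i, m) for m in M) / len(M)
    return min(max(est, lo), hi)

def sync_sample(M, i, y_i):
    """After setting y_i, overwrite element i in every draw in M so
    that the sample's prefix matches the realized y."""
    for m in M:
        m[i] = y_i
```

At each iteration the estimate replaces the uniform τ draw only when perturbation is required; fixed elements are handled exactly as in Algorithm 1.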
Although pseudo-marginalization does not improve the guaranteed quality of the sampler, it can in practice have a substantial effect. The total increase in computational and storage cost is proportional to m (and thus constant in N); however, this may still prove burdensome for large problems. In cases where dependence between elements is strong (especially if N is not too large), pseudo-marginalization can be a relatively "cheap" way of improving performance.
One useful property of the construction of Section 2 is that it provides a very direct means of assessing the quality of the approximate sample (both ex post and ex ante). To make this notion precise, define yt to be the unique draw from Y corresponding to realization r from R (using the exact sampling method described in the introduction to Section 2). We can then consider the “quality” of a draw y|r produced by our scheme in terms of its deviation from yt|r – that is, we consider a draw to be of high quality when it conforms closely to what would have been obtained under the exact sampling mechanism, given the same random inputs. Let us define the quality, Q, of y by the normalized inverse Hamming distance between y and yt given r (i.e., Q(y) = (N − DH(y, yt))/N, where DH is the Hamming distance). Q then varies from 0 in the worst case (i.e., all elements of y differ from the exact sample) to 1 in the best case (i.e., all elements of y are identical to the exact sample).
While we do not know Q(y) exactly, we can provide a lower bound by exploiting the properties of the fixation cascade. Consider some element yi of y, and let hi be an indicator variable equal to 1 only if yi is known to equal yti (else 0). Clearly, if ri < αi or ri ≥ γi, yi = yti; in this case, let hi = 1. Moreover, let yh<i denote all elements yj such that j < i and hj = 1. Define the bounds

    α″i = min Pr(Yi = 1 | Y−i = y−i, θ)  and  γ″i = max Pr(Yi = 1 | Y−i = y−i, θ),

where the extrema are taken over all y−i agreeing with yh<i on the elements for which hj = 1, in direct analogy to the pair (α′i, γ′i) of Algorithm 1. If ri < α″i or ri ≥ γ″i, then we may fix the value of yi by conditioning only on prior elements that are known to be equal to their corresponding elements in yt. Since the set of Y realizations over which the extrema are taken for (α″i, γ″i) is a superset of the set for which (α′i, γ′i) are taken (and a subset of the set for which (αi, γi) are taken), it follows that we have αi ≤ α″i ≤ α′i and γ′i ≤ γ″i ≤ γi; thus, in any case in which yi would be fixed by (α″i, γ″i), it will be fixed to the same value by the extrema of Algorithm 1. We may therefore set hi = 1 where fixation under (α″i, γ″i) occurs. By induction, this property holds for arbitrary i, and hence, for all i, hi = 1 only if yi = yti. Since DH(y, yt) ≤ N − Σi hi, this gives us the ex post quality bound Q(y) ≥ (1/N) Σ_{i=1}^{N} hi.
Although it should be noted that the fixation under (α″, γ″) required for h is stronger than the associated condition of Algorithm 1 (and hence one cannot simply set hi = 1 where the latter occurs), computation of h can be performed by maintaining a separate set of bounds (i.e., α″ and γ″) and setting hi = 1 when the stronger fixation condition is met. This is generally a straightforward modification of the base algorithm, and is not given here.
Ex post evaluation is useful, in that it provides us with a sense of how well the algorithm performed in a given instance. Ex ante evaluation, however, is also useful in determining whether the sampling algorithm is likely to prove successful. In this regard, note that we can place an immediate bound on the expected value of Q(Y) by exploiting the initial fixation condition: specifically, EQ(Y) ≥ (1/N) Σ_{i=1}^{N} (1 − γi + αi). This may, of course, understate the quality of the approximation method in systems that are prone to long fixation cascades; typically, however, one would expect such systems to have fairly small γ − α intervals for most elements (since cascades are otherwise difficult to propagate), and thus to have reasonably large quality bounds. Regardless, a distribution with a very high ex ante quality bound is likely to be effectively approximated by the algorithm, and high EQ may thus be taken as diagnostic of situations in which use of the procedure is fairly safe.
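The ex ante bound is a one-line computation over the unconditional bound vectors; a minimal sketch (our own naming):

```python
def ex_ante_quality_bound(alpha, gamma):
    """Lower bound on the expected draw quality EQ(Y): the mean
    initial fixation probability 1 - (gamma_i - alpha_i) across
    elements, per the initial fixation condition."""
    return sum(1.0 - (g - a) for a, g in zip(alpha, gamma)) / len(alpha)
```

Since it depends only on (α, γ) and not on R, this diagnostic can be computed before any sampling is performed.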
While the above provide overall indicators of quality, it is apparent that we can do better: since we have, for each element, both the ex ante bound on its fixation (1 – γi + αi) and the ex post realization of its fixation state (both whether it was fixed prior to the first perturbation, and otherwise the state in which it was ultimately placed), we can construct more refined quality diagnostics that are tuned towards specific applications. For instance, we can place an upper bound on the number of elements of y mistakenly identified as 0 (or 1) by the number of i such that yi = 0 (respectively yi = 1) and hi = 0. Similarly, the ex ante probability that a given Yi will be correctly classified as a 1 is at least αi, and the corresponding probability for correct classification as a 0 is at least 1 – γi. More intricate diagnostics can be developed for applications such as random graphs, in which specific configurations of variable states may have particular significance.
As a final note, it must be emphasized that the ex post diagnostics should not be used to guide the sampling process itself (e.g., by repeatedly sampling until a high-quality draw is obtained). This is because Q(y) is not in general independent of yt, and the two may in fact be strongly related. (Consider, e.g., the case of simulating a random adjacency matrix with dependence induced by a triangle term. Extremely sparse realizations will have few potential two-paths, leading to long initial fixation cascades, and subsequently higher quality. Selecting for high quality draws will thus lead to a bias in favor of low-density, unclustered graphs – quite the opposite of what the model is intended to produce!) On the other hand, quality diagnostics can be safely used to place bounds on the possible states of yt, to identify likely weaknesses in the approximation employed, etc. These considerations do not apply to ex ante diagnostics, which (being independent of R) are not subject to the same selection effects.
Where the lower bound on quality, (1/N) Σi hi, is high, it is evident that the bound sampler is close to exact; however, the bound sampler may still perform well even when this bound is relatively low. In general, assessing performance in such cases requires comparison with high-quality MCMC runs (as shown in Section 5.1). In a few cases, however, one can posit model families with edgewise dependence whose behavior can be solved exactly; in these cases, we may compare the bound sampler (and MCMC) directly to the behavior of an exact sampler.
Here we provide an example of such an analysis, using an exponential family form of the well-known U|man model (Wasserman and Faust, 1994). The U|man model (standing for "uniform, given probabilities of mutual, asymmetric, and null dyads") is a homogeneous random digraph family that incorporates biases in mean density and reciprocity. In exponential family form, this distribution may be written as

    Pr(Y = y | θ) = exp(θe te(y) + θm tm(y)) / Σ_{y′} exp(θe te(y′) + θm tm(y′)),

where Y is a directed adjacency matrix of dimension n × n, te(y) is the count of edges within y, tm(y) is the count of mutual dyads (i.e., node pairs i, j such that Yij = Yji = 1), and θ = (θe, θm) is a vector containing "edge" and "mutuality" parameters (respectively). Intuitively, the probability of an (i, j) edge conditional on there being no reciprocating (j, i) edge is the inverse logit of the edge parameter, while the mutuality parameter is the change in the logit of this probability when a (j, i) edge is present. Although dyads are independent under U|man, edges within dyads still depend on one another, and this dependence can be extremely strong. Because all dependence here is governed by a single parameter (θm), and because the resulting model is tractable, the exponential family form of U|man is a useful system for studying the impact of dependence on the bound sampler.
A first natural question to ask of the U|man system is how (unperturbed) fixation probability varies with θ. Because this model is dyad independent, it suffices to consider the case of a representative i, j dyad. Per Algorithm 1, either (i, j), (j, i), or both may be initially fixed; if exactly one is initially fixed, then the other will necessarily be fixed by a cascade. (This last follows immediately from the fact that the probability of an (i, j) edge depends only on the state of the corresponding (j, i) edge.) It follows, then, that the probability that the (i, j) edge will be neither directly nor indirectly fixed is equal to the probability that neither (i, j) nor (j, i) is initially fixed. Let k and k′ be the positions of the (i, j) and (j, i) edge variables in the Algorithm 1 sort order. By homogeneity, the non-fixation probability is then equal to

    Pr(αk ≤ Rk < γk) Pr(αk′ ≤ Rk′ < γk′) = (logit⁻¹(θe + θm) − logit⁻¹(θe))²

(where we have exploited the symmetry between α and γ with respect to the sign of θm, together with the square, to write the two terms in the above difference in arbitrary order).
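The non-fixation probability above is directly computable; a small sketch (our own naming) under the extremal full conditionals logit⁻¹(θe) and logit⁻¹(θe + θm):

```python
import math

def inv_logit(x):
    """Inverse logit (logistic) function."""
    return 1.0 / (1.0 + math.exp(-x))

def uman_nonfixation_prob(theta_e, theta_m):
    """Probability that neither edge of a u|man dyad is fixed,
    directly or by cascade: the squared gap between the extremal
    full conditionals inv_logit(theta_e + theta_m) and inv_logit(theta_e)."""
    return (inv_logit(theta_e + theta_m) - inv_logit(theta_e)) ** 2
```

The square makes the expression symmetric in the sign of θm, and the quantity is maximized when a strong mutuality effect is offset by an opposing edge effect, matching the "lobes" discussed below.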
To provide an intuitive sense of how this varies in practice, Figure 1 shows the resulting dyadic fixation probability (scaled to percentages for ease of display) as a function of θe and θm. As can be seen, fixation is near total except with respect to two "lobes" of high |θm|, one occurring where the mutuality effect is strongly positive and the edge effect is somewhat negative, and the other occurring where the mutuality effect is strongly negative and the edge effect is somewhat positive. These cases (with a strong mutuality effect that is largely balanced by the baseline density) are those that maximize the γ − α gap, and hence that minimize the probability of fixation.
To get a better sense of how performance varies above and beyond what can be guaranteed by the fixation bounds, we here show a simple simulation study comparing mean edge, mutual, and triangle counts under the bound sampler to their U|man expectations. By way of comparison to standard methods, we also show the performance of a basic (single random edge update Metropolis) MCMC algorithm with various choices of burn-in time. For all simulation methods, we estimate the expected subgraph counts using the sample means from 1,000 independent draws; to ensure independence, each MCMC-based draw was taken from an independently run Markov chain seeded with an iid uniform graph and simulated forward for the indicated number of iterations.² Perturbation for the bound sampler was performed using the pseudo-Bayesian method of Section 2.2.3, with the pseudo-prior being a beta distribution with expectation equal to the expected density and a total concentration (i.e., parameter sum) of 6 (chosen to produce unimodality while still being diffuse). The simulation target in each case was a U|man distribution for a digraph of order n = 25, with θe and θm each systematically varied in an 11-point sequence from −3 to 3. The parameter range was selected to cover density levels from the nearly empty (mean degree less than 1) to the nearly complete (mean degree greater than 23), with a corresponding range in tendencies towards or away from mutuality. MCMC samples were taken from chains of length n (25), n² (625), and n³ (15,625).
To assess simulation quality, we compare the mean counts for edges, mutuals, and triangles under each sampling scheme to the exact values from the corresponding U|man distribution; we employ the mean absolute errors in these counts at each parameter value as our performance criterion. Figures 2–4 show the corresponding error surfaces (darker values indicate greater deviation), along with the mean absolute errors at each simulated value (rounded to the nearest integer). As a reference, the maximum numbers of edges, mutuals, and triangles for these digraphs are respectively 600, 300, and 2,300.
Figure 2 shows the error in the expected edge count (and hence density) under each method. As expected, the “gold standard” MCMC with a chain length that is long compared to the number of edges (bottom right) produces nearly flawless performance over the entire parameter space. On the other hand, the bound sampler (top left) also does quite well: it shows no or very minimal errors over much of the parameter space, with error rates that are still quite modest (< 4%) in the regions of worst performance. Although error is obviously related to Q (errors cannot be large when Q is high), the sampler can still perform well when Q is lower: for instance, the region near θe ∈ (–2, –1), θm ≈ 3 (and its mirror image) has the worst quality bounds over the parameter range considered, but the simulated densities here are quite accurate. It is interesting to contrast this performance with that of Markov chains that are poorly mixed. When chain length is on the order of the number of nodes (top right panel), densities are generally far from their target values. This is not surprising (given that the chain cannot update many edges), but serves as a cautionary note. Perhaps more significant is that chain lengths on the order of the number of edge variables (bottom left) are still quite insufficient to yield reasonable results. While intervals much greater than n² are easy to achieve for small graphs like those used here, typical default chain lengths for current ERGM MCMC tools (e.g., ergm v3.1) will prove inadequate for networks of approximately n ≥ 100.
Mean absolute errors for mutuality (number of mutuals) are shown in Figure 3. The results here largely mirror those of Figure 2: very long-chain (i.e., n³) MCMC performs flawlessly, the bound sampler works well (with modest errors over certain parts of the parameter space), and shorter-chain MCMC runs do very poorly.
Finally, we consider the propagation of simulation error into higher order structures, by way of the triangle count. As Figure 4 shows, the results in this case are not qualitatively different from those shown in the other figures: we continue to see strong performance from the bound sampler and long-chain MCMC simulations, and poor performance for shorter MCMC runs. In particular, we note that we do not see a tendency for biases in mutuality rate to propagate destructively into the triangle formation rate for the bound sampler, which continues to show error rates of 4% or less across the sampled parameter space. While the bound sampler is indeed approximate, we thus find that it here approaches the performance of high-quality MCMC simulations even for the parts of the parameter space in which it is not provably near-exact.
As a final comment on these results, it should be noted that the edge/mutual model allows constant-time initialization, resulting in a worst-case time complexity of order N (in terms of graph size, n²). The above experiments suggest that MCMC runs of substantially higher time complexity (e.g., as much as n³) are necessary to obtain superior results for this system in the general case. Obviously, the implications of this for actual practice are not direct—in a practical setting, we could for instance enhance MCMC convergence by using a seed draw from a distribution close to the target, or employ an alternative proposal distribution, and we could likewise enhance the bound sampler by means of the techniques described elsewhere in this paper—and there is in any event little reason to choose either the bound sampler or MCMC over exact sampling for U|man! However, these results do demonstrate that there are settings in which the bound sampler can produce draws whose quality would require a much more time-complex MCMC run to match. In light of the fact that mixing times for current MCMC algorithms on ERGMs can be extremely long (Bhamidi et al., 2011), it is reasonable to suspect that more such cases will be found in practice.
As noted above, the bound sampler may be employed in any setting involving a finite series of binary random variables whose joint distribution is specified in exponential family form. Currently, the most active area of such research in sociology centers on social networks, with exponential family models (ERGMs) serving as an increasingly widely used tool for specifying models on graphs. Here, we show two applications of the bound sampler to network simulation, both reflective of typical use cases in the social network field. First, we show how the bound sampler may be used to simulate draws from an ERGM fitted to observed network data (here, a communication network), comparing the results to what would be obtained from standard MCMC methods. Second, we provide an example of exploratory simulation, in which we use the bound sampler to investigate the effect of dependence (specifically, isolation avoidance) on a large-scale spatially embedded network. In this latter case, the network in question is sufficiently large to make MCMC simulation prohibitive, making the bound sampler the only effective means of exploring the impact of edgewise dependence on the structure of the resulting network.
To illustrate the use of the bound sampler as a tool to simulate networks from fitted ERGMs, we consider a simple model fit to an intraorganizational communication network collected by Butts et al. (2007). Specifically, we consider radio communication ties among all named communicants in the Port Authority Trans-Hudson (PATH) control desk network during the emergency phase of the 2001 World Trade Center disaster; each vertex represents an individual, and two individuals are tied if at least one radio call was made between them (in either direction). The resulting network (see Figure 5) contains 229 individuals and 269 edges (mean degree 2.35). As is evident from the figure, this network shows substantial degree heterogeneity, some but not all of which is associated with the presence of individuals occupying institutionalized coordinative roles (e.g., desk operator) within the organization (orange nodes). Although not obvious from visual inspection, this network is also relatively triangulated given its size and density (38 triangles, CUG p < 0.001), suggesting an underlying tendency towards triadic closure.
To capture these effects, we fit a simple ERGM to the PATH network using the following terms: differential mixing rates between and within nodes with institutionalized coordinative roles (ICRs); individual-specific fixed effects for degree (i.e., sociality terms); and a geometrically weighted edge-wise shared partner term (GWESP (Hunter, 2007)) to account for residual transitivity. The coefficients for the simulated model are summarized in Table 1. Note that the apparently negative effect of ICR status on interaction probability must be interpreted in light of the individual-level fixed effects: sociality is positively correlated with ICR status (Spearman's ρ = 0.33, p < 0.0001), and indeed the median sociality effect for those with ICRs is substantially higher than for those without (3.648 vs. 0.002). A more nuanced view is thus that those with ICRs are more likely to be highly active, but ICRs are less likely to mix with each other than might be expected given their overall activity level. Similarly, the negative GWESP effect indicates that triangulation is reduced, but this is relative to the level of clustering that would be observed in a random graph with strong degree heterogeneity effects;4 although there are more triangles than would be expected due to size and density alone, there are fewer (and they are more dispersed through the graph) than would be expected given the strong variation in individual activity level.
To illustrate the behavior of the bound sampler on the PATH network model, we simulate 500 draws from the fitted ERGM; ex ante fixation probability sorting was used, and perturbations were made using a diffuse Beta prior distribution with expectation equal to the observed graph density. As a point of comparison, we also simulate draws from the “gold standard” MCMC method (Hunter et al., 2008), with a burn-in of 5 × 10⁶ draws and a thinning interval of 5 × 10⁵ (enough to revisit each edge variable an average of approximately 20 times between each draw; convergence was also verified using standard diagnostics (Goodreau et al., 2008)). Results from both simulations are shown in Figure 6. Our simulation quality index, Q, suggests that the sampler works well here: on average, approximately 99% of edge variables were subject to unperturbed fixation, with very consistent performance across simulation runs (panel (A)). Although the structure of a sparse graph can be sensitive to small numbers of edge variables, Figure 6 shows that the distributions of edges (panel (B)), triangles (panel (B)), degree scores (panel (C)), and ESP statistics (panel (D)) generated by the bound sampler draws fairly closely approximate the high-quality MCMC sample. Unlike the MCMC sample, however, draws from the bound sampler are guaranteed to be independent, and require visiting each edge variable at most once.
Simulated draws like those given here are clearly adequate to approximate model behavior in cases for which MCMC is too computationally expensive, or where independence of draws is a priority (e.g., quadrature problems). Alternately, simulated draws from the bound sampler can be used to produce fully independent “seed” draws for MCMC simulations, allowing shorter MCMC runs and ensuring that the results from each run are independent. This may be particularly useful in settings such as goodness-of-fit assessment, in which seeding runs with the observed data is problematic (although widely performed).
Although the bound sampler can be used as a complement to MCMC where both are available, its scalability also allows it to be used to generate networks that are too large to be effectively simulated by current MCMC methods. As an illustration of such an application, we here demonstrate its use to simulate draws from a social network model combining geographical effects and degree distribution bias. The network model we employ is an extension of the “social friendship” model of Butts (2002), an inhomogeneous Bernoulli graph model based on friendship data from a housing study by Festinger et al. (1950) used by Butts et al. (2012) to investigate the potential impact of geographical heterogeneity on network structure. Written in conditional ERG form, this model is a curved exponential family (Hunter and Handcock, 2006) specified by
where Dij is the geographical distance between i and j, pb ∈ (0, 1] is the baseline probability of an i, j edge (at distance 0), α ∈ (0, ∞) is a scaling parameter, and γ ∈ (0, ∞) is a shape parameter governing the form of the decay of tie probabilities at long distances. (For an investigation of the consequences of this model for network structure, see Butts and Acton (2011) or Butts et al. (2012).) One factor uncaptured in this original model is the tendency for individuals to seek to maintain personal networks within a particular size range; intuitively, individuals are incentivized to avoid friendlessness (i.e., having degree 0), and the cost of maintaining friendship ties implies that few persons can sustain large numbers of friendships. A natural way to model such effects is to supplement Equation 9 by adding statistics that count the number of vertices having a particular degree. E.g.:
where t0, . . . , tN–1 is a vector of statistics such that ti(y) is the number of vertices of G having degree i, and δ = (δ0, . . . , δN–1) is a parameter vector. Any particular choice of δ creates a specific “bias” in the degree distribution, relative to what would be obtained if the baseline effects of Equation 9 were the only factors influencing network structure.
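As a sketch of these statistics (illustrative code of our own, assuming an undirected 0/1 adjacency matrix):

```python
import numpy as np

def degree_count_stats(adj):
    """Return the vector (t_0, ..., t_{N-1}), where t_i is the number of
    vertices having degree i in the undirected 0/1 adjacency matrix adj."""
    n = adj.shape[0]
    deg = adj.sum(axis=1).astype(int)      # vertex degrees
    return np.bincount(deg, minlength=n)   # counts of each degree value

# The added term contributes sum_i delta_i * t_i(y) to the exponent; an
# isolation penalty is the special case delta = (delta_0, 0, ..., 0), delta_0 < 0.
```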
Because vertex degrees are not independent of one another, the model of Equation 10 no longer has Bernoulli structure: the conditional probability of observing a particular edge depends upon other edges within the graph. (Specifically, this model exhibits Markov graph dependence; see Frank and Strauss (1986).) The traditional approach to simulation of such models is the use of MCMC (see, e.g. Snijders, 2002), which scales poorly for large networks (Bhamidi et al., 2011). Here, we employ instead our proposed sampling procedure, taking draws from the model of Equation 10 with fixed distance effects and various choices of δ. For the baseline parameters, we use the median parameter estimates from Butts (2002), specifically pb = 0.529, α = 0.031423, and γ = 2.768. Vertex positions for this simulation are based on block level data from the year 2000 US Census, for the Choctaw, MS metropolitan statistical area. In particular, each household within each census block is assigned a geographical location uniformly at random within the block area (with individuals not belonging to households being treated as households of size 1). After assigning households to positions, each household member is assigned a position from a circular jittering distribution centered on the household location, with a maximum radius of 5 meters. This placement process thus preserves within-block population densities, while also respecting local clustering associated with variable household size; while simple, this model has been shown to reasonably approximate the micro-distribution of household locations (Almquist and Butts, 2012). The resulting population distribution is depicted in Figure 7 (|V| = 9,758).
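The placement procedure can be sketched as follows (a simplified illustration, not the authors' implementation: a rectangle stands in for the census block polygon, and the function and argument names are ours):

```python
import numpy as np

def place_members(block_bounds, hh_sizes, rng, max_r=5.0):
    """Assign each household a uniform location within the block (here a
    rectangle (xmin, ymin, xmax, ymax) for simplicity), then jitter each
    member uniformly within a disc of radius max_r metres around it."""
    xmin, ymin, xmax, ymax = block_bounds
    pts = []
    for size in hh_sizes:
        hx, hy = rng.uniform(xmin, xmax), rng.uniform(ymin, ymax)
        for _ in range(size):
            r = max_r * np.sqrt(rng.random())   # sqrt gives uniform density on the disc
            a = rng.uniform(0.0, 2.0 * np.pi)
            pts.append((hx + r * np.cos(a), hy + r * np.sin(a)))
    return np.array(pts)
```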
For the δ effects, we here focus on the phenomenon of isolation avoidance: where ties are valuable, those who would otherwise be socially isolated may be motivated to seek them (hence leading to a reduced proportion of vertices with degree zero). In ERGM terms, this amounts to setting δi = 0 for i > 0, and δ0 to a negative number (reflecting the degree to which isolation is penalized), which is in turn equivalent to simply including an isolate term in the spatial model with parameter δ0 < 0.
To explore the potential impact of isolation avoidance in the Choctaw model, we take 100 draws using the bound sampler, with δ0 set respectively to 0, −1, −2, and −4 (reflecting an increasingly sharp penalty for isolate formation). Mean values for selected network statistics are given in Table 2; standard errors (not shown) are less than 1% of the mean value in all cases, and all differences are statistically significant. The first column of Table 2 shows the behavior of the baseline model without dependence effects: this is an exact sample (Q = 1), in which approximately 32% of nodes are isolated on average (Isolate Fraction). The average mean degree in the baseline case is 1.28 (unsurprisingly low, given the large isolate fraction), but the graph is relatively clustered (transitivity of approximately 0.18). As a simple measure of embeddedness in cohesive subgroups, we employ the mean core number, the average over vertices of the order of the highest-order k-core (Wasserman and Faust, 1994) to which each vertex belongs. Vertices with a core number of 0 are isolates; those with a core number of 1 belong to pendant trees, those with a core number of 2 belong to bicomponents, etc. The mean core number under the baseline model is approximately 0.95, reflecting the strong contribution from isolated and pendant vertices in the graph. Finally, we note that–as expected from the spatial model–there is a clear negative relationship between distance and the presence of a tie (expressed as a correlation between edge state and log distance).
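The mean core number is easily obtained from a k-core decomposition by peeling; a self-contained sketch follows (our own code; in practice a library routine such as those in igraph or NetworkX would normally be used):

```python
def core_numbers(adj):
    """Core number of each vertex by repeated peeling: at level k, delete
    every vertex whose (remaining) degree is below k + 1; a vertex's core
    number is the level at which it is deleted. adj maps vertices to
    neighbour sets."""
    adj = {v: set(ns) for v, ns in adj.items()}
    core, k = {}, 0
    while adj:
        doomed = [v for v in adj if len(adj[v]) < k + 1]
        if not doomed:
            k += 1
            continue
        for v in doomed:
            core[v] = k
            for u in adj.pop(v):
                if u in adj:
                    adj[u].discard(v)
    return core
```

For a triangle with one pendant vertex and one isolate, this yields core numbers (2, 2, 2, 1, 0), and hence a mean core number of 1.4.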
How do these properties change when isolate avoidance is added to the model? For this, we look to the succeeding columns of Table 2, which show mean values for the above statistics at increasingly negative parameter values. First, we verify that, as δ0 falls, the fraction of isolated vertices does indeed decline (dropping from a 32% baseline to just over 1% at δ0 = –4). Simulation quality (as evaluated via Q) remains high throughout, although the fixation rate falls slightly as the dependence among edges is increased (from 1 in the baseline model to approximately 0.994 at δ0 = –4). With these indications that the sampler is behaving as expected, we proceed to consider the impact of increasing isolation avoidance on other aspects of network structure. For instance, a reduction in isolation could be achieved in practice via two processes: one could see a net increase in edges (with new edges forming from those who would otherwise be isolated to other vertices in the graph), or one could see a reallocation of edges (with an increase in edges to otherwise isolated nodes being compensated by a loss of edges by nodes that would otherwise have higher degree). As the table shows, the former mechanism is active here: mean degree increases as δ0 falls, implying (given the fixed vertex set) a net increase in the number of edges in the graph. This increase is associated with an increase in the mean core number, reflecting the movement of vertices from the 0-shell (the set containing those in the 0-core but not the 1-core) into shells of higher order.
Does the movement of vertices from an isolated state to a better-connected one reflect embeddedness in dense subgroups, embeddedness in locally tree-like structures, or formation of lone dyads? We can tell from the declining transitivity score that the second of these possibilities is the dominant effect: given that isolates are being removed via edge addition, a declining transitivity score implies the formation of new incomplete 2-paths. Lone pairs do not produce such structures, and hence cannot be the source of the change; nor would the creation of dense, heavily triangulated subgroups. Several conclusions follow. First, since would-be isolates are not primarily pairing with each other, they are necessarily forming ties to vertices that would otherwise have higher degree; suppression of isolates thus has “spillover” effects for higher degree nodes. Second, since said potential isolates are generally embedding in locally tree-like structures, it seems plausible that isolate suppression increases connectivity without a corresponding increase in cohesion (i.e., formation of redundant paths). Intuitively, cohesion is driven heavily by the fraction of nodes having degree at least 2 (since any node of degree less than 2 cannot lie on a cycle), and hence social processes that act primarily to shift vertices from degree 0 to degree 1 will have a very different impact than those that act to increase degree more uniformly.
In addition to asking how the positional structure of the network changes as isolates are suppressed, we can also ask how isolate avoidance alters the spatial structure of the network. From Table 2, we can see that the negative relationship between tie probability and distance inherent in the spatial component of the model persists (with some mild fluctuations) as δ0 becomes more negative; for a richer view of the relationship, it is useful to examine the detailed distribution of ego-alter distances at each parameter value. Figure 8 shows the cumulative distribution functions (CDFs) for geographical distances among adjacent alters for representative draws from the bound sampler at each value of δ0 (with the dotted black line showing the CDF for distances among all dyads). Consistent with the negative spatial correlation, there is a strong tendency for adjacent alters to be proximate relative to what would be expected under random mixing (all model CDFs are above the baseline). Introducing an isolation penalty, however, greatly increases the number of long-range ties in the network: while 90% of ties are within 100m in the purely spatial model, this drops to 83% for δ0 = –1, 72% for δ0 = –2, and 54% for δ0 = –4. Moreover, the disparity is greatest for edges to alters who are relatively close (e.g., < 1km) while being outside of ego's immediate household area.
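The CDFs in question are straightforward empirical quantities; a minimal sketch (our own helper, with illustrative names):

```python
import numpy as np

def tie_distance_cdf(pos, edges, grid):
    """Empirical CDF of ego-alter geographical distances: pos is an (n, 2)
    coordinate array, edges an iterable of (i, j) index pairs, and grid the
    distances at which the CDF is evaluated."""
    d = np.array([np.hypot(*(pos[i] - pos[j])) for i, j in edges])
    return np.array([(d <= g).mean() for g in grid])
```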
Taken together with the above results, this gives us a more complete understanding of what happens as isolation avoidance increases. Nodes that would otherwise be isolated form ties to others (not merely other isolates) who are farther away than would otherwise be likely, but still favoring those at short distances. These ties tend not to be embedded in locally cohesive structures, but are instead part of tree-like complexes (possibly converting previously isolated vertices into pendants). The result is a graph that is better connected, with slightly greater spatial mixing, but without an increase in overall clustering. (Indeed, the median core number holds constant at 1—and the average maximum at approximately 4—over all simulations, consonant with the conclusion that most change involves movement from the 0-shell into the 1-shell.) This is a distinctive pattern of behavior that would be difficult or impossible to explore without the ability to efficiently simulate large networks, an ability that conventional MCMC-based sampling schemes do not provide. The bound sampler gives us the ability to investigate much larger systems than have been accessible in the past.
Like any sampling scheme, the bound sampler has inherent tradeoffs and complexities that may indicate—or contraindicate—its use in particular settings. Before concluding, we comment here on several issues related to the use of the bound sampler in practice, both in the context of network simulation and in other applications.
Section 4 analyzed bound sampler quality in the context of the exactly computable U|man family; Section 5 showed the behavior of the bound sampler in several more realistic applications. Beyond these special cases, what can be said regarding the quality of bound sampler draws in practice? Noting that a lower bound on Q can be obtained from the mean distance between α and γ in the absence of any prior fixation, it follows that distributions admitting tight Bernoulli bounds (in the sense of Butts (2011)) will have provably high quality. Butts (2011) discusses various factors influencing bound tightness, the most important of which are the degree of inhomogeneity in baseline probabilities and the degree of coupling between variables. In general, the most loosely bounded distributions are those involving strong global coupling with a high degree of homogeneity; highly inhomogeneous distributions with weaker and/or more local coupling admit tighter bounds. In the context of the bound sampler, the presence of fixation cascades can lead to better performance than a simple Bernoulli bound analysis would suggest; bound tightness should thus be viewed as a sufficient but not necessary condition for high-quality draws.
As we have noted throughout, both the bound sampler and MCMC samplers are approximate; MCMC draws can in principle be made arbitrarily close to exact, but in practice this depends on mixing time (which can be quite long) and requires convergence assessment. Where circumstances permit the generation of high-quality MCMC samples (and neither computational costs nor convergence assessment are problematic), MCMC is recommended. This is particularly true when Q is low, and bound sampler quality cannot be guaranteed. On the other hand, the bound sampler is recommended as an alternative to MCMC in settings where the latter's mixing time is prohibitively long (due either to coupling time between variables or system size) and the quality of the former is acceptable. Pilot simulations may be required to assess this. The bound sampler is also suggested as a possible alternative to MCMC when independence of draws is a serious concern, and/or where unsupervised execution is a priority (again, assuming that the sample quality is adequate). The third alternative of relatively short MCMC runs seeded with independent bound sampler draws may be an even better option in such cases, and should be considered when neither approach alone appears to produce adequate results.
One exception to the above guidance is when the objective is to bound a statistic on Y (e.g., place an upper bound on the expected number of triangles formed within a network). In this case, the bound sampler allows one to place provable bounds on, e.g., the noncentral moments or quantiles that the statistic in question would take under an exact sample, a property that MCMC does not share. In these applications, the bound sampler may be preferred to MCMC even when the latter produces higher quality draws overall.
Although we have focused here on its use in the context of ERGM simulation, the bound sampler can be used in any setting in which one seeks to simulate binary vectors and can represent one's target distribution in exponential family form. One such setting is the modeling of influence on binary attitudes (approve/disapprove, agree/disagree) or behaviors within a fixed network or other social setting; the exponential family models employed by Holyst et al. (2001) to model social impact dynamics (Latané, 1981; 1996) are one example of the types of influence process to which the bound sampler could be applied. Other settings lending themselves naturally to representation as dependent binary processes are modeling of dichotomous voting systems (for/against) and attendance or participation by persons or organizations at events. Examples of such processes have been famously studied by, e.g., Coleman (1964) and Granovetter (1978), and exponential families for the latter have recently been used in dynamic network models (see, e.g. Almquist and Butts, 2014) but could easily be treated in their own right.
Finally, it should be noted that, while we have here been concerned exclusively with binary vectors, generalizations of the bound sampler to more general contexts are possible. For instance, consider the case in which we seek to simulate a vector, Z = (Z1, . . . , ZN), such that Zi has support {0, 1, . . . , Mi} (with Mi known). Such random variables arise e.g. in the context of bounded count processes, as when an event has a fixed number of occasions to occur and the total number of occurrences is observed. By defining a binary vector, Y, such that (1) each Zi is the sum of Mi distinct elements of Y and (2) the corresponding vector of sums has the same distribution as the original Z, we may reduce the problem of simulating a count vector to the problem of simulating its associated binary representation. As a general matter, wherever such a reduction can be carried out, the bound sampler can then be employed to sample the original variable.
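For the independent bounded-count case this reduction is immediate; the sketch below (our own illustration) writes each Binomial(Mi, pi) count as a sum of Mi binary variables. Dependent counts require a joint binary law whose sums reproduce the distribution of Z.

```python
import numpy as np

def binary_representation(M, p, rng):
    """Represent independent Binomial(M_i, p_i) counts Z_i as sums of M_i
    binary variables: returns the concatenated binary vector Y and the
    recovered count vector Z."""
    blocks = [rng.random(m) < q for m, q in zip(M, p)]  # M_i Bernoulli(p_i) draws each
    z = np.array([b.sum() for b in blocks])             # Z_i = sum of its block
    return np.concatenate(blocks), z
```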
In this paper, we have introduced a novel alternative sampling algorithm for binary vectors with fixed execution time and quality guarantees. This approach stands in contrast to MCMC methods, which have variable convergence time (taking “convergence” to be defined in pragmatic rather than ideal terms) and which cannot be deterministically related ex ante to an exact sampler. Our approach is based on sequential approximations to the partial marginals of the joint distribution, and exploits the fact that bounds on these marginals can often be used to exactly determine the “correct” state of a given random variable (vis-à-vis an exact sampler to which our sampler is stochastically coupled). Where the state of the corresponding exact sampler is uncertain, the elements of the simulated draw that are fixed (i.e., known to match the exact draw) can still be used to place bounds on the properties of the vector as a whole. (For instance, the total number of 1s in the exact draw must be at least the number of 1s in the fixed portion of the simulated draw, and no more than this number plus the number of unfixed elements.) These properties make the bound sampler a potentially useful alternative to MCMC in unsupervised settings, or when principled bounds on a random quantity are more important than exact estimates.
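The interval bound on the number of 1s described above is trivial to compute from a partially fixed draw (illustrative helper; names are ours):

```python
def ones_bounds(draw, fixed):
    """Bounds on the count of 1s in the exact draw implied by a simulated
    draw whose elements are either fixed (known to match the exact sampler)
    or not: fixed 1s give the lower bound; every unfixed element could be 1."""
    lo = sum(1 for d, f in zip(draw, fixed) if f and d == 1)
    return lo, lo + sum(1 for f in fixed if not f)
```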
Although there are many problems that can be expressed in terms of random binary vectors, the most active in current sociological research is the simulation of random graphs (a fundamental task for both theory and empirical investigation in the study of social networks). To that end, we here illustrated the use of the bound sampler to simulate social networks from both empirically fitted ERGMs and from exploratory models. In the former case, we showed that the bound sampler can provide a reasonable proxy for MCMC-based draws, with the added benefits of fixed execution time and quality guarantees mentioned above. In the latter, we showed that the superior scalability of the bound sampler allowed it to be used to simulate very large networks, where current MCMC implementations are too slow to be useful. Given the increasing importance of large-scale networks within sociological research, there is a substantial need for principled techniques for statistical simulation of networks with tens of thousands (or more) vertices; the bound sampler would seem to have considerable virtue in such applications.
Finally, we note that there are many potential extensions and refinements of our basic algorithm (some of which were discussed in Section 2.2). While a drawback of our method is that, unlike the case of MCMC, one cannot improve simulation quality simply by running it longer, techniques such as pseudo-marginalization can be used to improve quality where necessary. Further research on when and where such methods can be used, and on further techniques to enhance simulation quality, would seem to be an important direction.
1This research was supported in part by ONR award N00014-08-1-1015, NIH award 1R01HD068395-01, and NSF awards BCS-0827027 and IIS-1251267.
1It should be borne in mind that β is a random vector, unlike the fixed vectors of α and γ. However, each βi depends only on the outcomes preceding it, and is bounded by αi and γi regardless of those outcomes.
4Ceteris paribus, enhanced degree variance leads to clustering, since there are fewer ways to avoid creating triangles when edges are concentrated on a smaller set of nodes than when edges are evenly dispersed through the graph.