J Comput Graph Stat. Author manuscript; available in PMC 2017 March 9.
Published in final edited form as: J Comput Graph Stat. 2016; 25(1): 225–245. Published online 2016 March 9. doi: 10.1080/10618600.2014.983642
PMCID: PMC5033129; NIHMSID: NIHMS779487

Laplace Variational Approximation for Semiparametric Regression in the Presence of Heteroskedastic Errors

Bruce D. Bugbee, Postdoctoral Fellow
Department of Biostatistics, University of Texas MD Anderson Cancer Center, Houston, TX, 77030 (bdbugbee@mdanderson.org)
F. Jay Breidt, Professor
Department of Statistics, Colorado State University, Fort Collins, CO, 80523 (jbreidt@stat.colostate.edu)
Mark J. van der Woerd, Research Scientist

Abstract

Variational approximations provide fast, deterministic alternatives to Markov Chain Monte Carlo for Bayesian inference on the parameters of complex, hierarchical models. Variational approximations are often limited in practicality in the absence of conjugate posterior distributions. Recent work has focused on the application of variational methods to models with only partial conjugacy, such as in semiparametric regression with heteroskedastic errors. Here, both the mean and log variance functions are modeled as smooth functions of covariates. For this problem, we derive a mean field variational approximation with an embedded Laplace approximation to account for the non-conjugate structure. Empirical results with simulated and real data show that our approximate method has significant computational advantages over traditional Markov Chain Monte Carlo; in this case, a delayed rejection adaptive Metropolis algorithm. The variational approximation is much faster and eliminates the need for tuning parameter selection, achieves good fits for both the mean and log variance functions, and reasonably reflects the posterior uncertainty. We apply the methods to log-intensity data from a small angle X-ray scattering experiment, in which properly accounting for the smooth heteroskedasticity leads to significant improvements in posterior inference for key physical characteristics of an organic molecule.

Keywords: mean field variational Bayes, penalized splines, small angle X-ray scattering

1 Introduction

Variational approximations allow for the approximate computation of posterior parameter distributions of complex models in a deterministic fashion (Wainwright and Jordan, 2008). The term mean field variational Bayes (MFVB) refers to a class of variational approximations to posterior distributions under a nonparametric product density constraint (Bishop, 2006; Ormerod and Wand, 2010). Applications of MFVB approximations include mixtures (Attias, 2000), linear mixed models (Ormerod and Wand, 2010), and nonparametric regression with missing data (Faes et al., 2011). Given the connection between MFVB and full posterior parameter conditionals, variational approximations can be directly implemented in situations where all parameters have known posterior conditionals and Gibbs sampling is appropriate (Casella and George, 1992). This has led to useful software packages such as Infer.NET that allow for “black box” computation of variational approximations given fully known posterior conditionals (Minka et al., 2010; Wang and Wand, 2011).

Recent studies have considered non-conjugate models where one or more of the posterior conditionals are not of known form (Carlin and Polson, 1991). Examples of such models include logistic regression (Jaakkola and Jordan, 1997), nonparametric regression with measurement error (Pham et al., 2013), generalized linear mixed models (Tan and Nott, 2013), and correlated topic models (Blei and Lafferty, 2007). Variational approximations for non-conjugate models often must be derived on a case-by-case basis and can be analytically involved. Wand et al. (2011) highlight the use of auxiliary variables, univariate quadrature methods, and finite mixture approximations for addressing hierarchical models with complex distributions. Knowles and Minka (2011) focus on generalizing non-conjugate variational inference for exponential family cases arising from multinomial and binary regression. Salimans and Knowles (2013) and Nott et al. (2013) have recently developed alternatives to standard mean field variational approximations, called structural or fixed-form variational Bayes. These methods have shown promise for addressing non-conjugate problems through the assumption of exponential forms and application of Monte Carlo integration.

One such non-conjugate model arises from semiparametric regression with heteroskedastic errors. Models of this type have been explored in Ruppert et al. (2003), Chan et al. (2006), and Crainiceanu et al. (2007). Using the mixed model formulation of penalized spline regression described in Ruppert et al. (2003), models of this type allow for parametric and nonparametric terms in both the mean and variance levels. Nott et al. (2012) look at the case of a linear fixed effect model with heteroskedastic errors. The work we present here is a novel application of variational approximation methods to the case of semiparametric regression in the presence of heteroskedastic errors, represented as multilevel linear mixed effect models. We use an embedded Laplace approximation to handle the non-conjugacy found in our variational approximation (Wang and Blei, 2013; Nott et al., 2012). Results of Salimans and Knowles (2013) and Nott et al. (2013) could also be applied in this context, though our method uses analytic formulas for the variational moments needed in our special case rather than relying on Monte Carlo integration.

Our work is motivated by data arising from small angle X-ray scattering (SAXS), an imaging technique used in the estimation of physical shape characteristics of complex macromolecules. The general experimental methodology consists of bombarding a sample of the molecule in solution with a beam of X-rays and measuring the resulting scattering pattern on a two-dimensional detector orthogonal to the X-ray beam. The scattering data are reduced from two dimensions to one by averaging intensity (photon counts on the detector) across all of the pixels at a given radial distance from a fixed center. The radii in the detector plane are equivalent to scattering angles relative to the incident X-ray beam: zero radius corresponds to the beam itself, and larger angles correspond to increasing radii. The resulting “intensity curve” is then the set of ordered pairs (angle, radial average intensity at that angle). See Figure 4. The one-dimensional data from SAXS experiments are heteroskedastic, due to characteristics of the molecule in solution and the measurement mechanism; see for example Breidt et al. (2012). SAXS is a low-resolution method in which physical parameters such as maximum linear particle dimension and squared radius of gyration ($R_g^2$) can be estimated. The squared radius of gyration of a particle is defined as the ratio of the particle’s moment of inertia to its mass. Appropriate interval estimates for such parameters are of great importance to experimental scientists, who develop theoretical models consistent with the experimental scattering data. Accounting for the heteroskedasticity in the SAXS data is essential for proper inference on these physical parameters, which in turn allows experimental scientists to accurately determine the class of molecular structures that is consistent with a given SAXS dataset.

Figure 4
Constant error (top) and heteroskedastic (bottom) variational approximation fits of experimental SAXS data from a single 7-second exposure of a 4 mg/ml sample of the H2AH2B complex. The gray bands represent 95% pointwise credible intervals generated from ...

Fast analytic methods are desirable for SAXS data. Exposure times on the order of one second produce a single SAXS image and data set, and experiments generating many images across a range of experimental conditions are common. We will demonstrate that our variational/Laplace method is substantially faster than a Markov Chain Monte Carlo alternative.

Unless otherwise stated, all random vectors z are assumed to be continuous with densities of the form p(z). Conditional probability densities will be written as p(z | w), with p(z | ·) representing the density conditional on all other random quantities in the model. Let $(x - \kappa)_+^p = \max\{0, (x - \kappa)^p\}$. Diagonal matrices are written as diag(a1, …, am) and block diagonal matrices as blockdiag(A1, …, Am). Identity matrices of dimension p are written as $I_p$. Finally, let $\mathcal{N}_p(\mu, \Sigma)$ be a p-dimensional multivariate Normal distribution and $\mathcal{IG}(A, B)$ be an Inverse Gamma distribution with shape and scale parameters A and B.

2 Variational Bayes

We first provide a brief background on the theory of variational approximations. Let y be a response vector and ψ be a vector of parameters. Direct evaluation of p(y) in the posterior parameter distribution

$p(\psi \mid y) = \frac{p(y, \psi)}{p(y)}$
(1)

generally requires high dimensional integration, which may not be analytically tractable. A standard approach is to use Markov Chain Monte Carlo (MCMC) methods to approximate p(ψ | y). However, as models grow in complexity, the computational resources needed to perform MCMC increase dramatically and thus fast approximations become valuable.

Let q(ψ) be an arbitrary density function over the parameter space Ψ. Then

$\log p(y) = \int_\Psi q(\psi) \log p(y)\, d\psi \;\geq\; \int_\Psi q(\psi) \log\left\{\frac{p(y, \psi)}{q(\psi)}\right\} d\psi =: \log p(y; q),$
(2)

where the inequality follows from Kullback and Leibler (1951), who show that the Kullback–Leibler (K–L) divergence $\int q(\psi) \log\{q(\psi)/p(\psi \mid y)\}\, d\psi \geq 0$, with equality if and only if q(ψ) = p(ψ | y). The quantity log p(y; q) is commonly known as the evidence lower bound. The closer our choice of q(ψ) is to the true posterior distribution, the smaller the gap between log p(y) and log p(y; q); minimizing this gap is equivalent to minimizing the K–L divergence. Variational approximations are useful when analysis of p(y; q) is easier than that of p(y), which can be achieved by restricting q(ψ) in some manner.
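The stated equivalence between the gap and the K–L divergence follows from a standard decomposition of the log evidence, using $p(y, \psi) = p(\psi \mid y)\, p(y)$:

$\log p(y) = \int_\Psi q(\psi) \log\left\{\frac{p(y, \psi)}{q(\psi)}\right\} d\psi + \int_\Psi q(\psi) \log\left\{\frac{q(\psi)}{p(\psi \mid y)}\right\} d\psi = \log p(y; q) + \mathrm{KL}\{q \,\|\, p(\cdot \mid y)\}.$

Since the K–L term is nonnegative and log p(y) does not depend on q, maximizing log p(y; q) over q is equivalent to minimizing the divergence.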

2.1 Defining q(ψ)

The most common restriction imposed on the density q(ψ), which we use here, is that for some partition of the parameter vector ψ = (ψ1, ψ2, …, ψL),

$q(\psi) = \prod_{l=1}^{L} q_l(\psi_l).$
(3)

Variational approximations that impose such a product-type density restriction on q(ψ) are referred to as mean field variational approximations. When this restriction is used in a Bayesian setting, the term mean field variational Bayes or variational Bayes for short has become standard. Often there is a somewhat natural partition of the parameter vector given the structure of the model of interest. Both Titterington (2004) and Ormerod and Wand (2010) discuss the choice of the partition. Some partitions (e.g. partitions where there is a high degree of posterior dependence between parameters across partitions) can lead to poor approximations. Alternative forms of variational approximations can be achieved by constraining q(ψ) to a class of known parametric density functions but will not be discussed here.

Adopting the notation in Ormerod and Wand (2010), the optimal variational densities, $q_1^*(\psi_1), q_2^*(\psi_2), \ldots, q_L^*(\psi_L)$, that minimize the K–L divergence described in (2) can be found using the relationship

$q_l^*(\psi_l) \propto \exp\left[E_{-\psi_l}\{\log p(\psi_l \mid \cdot)\}\right].$
(4)

Here p(ψl | ·) is the posterior conditional distribution

$p(\psi_l \mid y, \psi_1, \ldots, \psi_{l-1}, \psi_{l+1}, \ldots, \psi_L).$
(5)

The expectation $E_{-\psi_l}$ is with respect to all q-densities except $q_l(\psi_l)$. The dependence of the optimal q-densities on the posterior conditional distributions indicates a close tie to Gibbs sampling (Casella and George, 1992). In fact, when conjugate priors are used for ψ1, …, ψL and the posterior conditionals have known forms, the optimal q-densities will have the same forms but different parameterizations.

Implementing a variational Bayes approximation under these conditions amounts to determining a set of variational parameters that fully define each q-density. Once these variational parameters are determined, the estimates of the model parameters ψ1, …, ψL are taken from the densities $q_1^*, \ldots, q_L^*$, usually as the means associated with each density.

3 Penalized Splines With Heteroskedastic Errors

A simple case of regression models in which both the response and the associated variances are smoothly varying across some underlying set of covariates is

$y_i = f(x_i) + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, \sigma_i^2), \qquad \log(\sigma_i^2) = g(x_i) \qquad (i = 1, 2, \ldots, N).$
(6)

Adopting a penalized spline formulation as described in Ruppert et al. (2003), the nonparametric model takes the form for all i = 1, 2, …, N:

$y_i = \left[1,\, x_i,\, \ldots,\, x_i^p,\, (x_i - \kappa_1)_+^p,\, \ldots,\, (x_i - \kappa_K)_+^p\right] \begin{bmatrix} \beta \\ b \end{bmatrix} + \epsilon_i = C_i^T \begin{bmatrix} \beta \\ b \end{bmatrix} + \epsilon_i,$
$b = (b_1, b_2, \ldots, b_K)^T \sim \mathcal{N}_K(0, \sigma_b^2 I_K), \qquad \epsilon_i \sim \mathcal{N}(0, \sigma_i^2),$
$\log(\sigma_i^2) = \left[1,\, x_i,\, \ldots,\, x_i^r,\, (x_i - \kappa_1^V)_+^r,\, \ldots,\, (x_i - \kappa_{K_V}^V)_+^r\right] \begin{bmatrix} \delta \\ c \end{bmatrix} = C_{Vi}^T \begin{bmatrix} \delta \\ c \end{bmatrix},$
$c = (c_1, c_2, \ldots, c_{K_V})^T \sim \mathcal{N}_{K_V}(0, \sigma_c^2 I_{K_V}),$
(7)

where β = (β0, β1, …, βp)^T and δ = (δ0, δ1, …, δr)^T are unknown parameter vectors, and $\{\kappa_k\}_{k=1}^{K}$ and $\{\kappa_k^V\}_{k=1}^{K_V}$ are sets of fixed, known knots.

Define the N × (p + K + 1) and N × (r + K_V + 1) design matrices $C = [C_i^T]_{i=1}^N$ and $C_V = [C_{Vi}^T]_{i=1}^N$, where the ith row corresponds to the ith vector of covariates from (7). Let $\theta = (\beta^T, b^T)^T$, $\theta_V = (\delta^T, c^T)^T$, $\Sigma = \operatorname{diag}(\sigma_1^2, \ldots, \sigma_N^2)$, and $V = (\log(\sigma_1^2), \ldots, \log(\sigma_N^2))^T$. It is convenient to view (7) as two mixed models with fixed effects β and δ and random effects b and c. The parameters of interest for the model are θ, θ_V, $\sigma_b^2$, and $\sigma_c^2$.
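As a concrete illustration, a minimal R sketch of how C and C_V can be assembled from the truncated polynomial basis in (7); the helper name make_design and the example values are our own, and the same knots are used at both levels purely for brevity:

    # Truncated polynomial design matrix: columns 1, x, ..., x^degree,
    # followed by (x - kappa_k)_+^degree for each knot kappa_k.
    make_design <- function(x, degree, knots) {
      poly_part  <- outer(x, 0:degree, `^`)
      trunc_part <- outer(x, knots, function(xi, k) pmax(0, xi - k)^degree)
      cbind(poly_part, trunc_part)
    }

    # Example: mean-level matrix C and variance-level matrix C_V.
    x     <- seq(0, 10, length.out = 200)
    knots <- quantile(x, probs = seq(0, 1, length.out = 12)[2:11])  # 10 interior knots
    C     <- make_design(x, degree = 2, knots)   # N x (p + K + 1)
    C_V   <- make_design(x, degree = 2, knots)   # N x (r + K_V + 1)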

For simplicity, we present our results in the context of the nonparametric model of (7). Minimal changes are required to handle more complex semiparametric structures. Additional parametric covariates can be included by expanding the fixed effect parameter vector, and additional nonparametric components or parametric-by-nonparametric interactions can be handled through the inclusion of new random effects with corresponding variance terms.

The truncated polynomial spline basis is chosen both for illustrative purposes and for its usefulness in estimation of a key molecular property from SAXS data, as described in §6. The lack of orthogonality in this basis can cause computational problems, particularly for MCMC methods. Often, P-splines (Eilers and Marx, 1996) or O’Sullivan splines (Wand and Ormerod, 2008) are used as alternatives for their computational stability. Other common choices of basis functions include radial basis functions, B-splines, and thin plate splines, depending on the application. These basis choices affect the definition of the fixed and random effects design matrices (X and Z in the mixed model formulation of Ruppert et al. (2003)) but not the variational approximation method presented here.

Equation (7) can be considered a submodel of the class of spatially adaptive penalized spline regressions with heteroskedastic errors described in Crainiceanu et al. (2007). These models allow both the error variance ($\sigma_i^2$) and the random effects variance ($\sigma_b^2$) to vary smoothly over some set of covariates. Our model is a simplification that holds $\sigma_b^2$ as a univariate parameter to be estimated. A limitation of a model of this form is that it can have trouble fitting data in regions where dramatic increases in variation coincide with strong changes in curvature. The inclusion of a spatially adaptive random effect variance allows the model to handle changes in the curvature of the underlying function f(x) better. The work presented here focuses on cases where the underlying first derivative of f(x) appears to vary slowly and does not coincide with sharp increases of variation, as is the case for our motivating application to SAXS data.

3.1 Bayesian Framework

Assume $\beta \sim \mathcal{N}(0, \sigma_\beta^2 I)$ and $\delta \sim \mathcal{N}(0, \sigma_\delta^2 I)$, where $\sigma_\beta^2$ and $\sigma_\delta^2$ are given hyperparameters. Assume Inverse Gamma priors $\sigma_b^2 \sim \mathcal{IG}(A_b, B_b)$ and $\sigma_c^2 \sim \mathcal{IG}(A_c, B_c)$ with hyperparameters $A_b$, $A_c$, $B_b$, and $B_c$. Define $\Sigma_\theta = \operatorname{blockdiag}\{\sigma_\beta^2 I,\, \sigma_b^2 I_K\}$ and $\Sigma_{\theta_V} = \operatorname{blockdiag}\{\sigma_\delta^2 I,\, \sigma_c^2 I_{K_V}\}$.

As in Crainiceanu et al. (2007), the model structure and conjugate nature of the priors described here lend themselves to a hybrid Gibbs sampling computational approach to fitting our model. The posterior conditionals of the parameters of interest are

$\theta \mid \cdot \sim \mathcal{N}(M C^T \Sigma^{-1} y,\, M), \qquad M = (C^T \Sigma^{-1} C + \Sigma_\theta^{-1})^{-1},$
$\sigma_b^2 \mid \cdot \sim \mathcal{IG}\left(A_b + \frac{K}{2},\; B_b + \frac{\|b\|^2}{2}\right), \qquad \sigma_c^2 \mid \cdot \sim \mathcal{IG}\left(A_c + \frac{K_V}{2},\; B_c + \frac{\|c\|^2}{2}\right),$
$p(\theta_V \mid \cdot) \propto \exp\left[-\frac{1}{2}\left\{\sum_{i=1}^N C_{Vi}^T \theta_V + \sum_{i=1}^N \frac{(y_i - C_i^T \theta)^2}{\exp(C_{Vi}^T \theta_V)} + \theta_V^T \Sigma_{\theta_V}^{-1} \theta_V\right\}\right].$
(8)

The challenge of fitting this model is due to the non-standard distribution of θV | ·. Crainiceanu et al. (2007) address this problem by introducing a latent error term to the variance model that allows for the sampling from N univariate posterior conditionals, which can be accomplished through traditional Metropolis-Hastings schemes. Another possibility is the use of methods designed for sampling from multivariate parameter distributions, such as the Delayed Rejection Adaptive Metropolis (DRAM) method of Haario et al. (2006).

4 Variational Approximation of Heteroskedastic Model

We now derive an approximation method for fitting the model (7) via variational Bayes. Nott et al. (2012) describe a variational approximation for the standard linear model in the presence of heteroskedasticity of known parametric form. Our approximation applies more broadly, to linear mixed models generally and to semiparametric regression under heteroskedasticity specifically. Unlike Nott et al. (2012), we do not assume a parametric form for the heteroskedasticity, but model it flexibly with penalized splines.

Let $\psi = (\theta^T, \theta_V^T, \sigma_b^2, \sigma_c^2)^T$ be a vector containing all parameters of interest. Using the product density constraint from (3), we assume that the variational density q(ψ) is of the form

$q(\psi) = q_1(\theta)\, q_2(\theta_V)\, q_3(\sigma_b^2)\, q_4(\sigma_c^2).$
(9)

For notational simplicity, the subscript index on each component density of q(ψ) will be dropped in future references (e.g. $q_1(\theta) = q(\theta)$). The notation $\sim_q$ will be used to describe the optimal distribution of a parameter derived using the relationship described in (4).

Define Γ = diag(γ1, γ2, …, γN), where $\gamma_i = E_q[\exp(-C_{Vi}^T \theta_V)]$ corresponds to the moment generating function of θ_V under the q-density evaluated at $-C_{Vi}$. Each parameter’s q-distribution will be defined by a set of variational parameters, denoted with a subscript q(·). Let $\mu_{q(1/\sigma_b^2)} = E_q[1/\sigma_b^2]$, $\mu_{q(1/\sigma_c^2)} = E_q[1/\sigma_c^2]$, $\Sigma_{q(\theta)} = \{\operatorname{blockdiag}(\sigma_\beta^{-2} I,\, \mu_{q(1/\sigma_b^2)} I_K) + C^T \Gamma C\}^{-1}$, and $\mu_{q(\theta)} = \Sigma_{q(\theta)} C^T \Gamma y$. Using the relationship found in (4) and the posterior conditionals described in (8), the optimal q-distributions for our model are

$\theta \sim_q \mathcal{N}(\mu_{q(\theta)},\, \Sigma_{q(\theta)}),$
$\sigma_b^2 \sim_q \mathcal{IG}\left(A_b + \frac{K}{2},\; B_b + \frac{1}{2}\|\mu_{q(b)}\|^2 + \frac{1}{2}\operatorname{trace}(\Sigma_{q(b)})\right),$
$\sigma_c^2 \sim_q \mathcal{IG}\left(A_c + \frac{K_V}{2},\; B_c + \frac{1}{2}\|\mu_{q(c)}\|^2 + \frac{1}{2}\operatorname{trace}(\Sigma_{q(c)})\right),$
$q^*(\theta_V) \propto \exp\left[-\frac{1}{2}\left\{\sum_{i=1}^N C_{Vi}^T \theta_V + \sum_{i=1}^N \left((y_i - C_i^T \mu_{q(\theta)})^2 + C_i^T \Sigma_{q(\theta)} C_i\right) \exp(-C_{Vi}^T \theta_V) + \theta_V^T \operatorname{blockdiag}\left(\sigma_\delta^{-2} I,\; \mu_{q(1/\sigma_c^2)} I_{K_V}\right) \theta_V\right\}\right] =: \exp\{-h(\theta_V \mid \cdot)\}.$
(10)

Here $\mu_{q(b)}$ and $\Sigma_{q(b)}$ refer to the elements of the variational mean and covariance of θ corresponding to the random effects vector b. Define $B_{q(\sigma_b^2)} = B_b + \frac{1}{2}\|\mu_{q(b)}\|^2 + \frac{1}{2}\operatorname{trace}(\Sigma_{q(b)})$ and $B_{q(\sigma_c^2)} = B_c + \frac{1}{2}\|\mu_{q(c)}\|^2 + \frac{1}{2}\operatorname{trace}(\Sigma_{q(c)})$. The full derivation of the variational densities is in Appendix A.

The lack of a conjugate form for θ_V | · in (8) carries through to the form of $q^*(\theta_V)$. This makes direct implementation of a variational Bayes approximation difficult, since it would involve (r + K_V + 1)-dimensional integration. As in Nott et al. (2012) and Wang and Blei (2013), we consider a Laplace approximation that replaces $q^*(\theta_V)$ with a multivariate Gaussian density. Consider the Taylor approximation of the function h(θ_V | ·) from (10) about some value α:

$h(\theta_V \mid \cdot) \approx h(\alpha \mid \cdot) + (\theta_V - \alpha)^T J_{\theta_V}(\alpha) + \frac{1}{2}(\theta_V - \alpha)^T H_{\theta_V}(\alpha)(\theta_V - \alpha),$
$J_{\theta_V}(\alpha) = \left.\frac{\partial}{\partial x} h(x \mid \cdot)\right|_{x = \alpha}, \qquad H_{\theta_V}(\alpha) = \left.\frac{\partial^2}{\partial x\, \partial x^T} h(x \mid \cdot)\right|_{x = \alpha}.$

Here $J_{\theta_V}(\alpha)$ corresponds to the Jacobian vector of first derivatives of h with respect to θ_V evaluated at α, and $H_{\theta_V}(\alpha)$ corresponds to the Hessian matrix containing all partial second derivatives of h evaluated at α. If α = argmin h(θ_V | ·), then $J_{\theta_V}(\alpha) = 0$ and the approximate form of $q^*(\theta_V)$, which we denote $\tilde{q}(\theta_V)$, becomes

$\tilde{q}(\theta_V) \propto \exp\left\{-\frac{1}{2}(\theta_V - \alpha)^T H_{\theta_V}(\alpha)(\theta_V - \alpha)\right\}.$
(11)

This implies that θV has an approximate variational distribution of the form

$\theta_V \sim_q \mathcal{N}\left(\alpha,\, \left(H_{\theta_V}(\alpha)\right)^{-1}\right).$
(12)

Alternatives to the Laplace approximation, particularly the delta method, could be used to address the non-conjugate problem surrounding θ_V (Wang and Blei, 2013). We choose Laplace approximations primarily for simplicity of implementation, since they require only second-order derivative information whereas the delta method requires third-order.
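A practical consequence of (12) is worth noting for the updates below: once θ_V has a Gaussian variational distribution, the moment generating function defining each γ_i in Γ is available in closed form, so the θ update requires no numerical integration. With $\theta_V \sim_q \mathcal{N}(\mu_{q(\theta_V)}, \Sigma_{q(\theta_V)})$,

$\gamma_i = E_q\{\exp(-C_{Vi}^T \theta_V)\} = \exp\left(-C_{Vi}^T \mu_{q(\theta_V)} + \tfrac{1}{2} C_{Vi}^T \Sigma_{q(\theta_V)} C_{Vi}\right),$

which is the same expression that appears inside the lower bound of Appendix B.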

4.1 Algorithm

Using the q-densities described in (10) and (12), constructing a variational Bayes approximation of our model consists of updating the parameters associated with each q-density until q(ψ) sufficiently approximates p(ψ | ·). This is typically done via a coordinate ascent type algorithm (Ormerod and Wand, 2010; Nott et al., 2012; Wang and Blei, 2013). The variational parameters of interest here are denoted as $\mu_{q(\theta)}$, $\Sigma_{q(\theta)}$, $\mu_{q(\theta_V)}$, $\Sigma_{q(\theta_V)}$, $B_{q(\sigma_b^2)}$, and $B_{q(\sigma_c^2)}$. During each step of the algorithm the most recent parameter values are used, and convergence is not assessed until each full cycle has been completed. Given the Inverse Gamma variational distributions of $\sigma_b^2$ and $\sigma_c^2$, $\mu_{q(1/\sigma_b^2)} = (A_b + \frac{K}{2})\, B_{q(\sigma_b^2)}^{-1}$ and $\mu_{q(1/\sigma_c^2)} = (A_c + \frac{K_V}{2})\, B_{q(\sigma_c^2)}^{-1}$. The terms $\mu_{q(b)}$, $\mu_{q(c)}$, $\Sigma_{q(b)}$, and $\Sigma_{q(c)}$ refer to the portions of the variational means and covariances of θ and θ_V associated with the random effect vectors.

The calculation of α at each iteration requires the use of a numerical optimization routine. The speed of this optimization procedure determines the overall speed of the algorithm since the other updates have closed forms. For the results presented here, an implementation of the BFGS Quasi-Newton method found in the built-in optim function in R was used for the multivariate optimization problem (Wright and Nocedal, 1999).

The matrices to be inverted in steps 5 and 7 of Algorithm 1 may not be positive definite, perhaps due to poor initial conditions. If an invalid covariance matrix is proposed in a step, we use a ridge adjustment, adding to each diagonal entry of the matrix to be inverted twice the absolute value of its smallest eigenvalue. None of the results presented here required these adjustments.

[Algorithm 1 (rendered only as an image in this manuscript version): coordinate ascent updates of $\mu_{q(\theta)}$, $\Sigma_{q(\theta)}$, $\mu_{q(\theta_V)}$, $\Sigma_{q(\theta_V)}$, $B_{q(\sigma_b^2)}$, and $B_{q(\sigma_c^2)}$, with an embedded Laplace step for θ_V.]
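Since Algorithm 1 survives here only as an image, the following R sketch reconstructs one plausible update cycle from the q-densities in (10) and (12) and the surrounding text. It is our reading, not the authors' code: the step ordering, the state/hyper bookkeeping, and the helper names (safe_solve, vb_cycle) are assumptions, and the supplemental R code accompanying the paper should be treated as authoritative.

    # state: mu_thetaV, Sig_thetaV, mu_inv_sb2, mu_inv_sc2 (current variational
    # parameters); hyper: A_b, B_b, A_c, B_c, sigma2_beta, sigma2_delta.

    safe_solve <- function(A) {
      # Ridge adjustment from Section 4.1: if A is not positive definite, add
      # twice the absolute value of its smallest eigenvalue to the diagonal.
      ev <- min(eigen(A, symmetric = TRUE, only.values = TRUE)$values)
      if (ev <= 0) A <- A + diag(2 * abs(ev), nrow(A))
      solve(A)
    }

    vb_cycle <- function(y, C, C_V, state, hyper, K, K_V) {
      with(c(state, hyper), {
        # gamma_i: MGF of Gaussian q(theta_V) at -C_Vi (closed form; Section 4).
        Gam <- as.vector(exp(-C_V %*% mu_thetaV +
                             0.5 * rowSums((C_V %*% Sig_thetaV) * C_V)))

        # Update q(theta): Gaussian with the precision and mean given in (10).
        P <- ncol(C)
        prior_prec <- diag(c(rep(1 / sigma2_beta, P - K), rep(mu_inv_sb2, K)))
        Sig_theta  <- safe_solve(prior_prec + crossprod(C * sqrt(Gam)))
        mu_theta   <- Sig_theta %*% crossprod(C, Gam * y)

        # Update the q(sigma_b^2) rate and the implied mean of 1/sigma_b^2.
        b_idx <- (P - K + 1):P
        B_sb2 <- B_b + 0.5 * sum(mu_theta[b_idx]^2) +
                 0.5 * sum(diag(Sig_theta)[b_idx])
        mu_inv_sb2 <- (A_b + K / 2) / B_sb2

        # Laplace step: minimize h(theta_V | .) from (10) with BFGS; the
        # inverse Hessian at the minimizer alpha gives the covariance in (12).
        r2  <- (y - C %*% mu_theta)^2 + rowSums((C %*% Sig_theta) * C)
        P_V <- ncol(C_V)
        prior_prec_V <- diag(c(rep(1 / sigma2_delta, P_V - K_V),
                               rep(mu_inv_sc2, K_V)))
        h <- function(tv) 0.5 * (sum(C_V %*% tv) + sum(r2 * exp(-C_V %*% tv)) +
                                 c(tv %*% prior_prec_V %*% tv))
        opt <- optim(as.numeric(mu_thetaV), h, method = "BFGS", hessian = TRUE)
        mu_thetaV  <- opt$par
        Sig_thetaV <- safe_solve(opt$hessian)

        # Update the q(sigma_c^2) rate and the implied mean of 1/sigma_c^2.
        c_idx <- (P_V - K_V + 1):P_V
        B_sc2 <- B_c + 0.5 * sum(mu_thetaV[c_idx]^2) +
                 0.5 * sum(diag(Sig_thetaV)[c_idx])
        mu_inv_sc2 <- (A_c + K_V / 2) / B_sc2

        list(mu_theta = mu_theta, Sig_theta = Sig_theta, mu_thetaV = mu_thetaV,
             Sig_thetaV = Sig_thetaV, mu_inv_sb2 = mu_inv_sb2,
             mu_inv_sc2 = mu_inv_sc2, B_sb2 = B_sb2, B_sc2 = B_sc2)
      })
    }

One full call of vb_cycle corresponds to one iteration of Algorithm 1; Section 4.2 describes the stopping rule.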

4.2 Convergence Criteria

In fully conjugate variational models such as those in Ormerod and Wand (2010), iteratively updating the variational parameters according to (4) guarantees monotonic increases of log p(y; q) as it converges to a (local) maximum. Embedding a Laplace approximation to handle the non-conjugate structure of q(θ_V) removes this monotonic behavior. Use of an embedded approximation procedure within a variational approximation is generally accepted, however, and can be thought of as similar to using Metropolis–Hastings within a Gibbs sampler. Nott et al. (2013) formally justify convergence of such an embedded optimization in the context of a fixed form variational Bayes algorithm within a mean field variational Bayes approximation. This is quite similar to our current setting, and we therefore conjecture that parameter updates from Algorithm 1 will converge. Empirical results described below support this conjecture.

The evidence lower bound log p(y; q) is the objective function monitored to assess convergence. Wang and Blei (2013) suggest monitoring ||μq(θV)|| to assess convergence as well. Appendix B contains the derivation of log p(y; q).
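As a sketch of this stopping rule, continuing the hypothetical objects from the Algorithm 1 sketch above (elbo_fn stands in for the Appendix B expression):

    tol      <- 1e-8
    elbo_old <- NULL
    repeat {
      state <- vb_cycle(y, C, C_V, state, hyper, K = 10, K_V = 10)
      elbo  <- elbo_fn(state)  # evidence lower bound log p(y; q), Appendix B
      if (!is.null(elbo_old) &&
          abs((elbo - elbo_old) / elbo_old) < tol) break
      elbo_old <- elbo
    }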

5 Simulation Example

To illustrate our methodology, consider two simulated datasets of N = 200 observations, both with true mean function m(x) = −(1/8)(x − 5)^3 + x and with true variance functions v1(x) = {(1/4)x + (1/2)}^3 or v2(x) = exp{(1/5)(x − 5)^2}, with x ∈ [0, 10]. We fit a model of the form described in (7) using a piecewise quadratic basis (p = r = 2). For both the mean and variance levels, K = K_V = 10 knots were placed at equally spaced quantiles of x (Ruppert et al., 2003). The hyperparameters associated with the priors for $\sigma_b^2$ and $\sigma_c^2$ are A_b = B_b = A_c = B_c = 10^{-5}. The variance hyperparameters associated with the fixed effects β and δ are $\sigma_\beta^2 = \sigma_\delta^2 = 10^5$.
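A minimal R sketch of this setup; the random seed and the uniform sampling of x are our own choices, as the paper does not state the x design:

    set.seed(1)
    N  <- 200
    x  <- sort(runif(N, 0, 10))
    m  <- function(x) -(1/8) * (x - 5)^3 + x
    v1 <- function(x) ((1/4) * x + 1/2)^3
    v2 <- function(x) exp((1/5) * (x - 5)^2)

    # Dataset 1 uses variance function v1; dataset 2 uses v2.
    y1 <- m(x) + rnorm(N, sd = sqrt(v1(x)))
    y2 <- m(x) + rnorm(N, sd = sqrt(v2(x)))

    # K = K_V = 10 knots at equally spaced quantiles of x.
    knots <- quantile(x, probs = seq(0, 1, length.out = 12)[2:11])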

Before the algorithm is started, an initial fit of y under a homoskedastic assumption is performed using the same model structure. The corresponding estimate $\hat{\theta}$ is used as an initial value of $\mu_{q(\theta)}$. The matrix $\Sigma_{q(\theta)}$ is initialized as a diagonal matrix of appropriate scale for the data, and $B_{q(\sigma_b^2)}$ and $B_{q(\sigma_c^2)}$ are initialized to 1/100.

Figure 1 shows the resulting fit of the simulated data under both variance structures using our variational approximation. The approximation closely follows the true mean function in both examples but encounters increased difficulty in areas with higher noise. The shaded areas represent 95% pointwise credible bounds determined by taking a sample of size 10000 from the converged variational distribution of q*(θ). This measure of uncertainty reflects the non-constant variance structure with wider bands being associated with areas of higher variance. The true mean function m(x) was fully covered by the pointwise credible bounds associated with each variance function.
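One way such pointwise bands can be computed, sketched under the same hypothetical naming as above (fit denotes the converged variational parameter list, e.g. the final state from the Section 4.2 loop; mvrnorm is from the MASS package):

    library(MASS)  # mvrnorm: draws from a multivariate normal

    draws  <- mvrnorm(10000, mu = as.numeric(fit$mu_theta), Sigma = fit$Sig_theta)
    curves <- draws %*% t(C)   # each row is one draw of the fitted mean curve
    bands  <- apply(curves, 2, quantile, probs = c(0.025, 0.975))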

Figure 1
N = 200 simulated observations with true mean function m(x) = −(1/8)(x − 5)3 + x (dashed line in each plot). Variance functions are v1(x) = {(1/4)x + (1/2)}3 (top) and v2(x) = exp {(1/5)(x − 5)2} (bottom). Solid lines represent ...

Figure 2 shows the estimated log variance functions log(v1(x)) and log(v2(x)) associated with the data simulated from m(x). The estimated curves associated with our variational approximation (solid lines) closely follow the true log variance functions (dashed lines). The 95% pointwise credible bounds calculated using 10000 draws from the variational distribution q*(θV) cover the true variance function over the range of x in both cases.

Figure 2
The solid lines represent the estimated log variance functions associated with the variational approximation detailed in Figure 1. The shaded area corresponds to the 95% pointwise credible bounds calculated from 10000 draws from the variational distribution ...

5.1 Comparison to MCMC

We are using an approximate form of the optimal variational density associated with θV and this variational method is itself an approximation. We now consider the quality of our approximation, as compared to the posterior distribution evaluated using a combination of Gibbs sampling and an appropriate multivariate posterior sampling algorithm; in this case, DRAM (see Section 3.1). Figure 3 compares the 95% pointwise credible intervals for both the mean and log variance curves using a simulated sample of 200 points with true curves m(x) and v1(x). The MCMC procedure was run for 10000 iterations with a burn-in length of 1000. The DRAM step consists of two stages of potential proposal acceptance at each iteration with covariance adaptation every 100 steps. All other model choices (basis functions, knots, etc.) are the same as the previous simulated results.

Figure 3
The solid light gray regions represent the pointwise 95% credible bounds drawn from the approximating variational distributions $q^*(\theta)$ and $\tilde{q}(\theta_V)$ for a simulated data set of 200 observations with m(x) and v1(x) as the true mean and variance ...

The variational approximation and the MCMC method give very similar coverage of the true mean curve. This is not surprising given the conjugate structure of the θ parameter that controls the estimate of m(x): the posterior conditional described in (8) for θ is multivariate Gaussian, as is the variational density $q^*(\theta)$. The coverage comparison for the log variance curve is more involved. As in Figure 2, the credible intervals from the variational approximation behave smoothly and cover the true function log v1(x). The credible bounds from the MCMC procedure have a more pronounced curvature and are narrower in some places. Also, the MCMC bands have some trouble covering log v1(x). This behavior can most likely be attributed to the difficulty of the multivariate sampling problem for the posterior conditional distribution p(θ_V | ·), which in this case is a 13-dimensional distribution from which we must draw samples. While DRAM performs significantly better than standard Metropolis–Hastings for this problem, it can still have difficulty fully covering the parameter space, especially if the space is complicated and potentially multi-modal. Improvements in fit may be achievable through considerable tuning (modifying adaptation length, covariance scaling, etc.) or through a more sophisticated algorithm for the multivariate sampling problem. These results highlight an additional advantage of variational approximations: they do not require specification of tuning parameters. The coverage comparisons for both the mean and log variance estimates between these two methods suggest that our method does not suffer from the potential underestimation of uncertainty sometimes associated with variational approximations (Rue et al., 2009).

5.2 Computational Performance

Convergence of Algorithm 1 is assessed by monitoring the relative change of log p(y; q). Variational algorithms are known to converge very quickly to appropriate estimates (Ormerod and Wand, 2010). Table 1 contains the run times for the variational approximations associated with each variance curve at increasing sample sizes, N. All computations were performed on a MacBook Pro with a 2.3 GHz Intel Core i5 processor and 4 GB of RAM. For all results in Table 1, seven or eight iterations of the algorithm were sufficient to achieve convergence. For comparison, the run time for 10000 iterations of the MCMC procedure described previously is included in Table 1. Our algorithm is slower than those that operate in fully conjugate situations, since a numerical optimization is embedded at each step to estimate α from the most current h(θ_V | ·). Even so, there is a dramatic speed increase compared to the MCMC method: the variational approximation runs approximately 20 to 60 times faster on the simulated cases, depending on sample size. This reduction of computational cost, coupled with the absence of manual tuning parameters, demonstrates the value of these variational approximations. Further computational improvements may be achieved through the use of a custom optimization routine rather than off-the-shelf implementations.

Table 1
Run times (seconds) for the variational approximations and MCMC procedures described in Section 5 at various sample sizes. The MCMC procedure was run for 10000 iterations in each case. All variational approximations required seven or eight iterations ...

6 Application to Inference from SAXS

The model (7) is directly motivated by the analysis of one-dimensional scattering intensity curves from small angle X-ray scattering (SAXS) experiments (reviewed in Putnam et al. (2007)). The variance of SAXS data depends on angle, due to both characteristics of the molecule and characteristics of the detector. The boundary regions of the detector, associated with larger angles of X-ray scattering, have a smaller signal-to-noise ratio since less scattering occurs at those large angles. This causes greater pixel-to-pixel variation for the intensity values in these regions.

Figure 4 shows data from a single SAXS exposure and the resulting variational approximation fits of the nonparametric mean function, using the approach of Ormerod and Wand (2010) for a model with constant errors (top panel) and using our approach for a model with heteroskedastic errors (bottom panel). The data correspond to the log intensity values of a 7-second exposure of a 4 mg/ml sample of the H2AH2B complex (Isenberg, 1979). The x-axis of Figure 4 represents the radial distance from the center of the sensor. For this problem, we used K = 40 equally-spaced knots κ1, …, κ40 over [0.03, 0.27] to evaluate spline functions for both the mean and log variance curves.

The nonparametric regression model uses a truncated quadratic polynomial basis as in (7), with the exception that there is no linear fixed effect term in the mean (β1 = 0). This is theoretically justified from physical considerations. Provided the first knot κ1 is sufficiently close to the origin, a classical analysis yields that the expected log intensity at distance x is $\beta_0 - \frac{1}{3} R_g^2 x^2$, where $R_g^2$ is the squared radius of gyration of the molecule (Guinier, 1939). The use of a truncated quadratic basis is therefore particularly convenient for the estimation of $R_g^2 = -3\beta_2$, where β2 is the coefficient of the x^2 fixed effect. While truncated polynomial basis functions are highly collinear, we did not experience noticeable computational or numerical problems arising from this choice of basis for the SAXS data. As mentioned in Eilers and Marx (2010), the singular value decomposition of the random effects matrix Z can aid in understanding the numerical stability of this basis.
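Matching coefficients between the Guinier form and the quadratic fixed effect makes the estimator explicit:

$\log I(x) \approx \beta_0 - \tfrac{1}{3} R_g^2 x^2 = \beta_0 + \beta_2 x^2 \quad \Longrightarrow \quad \beta_2 = -\tfrac{1}{3} R_g^2, \qquad R_g^2 = -3\beta_2.$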

The mean estimates for both the constant error and heteroskedastic models fit the data equally well, with mean squared errors of 17.36 and 17.02 respectively. As shown in Figure 4, the uncertainty bands associated with the constant error model are fairly uniform across x. The heteroskedastic version, on the other hand, has narrow uncertainty bounds at the low to mid regions of x, reflecting the pronounced smoothness there, while being wider in the large x regions. This behavior better reflects the true uncertainty of the data about the underlying smooth mean function across the covariate x, which translates into more appropriate uncertainty estimates for the regression coefficients that comprise θ.

Appropriately accounting for the error structure of SAXS data can impact inference on important physical parameters. Here we consider estimation of the squared radius of gyration, $R_g^2 = -3\beta_2$, and the intercept of the log intensity curve, β0, which is known to be linearly related to log molecular weight, given fixed concentration and other experimental parameters (Fischer et al., 2009). Our variational method yields approximate posterior distributions for both quantities; the resulting point estimates and credible bounds under the constant-error and heteroskedastic models are summarized in Table 2.
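A sketch of how such intervals follow from the variational fit: under q*, θ is Gaussian, so draws of the β2 coordinate transform directly into draws of R_g^2 = −3β2. The coordinate index and object names below are hypothetical, following the earlier sketches:

    # With no linear term (beta_1 = 0), suppose theta = (beta_0, beta_2, b_1, ..., b_K),
    # so beta_2 sits in position 2 of mu_theta; adjust idx_b2 to the actual layout.
    idx_b2   <- 2
    b2_draws <- rnorm(10000,
                      mean = as.numeric(fit$mu_theta[idx_b2]),
                      sd   = sqrt(fit$Sig_theta[idx_b2, idx_b2]))
    rg2      <- -3 * b2_draws                    # draws of the squared radius of gyration
    quantile(rg2, probs = c(0.025, 0.5, 0.975))  # interval and point summaries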

7 Discussion

Variational approximations are increasingly popular tools for approximating computationally difficult problems in an efficient manner. Recent work in the literature has focused on the application of mean field variational Bayes methods to non-conjugate problems. This includes the problem presented here: semiparametric regression with heteroskedastic errors. Our work provides a fast variational approximation by leveraging a Laplace approximation to replace the non-conjugate portion of our model with an appropriate Gaussian distribution. Simulation results show that our approximation performs well at fitting both the mean and variance functions, providing posterior inferences comparable to those of a sophisticated MCMC, at much lower computational cost. The ability to fit variance functions accurately is important in accounting for the true uncertainty in nonparametric fits of mean level data, as we have shown in obtaining corrected interval estimates for key physical parameters of H2AH2B using SAXS imaging data.

Under the mixed model framework, extending our variational approximation relies only on the inclusion of additional random effect variance terms. While not shown explicitly here, our variational approximation readily handles more complex heteroskedastic models. An area of future interest would be further investigation of how to use our variational method for model comparisons in a semiparametric regression context. Nott et al. (2012) highlight the use of the variational lower bound log p(y; q) for model comparison. We are interested in the use of log p(y; q) in interpreting the significance of parametric-by-nonparametric interaction terms, which have scientific meaning in SAXS data analysis. For example, a significant interaction between parametric exposure time and nonparametric angle is suggestive of radiation damage to the molecule in solution.

Radial variance data are typically available alongside radial average data in a SAXS experiment. We are currently extending our variational approximations to handle joint models of radial average data and radial variance data in a complete SAXS experiment, consisting of repeated samples with varying concentrations and exposure times. The goal is to provide a fast, complete framework for the analysis of SAXS data and provide more appropriate uncertainty estimates for inference on structural features.

Table 2
Estimated intercept $\hat{\beta}_0$ and squared radius of gyration $\hat{R}_g^2$ of the log intensity curve for the complex H2AH2B arising from variational approximations of both constant-error and heteroskedastic models. The corresponding credible bounds are derived ...

Supplementary Material


Acknowledgment

This research was supported by the Joint NSF/NIGMS Initiative to Support Research in the Area of Mathematical Biology (R01GM096192). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute of General Medical Sciences or the National Institutes of Health. This work was completed as a part of Bruce Bugbee’s Ph.D. dissertation work while a student at Colorado State University.

A Derivation of q*

The following section contains the derivation of the q-densities described in (10).

A.1 $q^*(\theta)$

Using the form of p(θ | ·) from (8), we can derive $q^*(\theta)$:

$q^*(\theta) \propto \exp\left[E_{-\theta}\{\log p(\theta \mid \cdot)\}\right] \propto \exp\left[E_{-\theta}\left\{-\tfrac{1}{2}\left(\theta - M C^T \Sigma^{-1} y\right)^T M^{-1} \left(\theta - M C^T \Sigma^{-1} y\right)\right\}\right] \propto \exp\left\{-\tfrac{1}{2} E_{-\theta}\left(\theta^T M^{-1} \theta - y^T \Sigma^{-1} C \theta - \theta^T C^T \Sigma^{-1} y\right)\right\}.$

The quadratic term is

$E_{-\theta}(\theta^T M^{-1} \theta) = E_{-\theta}\left\{\theta^T \left(\Sigma_\theta^{-1} + C^T \Sigma^{-1} C\right) \theta\right\} = \theta^T \left\{\operatorname{blockdiag}\left(\sigma_\beta^{-2} I,\; \mu_{q(1/\sigma_b^2)} I_K\right) + C^T \Gamma C\right\} \theta.$

Define $\Sigma_{q(\theta)} = \{\operatorname{blockdiag}(\sigma_\beta^{-2} I,\, \mu_{q(1/\sigma_b^2)} I_K) + C^T \Gamma C\}^{-1}$. Here Γ = diag(γ1, γ2, …, γN); in this context $\gamma_i = E_{-\theta}\{\exp(-C_{Vi}^T \theta_V)\}$ corresponds to the moment generating function of $q^*(\theta_V)$ evaluated at $-C_{Vi}$.

The expected value of the linear term is

$E_{-\theta}(y^T \Sigma^{-1} C \theta) = E_{-\theta}\left\{\sum_{i=1}^N y_i\, (C_i^T \theta)\, \frac{1}{\sigma_i^2}\right\} = y^T \Gamma C \theta.$

Using these results and completing the square, we arrive at the form of the optimal qθ:

$q^*(\theta) \propto \exp\left\{-\tfrac{1}{2}\left(\theta - \Sigma_{q(\theta)} C^T \Gamma y\right)^T \Sigma_{q(\theta)}^{-1} \left(\theta - \Sigma_{q(\theta)} C^T \Gamma y\right)\right\} \;\Longrightarrow\; \theta \sim_q \mathcal{N}\left(\mu_{q(\theta)} = \Sigma_{q(\theta)} C^T \Gamma y,\; \Sigma_{q(\theta)}\right).$

A.2 $q^*(\sigma_b^2)$

The variational density for σb2 is

$q^*(\sigma_b^2) \propto \exp\left[E_{-\sigma_b^2}\{\log p(\sigma_b^2 \mid \cdot)\}\right] \propto \exp\left[E_{-\sigma_b^2}\left\{-\left(A_b + \tfrac{K}{2}\right)\log(\sigma_b^2) - \tfrac{1}{\sigma_b^2}\left(B_b + \tfrac{\|b\|^2}{2}\right)\right\}\right]$
$\propto (\sigma_b^2)^{-A_b - \frac{K}{2} - 1} \exp\left[-\tfrac{1}{\sigma_b^2}\left\{B_b + \tfrac{1}{2} E_{-\sigma_b^2}(\|b\|^2)\right\}\right] = (\sigma_b^2)^{-A_b - \frac{K}{2} - 1} \exp\left[-\tfrac{1}{\sigma_b^2}\left\{B_b + \tfrac{1}{2}\|\mu_{q(b)}\|^2 + \tfrac{1}{2}\operatorname{trace}(\Sigma_{q(b)})\right\}\right]$
$\Longrightarrow\; \sigma_b^2 \sim_q \mathcal{IG}\left(A_b + \tfrac{K}{2},\; B_{q(\sigma_b^2)} = B_b + \tfrac{1}{2}\left\{\|\mu_{q(b)}\|^2 + \operatorname{trace}(\Sigma_{q(b)})\right\}\right).$

Here μq(b) and Σq(b) refer to the portion of μq(θ) and Σq(θ) associated with the random effects parameters. Under this variational distribution,

$E\left[\frac{1}{\sigma_b^2}\right] = \frac{A_b + \frac{K}{2}}{B_{q(\sigma_b^2)}}.$

A.3 $q^*(\sigma_c^2)$

Given the similar structure between p(σb2 | ·) and p(σc2 | ·), the derivation of qσc2 is the same:

$q^*(\sigma_c^2) \propto (\sigma_c^2)^{-A_c - \frac{K_V}{2} - 1} \exp\left[-\tfrac{1}{\sigma_c^2}\left\{B_c + \tfrac{1}{2}\|\mu_{q(c)}\|^2 + \tfrac{1}{2}\operatorname{trace}(\Sigma_{q(c)})\right\}\right] \;\Longrightarrow\; \sigma_c^2 \sim_q \mathcal{IG}\left(A_c + \tfrac{K_V}{2},\; B_{q(\sigma_c^2)} = B_c + \tfrac{1}{2}\left\{\|\mu_{q(c)}\|^2 + \operatorname{trace}(\Sigma_{q(c)})\right\}\right).$

As before, $\mu_{q(c)}$ and $\Sigma_{q(c)}$ refer to the portion of $\mu_{q(\theta_V)}$ and $\Sigma_{q(\theta_V)}$ associated with the random effects parameters. As with $1/\sigma_b^2$,

$E\left[\frac{1}{\sigma_c^2}\right] = \frac{A_c + \frac{K_V}{2}}{B_{q(\sigma_c^2)}}.$

A.4 $q^*(\theta_V)$

As highlighted in (8), the posterior conditional p(θV | ·) does not have a known form. Using (4), the optimal q-density for θV is

$q^*(\theta_V) \propto \exp\left[E_{-\theta_V}\{\log p(\theta_V \mid \cdot)\}\right] \propto \exp\left(-\tfrac{1}{2}\left[\sum_{i=1}^N C_{Vi}^T \theta_V + \sum_{i=1}^N \left\{\left(y_i - C_i^T \mu_{q(\theta)}\right)^2 + C_i^T \Sigma_{q(\theta)} C_i\right\} \exp(-C_{Vi}^T \theta_V) + \theta_V^T \operatorname{blockdiag}\left(\sigma_\delta^{-2} I,\; \mu_{q(1/\sigma_c^2)} I_{K_V}\right) \theta_V\right]\right).$

B Derivation of log p(y; q)

Let Γ(x) denote the standard gamma function. Recall that the density q(ψ) is assumed to have the product density structure described in Section 4. Then

$\log p(y; q) = \int_\Psi q(\psi)\left[\log p(y, \psi) - \log q(\psi)\right] d\psi = \int_\Psi q(\psi)\left[\log p(y \mid \psi) + \log p(\theta) + \log p(\sigma_b^2) + \log p(\theta_V) + \log p(\sigma_c^2) - \log q(\psi)\right] d\psi$
$= -\frac{N}{2}\log(2\pi) - \frac{p}{2}\log(\sigma_\beta^2) - \frac{r}{2}\log(\sigma_\delta^2) + A_b \log(B_b) - \log\Gamma(A_b) + A_c \log(B_c) - \log\Gamma(A_c) + \frac{1}{2}\log\left|\Sigma_{q(\theta)}\right| + \frac{1}{2}\log\left|\Sigma_{q(\theta_V)}\right|$
$\quad - \left(A_b + \frac{K}{2}\right)\log\left(B_{q(\sigma_b^2)}\right) - \left(A_c + \frac{K_V}{2}\right)\log\left(B_{q(\sigma_c^2)}\right) + \log\Gamma\left(A_b + \frac{K}{2}\right) + \log\Gamma\left(A_c + \frac{K_V}{2}\right)$
$\quad - \frac{1}{2\sigma_\beta^2}\left(\|\mu_{q(\beta)}\|^2 + \operatorname{trace}(\Sigma_{q(\beta)})\right) - \frac{1}{2\sigma_\delta^2}\left(\|\mu_{q(\delta)}\|^2 + \operatorname{trace}(\Sigma_{q(\delta)})\right) - \frac{1}{2}\sum_{i=1}^N C_{Vi}^T \mu_{q(\theta_V)}$
$\quad - \frac{1}{2}\sum_{i=1}^N \left\{\left(y_i - C_i^T \mu_{q(\theta)}\right)^2 + C_i^T \Sigma_{q(\theta)} C_i\right\} \exp\left(-C_{Vi}^T \mu_{q(\theta_V)} + \frac{1}{2} C_{Vi}^T \Sigma_{q(\theta_V)} C_{Vi}\right) + \frac{p + K}{2} + \frac{r + K_V}{2}.$

Footnotes

Supplemental Materials: Variational Method Code. R code for performing the variational approximation method described in this paper is included. Also with this code are instructions for creating the simulated data sets described here as well as the experimental SAXS data used. (Zip file)

References

  • Attias H. A Variational Bayesian Framework for Graphical Models. Advances in Neural Information Processing Systems. 2000;12:209–215.
  • Bishop CM. Pattern Recognition and Machine Learning. Information Science and Statistics, Springer; 2006.
  • Blei DM, Lafferty JD. A Correlated Topic Model of Science. The Annals of Applied Statistics. 2007;1:17–35.
  • Breidt FJ, Erciulescu A, van der Woerd M. Autocovariance Structures for Radial Averages in Small-Angle X-Ray Scattering Experiments. Journal of Time Series Analysis. 2012;33:704–717.
  • Carlin BP, Polson NG. Inference for Nonconjugate Bayesian Models Using the Gibbs Sampler. Canadian Journal of Statistics. 1991;19:399–405.
  • Casella G, George EI. Explaining the Gibbs Sampler. The American Statistician. 1992;46:167–174.
  • Chan D, Kohn R, Nott D, Kirby C. Locally Adaptive Semiparametric Estimation of the Mean and Variance Functions in Regression Models. Journal of Computational and Graphical Statistics. 2006;15:915–936.
  • Crainiceanu CM, Ruppert D, Carroll RJ, Joshi A, Goodner B. Spatially Adaptive Bayesian Penalized Splines with Heteroscedastic Errors. Journal of Computational and Graphical Statistics. 2007;16:265–288.
  • Eilers PH, Marx BD. Splines, Knots, and Penalties. Wiley Interdisciplinary Reviews: Computational Statistics. 2010;2:637–653.
  • Eilers PHC, Marx BD. Flexible Smoothing with B-splines and Penalties. Statistical Science. 1996;11:89–102.
  • Faes C, Ormerod J, Wand M. Variational Bayesian Inference for Parametric and Nonparametric Regression with Missing Data. Journal of the American Statistical Association. 2011;106:959–971.
  • Fischer H, de Oliveira Neto M, Napolitano H, Polikarpov I, Craievich A. Determination of the Molecular Weight of Proteins in Solution from a Single Small-Angle X-Ray Scattering Measurement on a Relative Scale. Journal of Applied Crystallography. 2009;43:101–109.
  • Guinier A. La Diffraction des Rayons X aux Très Petits Angles: Application à l’Étude de Phénomènes Ultramicroscopiques. Annales de Physique. 1939;12:161–237.
  • Haario H, Laine M, Mira A, Saksman E. DRAM: Efficient Adaptive MCMC. Statistics and Computing. 2006;16:339–354.
  • Isenberg I. Histones. Annual Review of Biochemistry. 1979;48:159–191.
  • Jaakkola T, Jordan M. A Variational Approach to Bayesian Logistic Regression Models and Their Extensions. Sixth International Workshop on Artificial Intelligence and Statistics.1997.
  • Knowles DA, Minka T. Non-Conjugate Variational Message Passing for Multinomial and Binary Regression. Advances in Neural Information Processing Systems. 2011:1701–1709.
  • Kullback S, Leibler RA. On Information and Sufficiency. The Annals of Mathematical Statistics. 1951;22:79–86.
  • Minka T, Winn J, Guiver J, Knowles D. Infer.NET 2.4. Microsoft Research Cambridge. 2010 http://research.microsoft.com/infernet.
  • Nott DJ, Tran M-N, Kuk AY, Kohn R. Efficient Variational Inference for Generalized Linear Mixed Models with Large Datasets. arXiv preprint arXiv:1307.7963. 2013
  • Nott DJ, Tran M-N, Leng C. Variational Approximation for Heteroscedastic Linear Models and Matching Pursuit Algorithms. Statistics and Computing. 2012;22:497–512.
  • Ormerod J, Wand M. Explaining Variational Approximations. The American Statistician. 2010;64:140–153.
  • Pham TH, Ormerod JT, Wand M. Mean Field Variational Bayesian Inference for Nonparametric Regression with Measurement Error. Computational Statistics & Data Analysis. 2013;68:375–387.
  • Putnam C, Hammel M, Hura G, Tainer J. X-Ray Solution Scattering (SAXS) Combined with Crystallography and Computation: Defining Accurate Macromolecular Structures, Conformations and Assemblies in Solution. Quarterly Reviews of Biophysics. 2007;40:191–285.
  • Rue H, Martino S, Chopin N. Approximate Bayesian Inference for Latent Gaussian models by Using Integrated Nested Laplace Approximations. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2009;71:319–392.
  • Ruppert D, Wand MP, Carroll RJ. Semiparametric Regression. Vol. 12. Cambridge University Press; 2003.
  • Salimans T, Knowles DA. Fixed-Form Variational Posterior Approximation Through Stochastic Linear Regression. Bayesian Analysis. 2013;8:837–882.
  • Tan LS, Nott DJ. Variational Inference for Generalized Linear Mixed Models Using Partially Noncentered Parametrizations. Statistical Science. 2013;28:168–188.
  • Titterington D. Bayesian Methods for Neural Networks and Related Models. Statistical Science. 2004;19:128–139.
  • Wainwright MJ, Jordan MI. Graphical Models, Exponential Families, and Variational Inference. Foundations and Trends in Machine Learning. 2008;1:1–305.
  • Wand M, Ormerod J. On Semiparametric Regression with O’Sullivan Penalized Splines. Australian & New Zealand Journal of Statistics. 2008;50:179–198.
  • Wand MP, Ormerod JT, Padoan SA, Frühwirth R. Mean Field Variational Bayes for Elaborate Distributions. Bayesian Analysis. 2011;6:1–48.
  • Wang C, Blei DM. Variational Inference in Nonconjugate Models. The Journal of Machine Learning Research. 2013;14:1005–1031.
  • Wang S, Wand M. Using Infer.NET for Statistical Analyses. The American Statistician. 2011;65:115.
  • Wright S, Nocedal J. Numerical Optimization. Vol. 2. Springer; New York: 1999.