Comput Stat Data Anal. Author manuscript; available in PMC Nov 30, 2010.
PMCID: PMC2994020; NIHMSID: NIHMS238462
A Flexible Approach to Bayesian Multiple Curve Fitting
Carsten H. Botts and Michael J. Daniels
Carsten H. Botts, Department of Mathematics and Statistics, Williams College, Williamstown, Massachusetts 01267, USA, cbotts@williams.edu
We model sparse functional data from multiple subjects with a mixed-effects regression spline. In this model, the expected values for any subject (conditioned on the random effects) can be written as the sum of a population curve and a subject-specific deviation from this population curve. The population curve and the subject-specific deviations are both modeled as free-knot B-splines with k and k′ knots located at tk and tk′, respectively. To identify the number and location of the "free" knots, we sample from the posterior p(k, tk, k′, tk′|y) using reversible jump MCMC methods. Sampling from this posterior distribution is complicated, however, by the flexibility we allow for the model's covariance structure. No restrictions (other than positive definiteness) are placed on the covariance parameters ψ and σ2 and, as a result, no analytical form for the likelihood p(y|k, tk, k′, tk′) exists. In this paper, we consider two approximations to p(y|k, tk, k′, tk′) and then sample from the corresponding approximations to p(k, tk, k′, tk′|y). We also sample from p(k, tk, k′, tk′, ψ, σ2|y), which has a likelihood that is available in closed form. While sampling from this larger posterior is less efficient, the resulting marginal distribution of knots is exact and allows us to evaluate the accuracy of each approximation. We then consider a real data set and explore the difference between p(k, tk, k′, tk′, ψ, σ2|y) and the more accurate approximation to p(k, tk, k′, tk′|y).
Keywords: B-splines, Laplace approximation, Reversible jump MCMC, Unit-information prior
Social and physical scientists are often interested in how certain variables depend on one another. They often assume that a functional relationship exists between two such variables, and they run experiments or collect samples to learn about this relationship. They are aware, however, that the data do not follow a deterministic equation, but rather the general stochastic equation
Y = f(x) + ε,    (1)
where Y is a dependent variable, x is an independent or controlled variable, and ε is random error with mean 0. One of the simplest models relating a response, Y, to a single predictor, x, is the pth-order polynomial regression model
Y = β0 + β1x + β2x^2 + ⋯ + βp x^p + ε.
While this model is appealing and can be very useful, it may be inadequate since it assumes that only one polynomial describes the average relationship between Y and x. A more complex model would allow several polynomials to do this. An example of such a model is a piecewise polynomial regression spline. A piecewise polynomial regression spline with k knots divides the domain of f(x) into k + 1 mutually exclusive regions, and to each region corresponds a unique polynomial describing the average relationship between Y and x. With different polynomials describing different parts of the function, f(x) is not restricted to have the same smoothness throughout its domain. A piecewise polynomial regression spline of order p with k knots at tk = (t1, t2, …, tk) is written as
f(x) = Σ_{j=0}^{p} βj x^j + Σ_{j=1}^{k} βp+j (x − tj)+^p,    (2)
where β = (β0, β1, …, βp+k) is a fixed set of parameters and (x − tj)+^p = max(0, x − tj)^p.
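As a concrete illustration, the design matrix implied by (2) can be assembled in a few lines. This is a minimal NumPy sketch; the function name and defaults are ours, not the paper's:

```python
import numpy as np

def truncated_power_basis(x, knots, p=3):
    """Design rows for the piecewise polynomial spline in (2):
    the polynomial terms x^0, ..., x^p followed by one truncated
    power term (x - t_j)_+^p = max(0, x - t_j)^p per knot."""
    x = np.asarray(x, dtype=float)
    poly = np.vander(x, p + 1, increasing=True)               # 1, x, ..., x^p
    trunc = np.maximum(x[:, None] - np.asarray(knots), 0.0) ** p
    return np.hstack([poly, trunc])

# p + 1 + k columns: here p = 2 and k = 1 gives 4 columns
X = truncated_power_basis(np.linspace(0, 1, 5), knots=[0.5], p=2)
```

Each row of X is the design vector for one observation; fitting β by least squares then gives the spline estimate of f.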
A natural polynomial regression spline takes a form nearly identical to the piecewise regression spline seen above. It is different, however, in that it restricts the function to be linear at the boundaries. This sets the last term on the right-hand side of (2) to 0. A natural polynomial regression spline of order p thus takes the form
equation M5
If f (x) were modeled as such, (1) could then be re-written as
equation M6
where equation M7 and β = (β0, β1, β2, …, β1+k), and for m observations collected at (x1, x2, …, xm), it would be written as
equation M8
(3)
The natural polynomial regression spline written in Equation (3) illustrates the concept behind splines, but the design matrix composed of the row vectors c(x1, tk), …, c(xm, tk) is numerically unstable and thus rarely used. B-splines, which are computationally more stable, are preferred (Hastie and Tibshirani, 1990; Zhou and Shen, 2001). The functional form of a B-spline is more complex than that of the natural polynomial regression spline, but is readily available (de Boor, 1978). We write it as

f(x) = b(x, tk)α,

where b(x, tk) is the design vector associated with a B-spline evaluated at x with k knots at tk = (t1, t2, …, tk), and α is a fixed set of parameters. For m observations collected at (x1, …, xm), the model would be written as

y = B(tk)α + ε,

where B(tk) is the matrix with rows b(x1, tk), …, b(xm, tk).
With this model, estimating f(x) becomes a problem of estimating the number of knots, k, the locations of those knots, tk, and α.
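In practice the design vectors b(xj, tk) come from a B-spline routine. The sketch below uses SciPy's B-spline design matrix with cubic splines; note that unconstrained cubic B-splines with k interior knots give k + 4 basis functions, whereas the natural-spline parameterization used later in the paper has dim(α) = k + 2, so the dimensions here are illustrative only:

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_design(x, interior_knots, a=0.0, b=1.0, degree=3):
    """Rows b(x_i, t_k) of a B-spline design matrix, the numerically
    stable alternative to the truncated power basis.  The boundary
    knots a and b are repeated degree + 1 times, as is conventional."""
    t = np.r_[[a] * (degree + 1), interior_knots, [b] * (degree + 1)]
    return BSpline.design_matrix(np.asarray(x, float), t, degree).toarray()

x = np.linspace(0.05, 0.95, 50)
B = bspline_design(x, interior_knots=[0.3, 0.7])
# each row sums to one: the B-spline basis is a partition of unity
```

The partition-of-unity and local-support properties of this basis are what make it better conditioned than the truncated power basis.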
A variety of methods have been developed for estimating k and tk. Halpern (1973) approached this problem using Bayesian methods: allowing knots to be placed only at the design points of the experiment, he considered all subsets of the design points, assigned prior probabilities to these subsets, and calculated the corresponding posterior probabilities; his estimator of the function was based on these posterior probabilities. Denison, Mallick, and Smith (1998) placed priors on the number of knots, k, and their locations, tk. With these priors, they calculated a joint posterior distribution over k and tk and sampled from it using reversible jump MCMC methods. They too restricted the knots to lie at the design points of the experiment. DiMatteo, Genovese, and Kass (2001) proposed a method similar to that of Denison et al.; they did not restrict the knots to the design points, and they penalized models with unnecessarily large values of k by averaging over the fixed effect coefficients, α.
Friedman (1991) and Stone et al. (1997) approached knot selection using backward and/or forward stepwise selection, continuing until the "best" model had been identified. Zhou and Shen (2001) used an alternative method to identify the locations of the knots, constructing an algorithm which favored adding knots in locations where several knots had already been added. Lindstrom (1999) used similar methods when selecting knot locations.
Splines have also been used to model curves that vary within and between subjects (Shi et al., 1996; Rice and Wu, 2001; Behseta et al., 2005; Crainiceanu et al., 2005). Brumback and Rice (1998) used splines to model nested functions which vary between subjects and between groups of subjects. Functional models which only account for variation between and within subjects take the general form
Yi(xj) = f(xj) + Gi(xj) + εij,    (4)
where Yi(xj) is the observation of the ith individual at xj, f(x) can be thought of as a population curve, Gi(x) is a random curve specific to subject i, and εij is random error. While these functions can be modeled in a variety of ways, we model f and Gi as free-knot B-splines with k and k′ knots located at tk and tk′, respectively. Setting f(x) = b(x; tk)α and Gi(x) = b(x; tk′)γi, where the γi are independent random vectors such that γi ~ N(0, Σγ), equation (4) becomes
Yi(xj) = b(xj; tk)α + b(xj; tk′)γi + εij.    (5)
Bigelow and Dunson (2007) consider a variation of this model and identify the locations of the knots by sampling from the full posterior distribution. They lessen the computational burden of this sampling procedure by setting tk′ = tk and restricting Σγ to be diagonal. In this paper, we consider a more flexible model and explore the computational issues associated with it. More specifically, we eliminate the restrictions imposed by Bigelow and Dunson and identify the number and location of the knots by sampling from p(k, tk, k′, tk′|y). This posterior cannot be sampled from exactly, however, because eliminating the diagonal restriction on Σγ makes the likelihood p(y|k, tk, k′, tk′) intractable. We thus sample from two approximations to this posterior; each approximation corresponds to a different approximation to the likelihood p(y|k, tk, k′, tk′). To assess the accuracy of these approximations, we compare the posterior distribution of knots in each case to the marginal distribution of knots obtained when sampling from p(k, tk, k′, tk′, ψ, σ2|y), where ψ is a set of covariance parameters and σ2 is the within-subject variability. The marginal distribution of knots obtained from this higher-dimensional posterior is exact, as the corresponding likelihood is available in closed form. Sampling from this larger posterior, however, is less efficient because of the additional parameters that need to be sampled.
In Section 2, we discuss the linear model given in (5) in more detail. In Section 3, we discuss the likelihood of interest, p(y|k, tk, k′, tk′), and the two methods we use to approximate it. Section 4 presents the algorithm used to sample from the resulting approximate posterior distributions, and Section 5 presents the algorithm used to sample from p(k, tk, k′, tk′, ψ, σ2|y). Section 6 compares the approximations theoretically, and in Section 7, we describe a simulation study performed to explore the differences in these approximations in practice. In Section 8, we apply the preferred approximation to a real data set.
Let Yi (xj) be the observed value of the ith curve (i = 1, …, n) at xj. We specify the following mixed model for functional data,
Yi(xj) = b(xj, tk)α + b(xj, tk′)γi + εij,    (6)
where b(xj, tq) is the design vector of a B-spline with q knots at tq, k is the number of fixed effect knots at locations tk, k′ is the number of random effect knots at tk′, α is a fixed set of parameters with dim(α) = k + 2, γi are independent and identically distributed random vectors such that γi ~ N(0, Σγ) with dim(γi) = k′ + 2, εij are independent random variables with εij ~ N(0, σ2) for all pairs (i, j), and the random vectors γi are independent of all random variables εij. The appealing feature of this model is its flexibility; recall that no restriction (other than positive definiteness) is placed on Σγ, and that the random and fixed effect knots are not set equal to one another. We do assume, however, that all individuals have the same random effect knots, and that k′ ≤ k. When the number of observations per subject is small, such assumptions are needed, as they limit the complexity of the models proposed in the Markov chain we consider.
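To make the structure of model (6) concrete, data can be simulated as follows. This is a hedged sketch: the design matrices are random stand-ins for the B-spline rows b(xj, tk) and b(xj, tk′), and all dimension choices are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative dimensions: n subjects, m observations each,
# dim(alpha) = k + 2 and dim(gamma_i) = k' + 2 as in model (6).
n, m, k, kp, sigma = 25, 5, 3, 1, 0.1
alpha = rng.normal(size=k + 2)
A = rng.normal(size=(kp + 2, kp + 2))
Sigma_gamma = A @ A.T + 0.1 * np.eye(kp + 2)   # unrestricted, positive definite

Xf = rng.normal(size=(m, k + 2))    # stand-in for rows b(x_j, t_k)
Xr = rng.normal(size=(m, kp + 2))   # stand-in for rows b(x_j, t_k')

def simulate_subject():
    """One subject's m responses: fixed curve + subject deviation + noise."""
    gamma_i = rng.multivariate_normal(np.zeros(kp + 2), Sigma_gamma)
    eps = rng.normal(scale=sigma, size=m)
    return Xf @ alpha + Xr @ gamma_i + eps

y = np.stack([simulate_subject() for _ in range(n)])   # n x m data matrix
```

The key point is the three additive layers per observation: a shared fixed effect curve, a subject-specific Gaussian random effect curve with unrestricted covariance, and i.i.d. measurement error.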
We place prior distributions on k, k′, tk, tk′, α, Σγ, and σ2. Poisson priors are given to k and k′. Additionally, since there is no reason to favor knots at any particular location on the domain of f and G, flat priors are placed on both tk and tk′. These priors are written as
p(tq | q) = q!/(b − a)^q,    a < tq(1) < tq(2) < ⋯ < tq(q) < b,

where tq(j) is the jth smallest knot in the vector tq, and (a, b) is the domain of f and G.
For Σγ and σ2, we place prior distributions on transformed values of these parameters. The transformations we consider create a set of unconstrained parameters which are preferable to work with since their sampling distributions can be well approximated with the normal density. For Σγ, we applied the modified Cholesky parameterization proposed by Pourahmadi (1999), which decomposes the inverse of the covariance matrix as Σγ−1 = Tγ′ Dγ−1 Tγ, where
Tγ is the unit lower triangular matrix with −ϕj,l in position (j, l) for l < j, and Dγ = diag(δ1, δ2, …, δk′+2).
These parameters can be calculated directly from the elements of the covariance matrix using the functions g(·;·) and h(·;·), where δj = g(Σγ; j) = σjj − σj′ Σγ,j−1 σj for 2 ≤ j ≤ p (with δ1 = σ11), ϕj = (ϕj,1, ϕj,2, …, ϕj,j−1)′ = h(Σγ; j) = Σγ,j−1 σj, Σγ,j is the minor matrix within Σγ composed of its first j − 1 columns and rows, and σj is the vector composed of the first j − 1 elements of the jth column of Σγ. To ease our notational burden, we refer to these modified Cholesky parameters collectively as ψ = (log δ1, …, log δp, ϕ2′, …, ϕp′)′, and we replace the symbol Σγ with ψ (or, to illustrate that Σγ is a function of ψ, with Σγ(ψ)). All of the elements in the vector ψ are unconstrained, and a multivariate normal unit-information prior is assigned to them. The mean of the unit-information prior on ψ is the maximum likelihood estimate of the parameters, ψ̂, and the variance is inversely proportional to approximately one unit of information on ψ. The prior can be written as
equation M21
where
equation M22
p(y|k, tk, k′, tk′, α, ψ, σ2) is the distribution of all the observed data in the study, n is the number of subjects in the study, and α̂, ψ̂, and σ̂ are the maximum likelihood estimates of α, ψ, and σ, respectively. The exact formula for Iψ is given in the Appendix.
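The maps g and h above amount to successive regressions of each element of γ on its predecessors. The sketch below is our own implementation of this standard decomposition (not the paper's code), together with its inverse:

```python
import numpy as np

def modified_cholesky(Sigma):
    """Pourahmadi (1999) parameterization: Sigma^{-1} = T' D^{-1} T.
    Returns the unconstrained parameters log(delta_j) (log innovation
    variances) and phi_j (regression coefficients), as in the text."""
    p = Sigma.shape[0]
    log_delta = np.empty(p)
    phis = []
    log_delta[0] = np.log(Sigma[0, 0])
    for j in range(1, p):
        S_j = Sigma[:j, :j]                # leading minor Sigma_{gamma,j}
        s_j = Sigma[:j, j]                 # first j entries of column j
        phi_j = np.linalg.solve(S_j, s_j)  # phi_j = Sigma_j^{-1} sigma_j
        log_delta[j] = np.log(Sigma[j, j] - s_j @ phi_j)
        phis.append(phi_j)
    return log_delta, phis

def reconstruct(log_delta, phis):
    """Invert the map: rebuild Sigma from (log_delta, phis)."""
    p = len(log_delta)
    T = np.eye(p)
    for j, phi_j in enumerate(phis, start=1):
        T[j, :j] = -phi_j
    D = np.diag(np.exp(log_delta))
    Tinv = np.linalg.inv(T)
    return Tinv @ D @ Tinv.T
```

Because log(δj) and the ϕj,l are unconstrained reals, any value of ψ maps back to a valid positive definite Σγ, which is exactly why the normal prior (and the Laplace approximation later) works on this scale.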
For σ2, a unit-information prior was placed on log (σ2). The prior can be written as
equation M23
where
equation M24
and mi is the number of observations for subject i. The exact formula for Ilog(σ2), i is given in the Appendix.
The prior placed on α is another unit-information prior and can be written as
equation M25
where
equation M26
The exact formula for Iα,i is given in the Appendix.
Kass and Wasserman (1994) and Pauler (1998) defend the use of unit-information priors, especially in the context of model selection. They show that when using such priors, the resulting Bayes factors can be well approximated with the Bayesian information criterion. Although centering unit-information priors at the maximum likelihood estimates has been suggested in Bayesian adaptive regression splines (Paciorek, 2006), the final results presented in this paper were not sensitive to the means selected for these three prior distributions.
Estimating the fixed and random effect functions now becomes a problem of estimating k, tk, k′, and tk′. We do this by sampling from the posterior distribution p(k, tk, k′, tk′|y). While this procedure is intuitively appealing, a problem arises since the likelihood p(y|k, tk, k′, tk′) cannot be calculated analytically. We consider two approximations to this likelihood in the next section.
The likelihood of interest can be expressed as
p(y | k, tk, k′, tk′) = ∫∫ p(y | k, tk, k′, tk′, ψ, σ2) p(ψ) p(σ2) dψ dσ2,    (7)
where
equation M28
(8)
xi = (x1i, x2i, …, xmi) is the vector containing the x–values corresponding to subject i, n is the total number of subjects (curves) in the study, and mi is the total number of observations for subject i.
The integral in (7) cannot be calculated analytically. We thus explore two different approximations to p(y|k, tk, k′, tk′). The first simply plugs in the maximum likelihood estimates of ψ and σ2. The second approximates (7) using a Laplace approximation.
The first approximation that we consider estimates p(y|k, tk, k′, tk′) with

p̂(y | k, tk, k′, tk′) = p(y | k, tk, k′, tk′, ψ̂(tk, tk′), σ̂2(tk, tk′)),
where ψ̂(tk, tk′) and σ̂2(tk, tk′) are the maximum likelihood estimators of ψ and σ2 corresponding to the model with fixed effect knots at tk and random effect knots at tk′. The dependence of these estimators on (tk, tk′) will be suppressed; they will simply be denoted ψ̂ and σ̂2. We refer to this likelihood approximation as the "plugged-in" approximation and denote it p̂Plugged-In(y|k, tk, k′, tk′). Although this approximation ignores the penalty from increasing the dimension of ψ, it is computationally inexpensive and is similar to the methods employed by Taplin (1993), Draper (1995), Raftery et al. (1996), and Dominici et al. (2002), who approximate marginal likelihoods by plugging in the maximum likelihood estimates of the nuisance parameters that cannot be integrated out.
An alternative, computationally more complex, approach is to integrate out ψ and σ2 using an approximation method. We use a Laplace approximation to estimate p(y|k, tk, k′, tk′). The approximate likelihood is given by
equation M30
where
equation M31
and dim(ψ) = (k′ + 3)(k′ + 2)/2. We chose the unconstrained parameterizations of Σγ and σ2 (namely ψ and log(σ2)) to make this Laplace approximation more accurate; the optimal parameterization is an area of future research. Calculating equation M32 makes the computational expense of the Laplace approximation greater than that of the "plugged-in" approximation; the matrix of second derivatives takes time to compute and can be numerically unstable when dim(ψ) is large and the total number of subjects, n, is small. We still expect, however, that the Laplace approximation will be more accurate than the "plugged-in" approximation. A major question addressed in this paper is whether the Laplace approximation's gain in accuracy is worth its computational cost.
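The generic form of a Laplace approximation is log p̂(y|·) = ℓ(θ̂) + (d/2) log 2π − ½ log|H(θ̂)|, where ℓ is the log integrand, θ̂ its mode, and H the Hessian of −ℓ at the mode. A self-contained sketch follows (our own generic helper, not the paper's code, using the finite-difference Hessian that the text identifies as the expensive and potentially unstable step):

```python
import numpy as np
from scipy.optimize import minimize

def hessian_fd(f, x, eps=1e-4):
    """Central finite-difference Hessian of f at x: the numerically
    delicate step when the dimension is large and n is small."""
    d = len(x)
    H = np.empty((d, d))
    for i in range(d):
        for j in range(d):
            ei = np.zeros(d); ei[i] = eps
            ej = np.zeros(d); ej[j] = eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps ** 2)
    return H

def log_laplace_integral(neg_log_f, theta0):
    """Laplace approximation to log ∫ f(theta) dtheta around the mode."""
    res = minimize(neg_log_f, theta0)        # theta_hat = argmin of -log f
    H = hessian_fd(neg_log_f, res.x)
    d = len(res.x)
    _, logdet = np.linalg.slogdet(H)
    return -res.fun + 0.5 * d * np.log(2 * np.pi) - 0.5 * logdet

# Sanity check: for a Gaussian integrand the approximation is exact,
# so integrating a N(0, I_2) density should give log(1) = 0.
neg_log_f = lambda th: 0.5 * th @ th + np.log(2 * np.pi)
val = log_laplace_integral(neg_log_f, np.array([1.0, 1.0]))
```

The error of the approximation shrinks as the integrand becomes more nearly Gaussian in θ, which is the motivation for the unconstrained (ψ, log σ2) parameterization above.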
The two likelihood approximations induce the following approximations to the posterior p(k, tk, k′, tk′|y). Let

p̂Plugged-In(k, tk, k′, tk′ | y) ∝ p̂Plugged-In(y | k, tk, k′, tk′) p(k) p(tk | k) p(k′) p(tk′ | k′)

and

p̂Laplace(k, tk, k′, tk′ | y) ∝ p̂Laplace(y | k, tk, k′, tk′) p(k) p(tk | k) p(k′) p(tk′ | k′).
To sample from these posterior approximations to p(k, tk, k′, tk′|y) (which we denote p̂(k, tk, k′, tk′|y)), we use reversible jump MCMC methods. Reversible jump MCMC (Green, 1995; DiMatteo et al., 2001) can be used to sample from the distribution of a random variable θ whose dimension, dim(θ), varies. In this particular case, the dimensions of the vectors tk and tk′ vary with the values of k and k′.
Rather than sampling the fixed and random effect knots together in one MCMC iteration (which results in very low acceptance rates), we sample from the posterior p̂(k, tk, k′, tk′|y) using a Gibbs-sampling-type algorithm. In the algorithm, one value of the pair (k, tk) is sampled from p̂(k, tk|k′, tk′, y) using the Metropolis-Hastings algorithm. Conditioned on this sampled pair (k, tk), a pair (k′, tk′) is sampled from p̂(k′, tk′|k, tk, y) using identical Metropolis-Hastings methods. This algorithm is given below, and will be referred to as Algorithm I.
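The alternation just described is that of a Metropolis-within-Gibbs sampler. The toy sketch below shows the two-block alternation on a fixed-dimension target; it is our illustration only — in the actual algorithm the blocks are the knot sets (k, tk) and (k′, tk′), and the moves are reversible-jump birth/death/relocation steps that change dimension:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_post(u, v):
    # toy stand-in for the log posterior of two parameter blocks
    return -0.5 * (u ** 2 + v ** 2 + u * v)

u, v, draws = 0.0, 0.0, []
for it in range(20000):
    # Step 1: Metropolis-Hastings update of block 1 given block 2
    # (analogous to sampling (k, t_k) conditioned on (k', t_k'))
    u_prop = u + rng.normal(scale=1.0)
    if np.log(rng.uniform()) < log_post(u_prop, v) - log_post(u, v):
        u = u_prop
    # Step 2: update block 2 given the freshly sampled block 1
    v_prop = v + rng.normal(scale=1.0)
    if np.log(rng.uniform()) < log_post(u, v_prop) - log_post(u, v):
        v = v_prop
    draws.append((u, v))
draws = np.array(draws)
```

Updating the two blocks separately keeps each proposal low-dimensional, which is exactly why the paper's Algorithm I achieves higher acceptance rates than a joint update would.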
The jump probabilities and details of this algorithm (including derivation of the acceptance weights) are given in the Appendix.
An alternative way of sampling from the posterior distribution of (k, tk, k′, tk′) is to sample from the joint posterior distribution which contains both the covariance parameters (ψ and σ2) and (k, tk, k′, tk′). This posterior is written as p(k, tk, k′, tk′, ψ, σ2|y). Although sampling from this posterior is expected to be less efficient than sampling from the approximations to p(k, tk, k′, tk′|y) (ψ and σ2 have to be sampled in addition to (k, tk, k′, tk′)), the posterior sample will be exact since p(y|k, tk, k′, tk′, ψ, σ2) can be calculated (see (8)). The resulting posterior sample of knots will also allow us to evaluate the accuracy of the two approximations proposed in Section 4. The accuracy of the approximations will be judged by comparing the distribution of (k, tk, k′, tk′) when sampling from p̂Laplace(k, tk, k′, tk′|y) and p̂Plugged-In(k, tk, k′, tk′|y) to the marginal distribution of (k, tk, k′, tk′) when sampling from the "true" posterior p(k, tk, k′, tk′, ψ, σ2|y). The algorithm used to sample from p(k, tk, k′, tk′, ψ, σ2|y) is given below, and will be referred to as Algorithm II.
The jump probabilities and details of this algorithm are given in the Appendix.
Note that Algorithm II is nearly identical to Algorithm I with the exception of Step 3. In Step 3 of Algorithm II, new values of ψ and σ2 are proposed in addition to new values of k′ and tk′. To be more specific, the candidates of ψ and σ2 (conditioned on the candidates of k′ and tk′) are proposed from a multivariate normal distribution with covariance equation M35. Calculating this estimated covariance matrix in Step 3 makes this algorithm's computation time per iteration comparable to that of Algorithm I when p̂(k, tk, k′, tk′|y) = p̂Laplace(k, tk, k′, tk′|y). Although the computational costs of these two algorithms are similar, we expect Algorithm II to mix more slowly and have a lower acceptance rate, as it proposes higher dimensional moves in Step 3.
To sample from the posterior p(k, tk, k′, tk′|y), three options have been proposed: (1) sample from p̂Plg(k, tk, k′, tk′|y) using Algorithm I, (2) sample from p̂Lp(k, tk, k′, tk′|y) using Algorithm I, and (3) sample exactly from p(k, tk, k′, tk′, ψ, σ2|y) using Algorithm II. We refer to options (1), (2), and (3) as PLG, LP, and EX, respectively. In this section, we theoretically justify why LP is preferred to PLG and EX.
While LP will take longer to implement than PLG (as calculating p̂Lp(k, tk, k′, tk′|y) takes longer than calculating p̂Plg(k, tk, k′, tk′|y)), it is expected that PLG will result in unnecessarily large values of k′. We anticipate this positive bias in the number of random effect knots since the plugged-in approximation to p(y|k, tk, k′, tk′) does not average over the random effect covariance parameters, ψ; it simply plugs in the maximum likelihood estimate of ψ. The tendency of PLG to estimate more random effect knots than LP becomes clear after close examination of the ratio
p(y | k, tk, k′cand, tk′cand) / p(y | k, tk, k′old, tk′old)    (9)
calculated in step 3(a) of Algorithm I. In PLG, this ratio is calculated with p̂Plugged-In in place of p, and in LP, with p̂Laplace. In general, we expect
equation M39
(10)
when equation M40. If this inequality holds, then it follows that PLG will overestimate the number of random effect knots. In Theorem 1, it is proven that when equation M41 is the true set of knots, the inequality given in (10) asymptotically holds.
Theorem 1. Let the true values of k, tk, k′ and tk′ be knew, tknew, equation M42, respectively. If the four conditions below are met, then as n (the total number of subjects) goes to ∞,
equation M43
  • equation M44.
  • mi, the number of observations recorded for individual i, is random and the random variables m1, m2, …, mn are i.i.d. for all n.
  • equation M45
  • x1, x2, …, xn are i.i.d. vectors, where xj = (xj1, xj2, …, xjmj) are the mj points in time at which subject j is observed.
The proof of this theorem is given in the Appendix.
LP is preferred not only to PLG, but also to EX. The Markov chain resulting from LP converges to its stationary distribution more quickly than that corresponding to EX because, in LP, transitions are made only in the space of the fixed and random effect knots, whereas in EX, transitions are made in the space of the fixed effect knots, the random effect knots, and the covariance parameters. The acceptance probability of the chain corresponding to LP is thus expected to be larger than that corresponding to EX.
While the arguments given in this section suggest that LP is preferred to PLG and EX, it is still of interest to see how these options vary in practice. Of particular interest is the extent of the inequality given in (10): if the bias in the number of random effect knots fitted by PLG is only marginal, it might be the preferred option, since it is the easiest approximation to compute. To compare these three options, we conducted a simulation study; this is the subject of Section 7.
The accuracies of the posterior distributions constructed using the two likelihood approximations were determined using simulated data sets. Five data sets were simulated from two models at two different sample sizes. Each model had the general form given in (5), but different values of k, tk, k′, tk′, and α. The specific values of these parameters are given in Table 1.
Table 1. True Models Simulated From
The population curve and five randomly sampled subject-specific curves corresponding to each model are given in Figure 1. The simulation experiments were performed at two different values of n (n = 25 and n = 40, where n is the number of subjects). Each subject had 20 observations, 5 of which were randomly selected (mi = 5 for all i). With such a small sample size per subject, it was not feasible to smooth observations separately for each individual as has been done in previous work (Behseta et al., 2005).
Figure 1. Plots of Models 1–2. (— —): f(x) + Gi(x); (——): f(x). The locations of the fixed effect knots are denoted with • and the locations of the random effect knots with Δ.
For each simulated data set, we sampled from p̂Plugged-In(k, tk, k′, tk′|y) and p̂Laplace(k, tk, k′, tk′|y) using Algorithm I, and from p(k, tk, k′, tk′, ψ, σ2|y) using Algorithm II. We refer to the posterior sample from p(k, tk, k′, tk′, ψ, σ2|y) as the "truth" or the "true" posterior sample. The number of iterations for each chain of p̂Plugged-In(k, tk, k′, tk′|y) and p̂Laplace(k, tk, k′, tk′|y) was at least 10,000, and for p(k, tk, k′, tk′, ψ, σ2|y) the chains were twice as long. All chains were run using R 2.5.0. At n = 25, the average time to complete 1000 iterations was 18 minutes for p̂Plugged-In(k, tk, k′, tk′|y) and 54 minutes for p̂Laplace(k, tk, k′, tk′|y) and p(k, tk, k′, tk′, ψ, σ2|y); at n = 40, it was 30 minutes for p̂Plugged-In(k, tk, k′, tk′|y) and 78 minutes for p̂Laplace(k, tk, k′, tk′|y) and p(k, tk, k′, tk′, ψ, σ2|y). Convergence to a stationary distribution did not appear to be an issue with any of the chains. The distributions of k and k′ corresponding to each approximation and to p(k, tk, k′, tk′, ψ, σ2|y) are shown in Figures 2–5. The captions of each figure also give the average acceptance rates for the plugged-in, Laplace, and "true" chains; the acceptance rate of the Laplace chain is, in each case, higher than those of the "true" and plugged-in chains. Figure 6 gives the posterior distributions of tk and tk′ for data simulated from model 1 at n = 25. The accuracy of the Laplace and plugged-in approximations shown in Figure 6 is similar to that seen for data simulated from model 1 at n = 40 and from model 2 at n = 25 and n = 40 (not shown).
Figure 2. Marginal Distributions of k and k′ for Model 1 at n = 25. Average acceptance rates: Laplace - 72%, Plugged-In - 44%, Truth - 42%.
Figure 5. Marginal Distributions of k and k′ for Model 2 at n = 40. Average acceptance rates: Laplace - 59%, Plugged-In - 51%, Truth - 38%.
Figure 6. Marginal Distributions of tk and tk′ for Model 1 at n = 25.
The marginal distributions of k′ show that the Laplace method is more accurate than the plugged-in method. In each case, the marginal distribution of k′ for the Laplace method essentially matches the "true" marginal distribution of k′, while the plugged-in estimator greatly overestimates the true number of random effect knots (see Figures 2–5). This is to be expected since ψ and σ2 were not averaged over (averaging over these parameters would penalize vectors ψ of large dimension). The marginal distributions of tk′ in Figure 6 also show that the Laplace method is more accurate than the plugged-in method. The posterior distributions of tk′ corresponding to p̂Laplace(k, tk, k′, tk′|y) and p(k, tk, k′, tk′, ψ, σ2|y) are essentially the same, while the posterior distribution of tk′ corresponding to p̂Plugged-In(k, tk, k′, tk′|y) is far from the true posterior distribution of tk′; the plugged-in method fits far too many random effect knots and thus cannot correctly identify their true locations. Thus, in terms of accuracy, the Laplace approximation is clearly preferred to the plugged-in approximation.
The simulation results also suggest that sampling from p̂Laplace(k, tk, k′, tk′|y) is preferred to sampling from the exact posterior p(k, tk, k′, tk′, ψ, σ2|y). The distributions of knots corresponding to the two posteriors are very similar and the time per iteration is essentially the same. However, sampling from the exact posterior requires that the nuisance parameters ψ and σ2 be sampled jointly with k′ and tk′ in Step 3 of Algorithm II. The acceptance rate of Algorithm II is thus less than that of Algorithm I when sampling from p̂Laplace(k, tk, k′, tk′|y). The lower acceptance rate of Algorithm II also affected the mixing of the chain. Figure 7 gives the trace plots of tk′ for the Laplace and true chains (for iterations 2000–3000) on one representative simulated data set. From the plots in Figure 7, it is clear that Algorithm I (applied to the Laplace approximation) explores the posterior distribution of tk′ more efficiently than Algorithm II does.
Figure 7. Trace plots of tk′ for the Laplace and true chains for one simulated data set (iterations 2000–3000). Note: The figures trace the positions of all elements in tk′, not just the first.
To gain further insight into how the Laplace approximation compares to the "true" posterior, we considered a real data set and sampled from p̂Laplace(k, tk, k′, tk′|y) and p(k, tk, k′, tk′, ψ, σ2|y). The data set measures the protein content of milk collected from cows on a barley diet (Diggle, 1990). Twenty-five cows were in the experiment (n = 25), and milk was collected weekly from each cow for 19 weeks. Five observations from each cow were randomly selected, and the reversible jump MCMC algorithms were run on the resulting data set. A plot of the data for all cows is given in Figure 8. Just as with the simulated data sets, 10,000 iterations of Algorithm I were used to sample from p̂Laplace(k, tk, k′, tk′|y) and 20,000 iterations of Algorithm II were used to sample from p(k, tk, k′, tk′, ψ, σ2|y). The posterior distributions of k and k′ corresponding to p̂Laplace(k, tk, k′, tk′|y) and p(k, tk, k′, tk′, ψ, σ2|y) are given in Figure 9. The posterior distribution of (k, tk, k′, tk′) corresponding to p̂Laplace(k, tk, k′, tk′|y) once again matches the marginal distribution of (k, tk, k′, tk′) obtained when sampling exactly from p(k, tk, k′, tk′, ψ, σ2|y).
Figure 8. Protein content of milk measured in 25 cows on a barley diet.
Figure 9. Marginal distributions of k and k′.
In addition to examining the two posterior distributions of knots, we also examine the two corresponding posterior distributions of one randomly selected cow's curve. Recall that the curve of one cow is the sum of the fixed curve and that cow's random effect curve. To sample one cow's curve from the chain corresponding to p̂Laplace(k, tk, k′, tk′|y), values of ψ and σ2 were sampled using an independence sampler; conditioned on these sampled values, a fixed effect curve was sampled, and conditioned on this sampled fixed effect curve, a random effect curve was sampled. Doing this requires a slight addition to Algorithm I, noted below, which slows the algorithm down only marginally.
To sample random effects from the chain corresponding to p(k, tk, k′, tk′, ψ, σ2|y), steps (2) and (3) above were simply added to Algorithm II; in other words, a fixed and a random effect curve were drawn for every random effects covariance matrix drawn. Figure 10 shows the pointwise median curve for the selected cow from each chain, along with pointwise 95% credible intervals. The estimated cow-specific curves and credible intervals for the two methods are similar.
Figure 10. Median and 95% pointwise credible intervals for the cow-specific curve. (— —): Laplace. (·········): Truth.
We have proposed a flexible approach to modeling multiple curves using free-knot splines. To identify the number and location of fixed and random effect knots in this flexible mixed-effects regression spline, sampling from p̂Laplace(k, tk, k′, tk′|y) is strongly preferred to sampling from p̂Plugged-In(k, tk, k′, tk′|y) and marginally preferred to sampling from p(k, tk, k′, tk′, ψ, σ2|y). The posterior sample corresponding to the Laplace approximation is more accurate than that of the plugged-in approximation; the disparity is most pronounced in the number of random effect knots fit to the simulated data sets, where the Laplace method penalizes models with unnecessarily large values of k′ far more appropriately than the plugged-in method does. In addition, as measured by our simulations, the posterior based on the Laplace approximation is much more efficient to sample from than, and just as accurate as, the exact posterior p(k, tk, k′, tk′, ψ, σ2|y).
We are currently investigating other approaches to the knot selection problem. Rather than sampling from the posterior p(k, tk, k′, tk′|y), we set tk′ = tk and then try to reduce the number of principal component curves associated with the random effects by sampling from the posterior p(k, tk, r|y), where r is the number of random effect principal components retained (Botts, 2005). This is similar in spirit to the methods employed by Shi et al. (1996) and James et al. (2000).
Behseta et al. (2005) propose an alternative hierarchical approach that, at the first level, fits curves separately for each subject, allowing each subject to have her own set of knots. However, their methods are only appropriate for large within-subject sample sizes.
Figure 3
Marginal distributions of k and k′ for Model 2 at n = 25. Average acceptance rates: Laplace - 65%, Plugged-In - 39%, Truth - 44%.
Figure 4
Marginal distributions of k and k′ for Model 1 at n = 40. Average acceptance rates: Laplace - 68%, Plugged-In - 46%, Truth - 39%.
Appendix
Exact formulas for Iψ, Iα, and Ilog(σ²)
The (j, k)th element of the matrix Iψ can be calculated as

Iψ(j, k) = (1/2) Σi=1,…,n tr( Σi⁻¹ (∂Σi/∂ψj) Σi⁻¹ (∂Σi/∂ψk) ),

where Σi is the marginal covariance matrix of yi.
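For intuition, the Fisher information elements for covariance parameters can be computed numerically with the standard trace identity 0.5·tr(Σ⁻¹ (∂Σ/∂ψj) Σ⁻¹ (∂Σ/∂ψk)). The two-parameter covariance model below is purely illustrative, not the paper's model:

```python
import numpy as np

# Illustrative covariance model: Sigma = psi_1 * I + psi_2 * B B'.
rng = np.random.default_rng(1)
m = 5
B = rng.normal(size=(m, 2))
psi = np.array([1.0, 0.5])

Sigma = psi[0] * np.eye(m) + psi[1] * (B @ B.T)
dSigma = [np.eye(m), B @ B.T]          # partial derivatives w.r.t. psi_1, psi_2
Sinv = np.linalg.inv(Sigma)

# (j, k)th element of I_psi via the trace identity.
I_psi = np.array([[0.5 * np.trace(Sinv @ dSigma[j] @ Sinv @ dSigma[k])
                   for k in range(2)] for j in range(2)])
```

By the cyclic property of the trace, the resulting matrix is symmetric, and it is positive definite whenever the derivative directions are linearly independent.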
RJMCMC details for Algorithm I
This section gives the details of Algorithm I, showing how a move is made from an old set of knots to a new set of knots. Note that the constants which appear in this algorithm (.4 and .2) were selected to optimize its mixing and acceptance probability. The old set of knots will be denoted tkold.
The first step of the algorithm involves a move in the space of the fixed effect knots, Ωk. Note that Ωk is the union of the spaces Ω1, Ω2, …, where Ωj = [a, b]^j; therefore Ωk can be written as Ω1 ∪ Ω2 ∪ ⋯. The first decision to be made is whether to give birth to, relocate, or kill a fixed knot. These probabilities are labeled bF, rF, and dF, respectively, and are calculated as

bF (kold) = .4 min {1, pk (kold + 1)/pk (kold)},  dF (kold) = .4 min {1, pk (kold − 1)/pk (kold)},

and rF (kold) = 1 − bF (kold) − dF (kold), where pk (kold) is the prior distribution assigned to k evaluated at kold.
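The move-type probabilities can be sketched as follows, using the c·min{1, prior ratio} construction with c = .4 (the constant quoted above); the Poisson prior on the number of knots is a hypothetical stand-in for pk:

```python
import math

def prior_k(k, lam=3.0):
    # Hypothetical Poisson prior on the number of knots (support k >= 1).
    if k < 1:
        return 0.0
    return math.exp(-lam) * lam ** k / math.factorial(k)

def move_probs(k_old, c=0.4):
    b = c * min(1.0, prior_k(k_old + 1) / prior_k(k_old))  # birth
    d = c * min(1.0, prior_k(k_old - 1) / prior_k(k_old))  # death
    return b, d, 1.0 - b - d                               # relocation gets the rest
```

Because prior_k(0) = 0, a death move has probability zero when only one knot remains, so the chain can never empty the knot set.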
If birth is chosen, a fixed effect knot, tjF, is selected at random and a new knot is given birth to. The new knot, tnewF, is sampled from the distribution h(·|tjF), the normal distribution with mean tjF and variance .2, truncated to [a, b]. If death is selected, then one of the fixed effect knots is uniformly selected and killed. When relocation is chosen, a fixed effect knot, tjF, is uniformly selected and then relocated to a position t*F, where t*F ~ h(·|tjF). The jump probabilities corresponding to each move are given in Table 2. Note: h (y|x) = ϕ (y; x, .2) (Φ (b; x, .2) − Φ (a; x, .2))^−1, where ϕ (·; μ, τ²) is the normal density with mean μ and variance τ², and Φ (·; μ, τ²) is the cumulative distribution function of a normal density with mean μ and variance τ².
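A minimal sketch of the proposal density h(y|x) from the Note, assuming an illustrative knot interval [0, 1]:

```python
import math

def norm_pdf(y, mu, var):
    # Normal density phi(y; mu, var).
    return math.exp(-((y - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def norm_cdf(y, mu, var):
    # Normal CDF Phi(y; mu, var) via the error function.
    return 0.5 * (1.0 + math.erf((y - mu) / math.sqrt(2.0 * var)))

def h(y, x, a=0.0, b=1.0, var=0.2):
    # Truncated-normal proposal density h(y | x) on [a, b].
    if not a <= y <= b:
        return 0.0
    return norm_pdf(y, x, var) / (norm_cdf(b, x, var) - norm_cdf(a, x, var))
```

The normalizing factor Φ(b; x, .2) − Φ(a; x, .2) is what makes h integrate to one over [a, b], and it is exactly the term that appears in the jump probabilities of Table 2.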
Table 2
Jump Probabilities in Ωk
A move must then be made with the random effect knots; this is the second step of the algorithm. As with the fixed effect knots, one of three moves can be made in Ωk′, the space of random effect knots: birth, death, or relocation. The probabilities associated with each are denoted bR, dR, and rR, respectively. They are calculated as

bR (k′old) = .4 min {1, pk′ (k′old + 1)/pk′ (k′old)},  dR (k′old) = .4 min {1, pk′ (k′old − 1)/pk′ (k′old)},

and rR (k′old) = 1 − bR (k′old) − dR (k′old), where pk′ (k′old) is the prior distribution of k′ evaluated at k′old. If death is selected, then one of the random effect knots is uniformly selected and killed. If birth is selected, then one of the random effect knots is uniformly selected (call this tjR), and a knot is added to the set of random effect knots (call this knot tnewR). Note that tnewR ~ h (·|tjR). If relocation is selected, then one of the random effect knots, tjR, is uniformly selected and moved to another location (call this t*R). Again, t*R ~ h (·|tjR). The jump probabilities associated with these moves are given in Table 3.
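One move among the random effect knots can be sketched as follows; `sample_h` is a hypothetical stand-in for a draw from the truncated-normal proposal h(·|x) described above:

```python
import random

def propose_move(knots, b, d, sample_h):
    # Choose birth, death, or relocation with probabilities (b, d, 1 - b - d)
    # and propose a new knot configuration accordingly.
    u = random.random()
    new = list(knots)
    if u < b:                                   # birth anchored at a random knot
        new.append(sample_h(random.choice(knots)))
        move = "birth"
    elif u < b + d and len(knots) > 1:          # death of a uniformly chosen knot
        new.remove(random.choice(knots))
        move = "death"
    else:                                       # relocation of a uniformly chosen knot
        old = random.choice(knots)
        new.remove(old)
        new.append(sample_h(old))
        move = "relocation"
    return sorted(new), move
```

Each proposed configuration is then accepted or rejected with the Metropolis-Hastings probabilities whose detailed-balance derivation follows.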
Table 3
Jump Probabilities in Ωk′
Derivation of Metropolis-Hastings Acceptance Probabilities to Satisfy Detailed Balance of Algorithm I
We show that the acceptance probabilities given in Section 4 are such that detailed balance is satisfied. In this case, it needs to be shown that
p (kold, tkold, k′old, tk′old | y) q (knew, tknew, k′new, tk′new | kold, tkold, k′old, tk′old) α (old → new) = p (knew, tknew, k′new, tk′new | y) q (kold, tkold, k′old, tk′old | knew, tknew, k′new, tk′new) α (new → old).
(Note: throughout the rest of this proof, tknew will be denoted as tkn, kold will be denoted as ko, and so on). Assume that both a fixed and random knot are being added. In this case,
equation M108
Assume that the ratios within each min argument are < 1 (if this is not the case, the reciprocal of the ratios will be less than 1, and the proof can be done in the other direction), and recall that
  • equation M109
  • equation M110
  • equation M111
  • equation M112
With these four observations it is clear that the above expression reduces to
equation M113
This argument shows detailed balance holds for the given acceptance ratios when a fixed and random effect knot are added. Similar arguments can be made to prove detailed balance when a fixed and/or random effect knot is deleted and/or relocated.
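The detailed balance identity can also be checked numerically. This sketch uses a generic three-state Metropolis-Hastings sampler (an illustration, not the knot sampler itself), verifying that the probability flow x → y equals the flow y → x for every pair of states:

```python
# Toy target distribution and proposal kernel on states {0, 1, 2}.
pi = [0.2, 0.5, 0.3]                    # target distribution
q = [[0.0, 0.7, 0.3],                   # proposal kernel q[x][y]
     [0.4, 0.0, 0.6],
     [0.5, 0.5, 0.0]]

def alpha(x, y):
    # Metropolis-Hastings acceptance: min{1, pi(y) q(y -> x) / (pi(x) q(x -> y))}.
    return min(1.0, (pi[y] * q[y][x]) / (pi[x] * q[x][y]))

# Each pair (flow x -> y, flow y -> x) should match under detailed balance.
flows = [(pi[x] * q[x][y] * alpha(x, y), pi[y] * q[y][x] * alpha(y, x))
         for x in range(3) for y in range(3) if x != y]
```

The same "one side of the min is < 1, the other side equals 1" cancellation used in the proof above is what makes the two flows agree here.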
RJMCMC details for Algorithm II
As mentioned in the main text of this paper, Algorithm II is only a slight variation of Algorithm I. In Algorithm II, candidate values of (k, tk) are sampled just as they are in Algorithm I. Conditioned on each sampled pair of knots, candidate values of k′, tk′, ψ, and σ² are sampled. The candidate covariance parameters are denoted ψnew and σ²new, and they are sampled (with σ² on the log scale) from the multivariate normal distribution

(ψnew, log(σ²new)) ~ MVN3 ((ψ̂, log(σ̂²)), Î⁻¹),

where Î is the Fisher information of (ψ, log(σ²)) evaluated at the maximum likelihood estimates.
Note that ψ̂ and σ̂² are the maximum likelihood estimates of ψ and σ² corresponding to the model with fixed effect knots at tkold and random effect knots at tk′old.
The jump probabilities corresponding to step 2 of Algorithm II are the same as those in step 2 of Algorithm I. The jump probabilities corresponding to step 3 of Algorithm II are given in Table 4. Note: MVN3(z, w) is the multivariate normal density with mean (ψ̂, log(σ̂²)) and covariance equal to the inverse of the Fisher information, evaluated at the vector (z, w). In this case, ψ̂ and σ̂² are the maximum likelihood estimates calculated under the model with fixed effect knots at tkold and random effect knots at tk′old.
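The step-3 independence proposal can be sketched as follows. The MLE vector and Fisher information matrix here are made-up illustrations, and taking the proposal covariance to be the inverse information is an assumption consistent with the appendix formulas:

```python
import numpy as np

rng = np.random.default_rng(2)
mle = np.array([1.2, -0.4, 0.1])        # illustrative (psi_1, psi_2, log sigma^2) MLE
info = np.array([[4.0, 0.5, 0.0],
                 [0.5, 3.0, 0.0],
                 [0.0, 0.0, 6.0]])      # illustrative Fisher information
cov = np.linalg.inv(info)

# Independence-sampler draw centred at the MLE.
proposal = rng.multivariate_normal(mle, cov)

def log_mvn_density(z, mean, cov):
    # Log MVN density, as needed in the jump-probability ratio of Table 4.
    diff = z - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (len(mean) * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(cov, diff))
```

Because the proposal does not depend on the current state, only the proposal density at the old and new points enters the acceptance ratio.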
Table 4
Jump Probabilities For Step 3 in Algorithm II
It can be shown, using methods similar to those given before, that these jump probabilities (along with the acceptance probabilities of Algorithm II given in Section 5) guarantee detailed balance.
Proof of Theorem 1
We first prove that the maximum likelihood estimators of α, Σγ, and σ² are consistent in n, the number of subjects in the study. To be more specific, we prove that α̂, ψ̂, and σ̂² converge to their true values as n goes to infinity. Proving the consistency of these estimators is not straightforward, since the observed data for all n subjects, y1, y2, …, yn, are not i.i.d. (recall that they are independent yet not identically distributed, since yi ~ N (BFi α, BRi Σγ B′Ri + σ² Imi) and the dimensions mi vary across subjects).
Hooper (1993) and Demidenko (2004) show, however, that if one considers the design matrices of the ith individual (BFi, BRi) and the number of observations for the ith individual (mi) as random, independent of εi and γi, and part of the observed data, then the i.i.d. assumption is satisfied. For the problem presented in this paper, such considerations make sense: the design matrices corresponding to individual i, BFi and BRi, can be thought of as random since the points in time at which the mi observations occur, xi, vary. If the numbers of observations recorded for all individuals, m1, m2, …, mn, are i.i.d. (as we assume in Condition (2)), and the times at which these observations occur, x1, x2, …, xn, are i.i.d. (as we assume in Condition (4)), then the sets {yi, BFi, BRi, mi}, i = 1, …, n, are i.i.d. It follows that the maximum likelihood estimators of α, ψ, and σ² converge to their true values as n, the number of subjects, approaches ∞.
We now arrive at the result in Theorem 1 by examining, as n → ∞, the behavior of the log ratio of the marginal likelihoods under the two sets of random effect knots. Recall that tk′(1) ⊂ tk′(2), making the model with knots at tk′(1) nested within the model with knots at tk′(2).
equation M125
where equation M126, with
equation M127
From this, we get that
equation M128
where
equation M129
Gelfand and Dey (1994) show that if the models are nested (which is guaranteed by Condition (1)) and if the maximum likelihood estimators exist, then Kn = O(1). Note that existence of the maximum likelihood estimators follows from Condition (3). Demidenko (2004) shows that if equation M130, then the maximum likelihood estimators of α, ψ, and σ2 exist. Since Kn is bounded, for sufficiently large values of n, Kn becomes negligible compared to equation M131, and we can say that
equation M132
and it follows that
equation M133
Contributor Information
Carsten H. Botts, Department of Mathematics and Statistics, Williams College, Williamstown, Massachusetts 01267, USA, cbotts@williams.edu.
Michael J. Daniels, Department of Epidemiology and Biostatistics, and Department of Statistics, University of Florida, Gainesville, Florida 32611, USA, mdaniels@stat.ufl.edu.
  • de Boor C. A Practical Guide to Splines. New York: Springer-Verlag; 1978.
  • Behseta S, Kass RE, Wallstrom GL. Hierarchical models for assessing variability among functions. Biometrika. 2005;92:419–434.
  • Bigelow JL, Dunson DB. Bayesian adaptive regression splines for hierarchical data. Biometrics. 2007, in press.
  • Botts C. Ph.D. Thesis. Iowa State University; 2005. Bayesian Methods in Single and Multiple Curve Fitting.
  • Brumback BA, Rice J. Smoothing spline models for the analysis of nested and crossed samples of curves. Journal of the American Statistical Association. 1998;93:961–976.
  • Crainiceanu C, Ruppert D, Wand MP. Bayesian analysis for penalized spline regression using WinBUGS. Journal of Statistical Software. 2005;14 Article 14.
  • Demidenko E. Mixed Models: Theory and Applications. Hoboken, New Jersey: John Wiley & Sons, Inc.; 2004.
  • Denison DGT, Mallick BK, Smith AFM. Automatic Bayesian curve fitting. Journal of the Royal Statistical Society, Series B. 1998;60:333–350.
  • Diggle PJ. Time Series: A Biostatistical Introduction. Oxford: Oxford University Press; 1990.
  • DiMatteo I, Genovese CR, Kass RE. Bayesian curve-fitting with free-knot splines. Biometrika. 2001;88:1055–1071.
  • Dominici F, Daniels M, Zeger SL, Samet JM. Air pollution and mortality: Estimating regional and national dose-response relationships. Journal of the American Statistical Association. 2002;97:100–111.
  • Friedman JH. Multivariate adaptive regression splines. The Annals of Statistics. 1991;19:1–141.
  • Gelfand AE, Dey DK. Bayesian model choice: Asymptotics and exact calculations. Journal of the Royal Statistical Society, Series B. 1994;56:501–514.
  • Green PJ. Reversible jump Markov Chain Monte Carlo computation and Bayesian model determination. Biometrika. 1995;82:711–732.
  • Halpern EF. Bayesian spline regression when the number of knots is unknown. Journal of the Royal Statistical Society, Series B. 1973;35:347–360.
  • Hastie TJ, Tibshirani RJ. Generalized Additive Models. Boca Raton, Florida: Chapman & Hall; 1990.
  • Hooper PM. Iterative weighted least squares estimation in heteroscedastic linear models. Journal of the American Statistical Association. 1993;88:179–184.
  • James GM, Hastie TJ, Sugar CA. Principal component models for sparse functional data. Biometrika. 2000;87:587–602.
  • Kass RE, Wasserman L. A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association. 1995;90:928–934.
  • Lindstrom MJ. Penalized estimation of free-knot splines. Journal of Computational and Graphical Statistics. 1999;8:333–352.
  • Paciorek CJ. Misinformation in the conjugate prior for the linear model with implications for free-knot spline modeling. Bayesian Analysis. 2006;1:375–383.
  • Pauler DK. The Schwarz criterion and related methods for normal linear models. Biometrika. 1998;85:13–27.
  • Pourahmadi M. Joint mean-covariance models with applications to longitudinal data: Unconstrained parameterization. Biometrika. 1999;86:677–690.
  • Raftery AE, Madigan D, Volinsky CT. Accounting for model uncertainty in survival analysis improves predictive performance (with discussion) In: Bernardo J, Berger J, Dawid A, Smith A, editors. Bayesian Statistics 5. Oxford University Press; 1996. pp. 323–349.
  • Rice J, Wu C. Nonparametric mixed effects models for unequally sampled noisy curves. Biometrics. 2001;57:253–259.
  • Shi M, Weiss RE, Taylor JMG. An analysis of pediatric CD4 counts for AIDS patients using flexible random curves. Applied Statistics. 1996;45:151–163.
  • Stone CJ, Hansen M, Kooperberg C, Truong YK. Polynomial splines and their tensor products in extended linear modeling. The Annals of Statistics. 1997;25:1371–1470.
  • Taplin RH. Robust likelihood calculation for time series. Journal of the Royal Statistical Society, Series B. 1993;55:829–836.
  • Zhou S, Shen X. Spatially adaptive regression splines and accurate knot selection schemes. Journal of the American Statistical Association. 2001;96:247–259.