|Home | About | Journals | Submit | Contact Us | Français|
We model sparse functional data from multiple subjects with a mixed-effects regression spline. In this model, the expected values for any subject (conditioned on the random effects) can be written as the sum of a population curve and a subject-specific deviate from this population curve. The population curve and the subject-specific deviates are both modeled as free-knot b-splines with k and k′ knots located at tk and tk′, respectively. To identify the number and location of the “free” knots, we sample from the posterior p (k, tk, k′, tk′|y) using reversible jump MCMC methods. Sampling from this posterior distribution is complicated, however, by the flexibility we allow for the model’s covariance structure. No restrictions (other than positive definiteness) are placed on the covariance parameters ψ and σ2 and, as a result, no analytical form for the likelihood p (y|k, tk, k′, tk′) exists. In this paper, we consider two approximations to p(y|k, tk, k′, tk′) and then sample from the corresponding approximations to p(k, tk, k′, tk′|y). We also sample from p(k, tk, k′, tk′, ψ, σ2|y) which has a likelihood that is available in closed form. While sampling from this larger posterior is less efficient, the resulting marginal distribution of knots is exact and allows us to evaluate the accuracy of each approximation. We then consider a real data set and explore the difference between p(k, tk, k′, tk′, ψ, σ2|y) and the more accurate approximation to p(k, tk, k′, tk′|y).
Social and physical scientists are often interested in how certain variables depend on one another. They often assume that a functional relationship exists between these two variables, and they run experiments or collect samples to learn about this relationship. They are aware, however, that the data does not follow a deterministic equation, but follows the general stochastic equation
where Y is a dependent variable, x is an independent or controlled variable, and ε is random error with mean 0. One of the simplest models relating a response, Y, to a single predictor, x, is the pth–order polynomial regression model
While this model is appealing and can be very useful, it may be inadequate since it assumes that only one polynomial describes the average relationship between Y and x. A more complex model would allow several polynomials to do this. An example of such a model is a piecewise polynomial regression spline. A piecewise polynomial regression spline with k knots divides the domain of f(x) into k + 1 mutually exclusive regions, and to each region corresponds a unique polynomial describing the average relationship between Y and x. With different polynomials describing different parts of the function, f(x) is not restricted to have the same smoothness throughout its domain. A piecewise polynomial regression spline of order p with k knots at tk = (t1, t2, …, tk) is written as
where β = (β0, β1, …, βp+k) is a fixed set of parameters and .
A natural polynomial regression spline takes a form nearly identical to the piecewise regression spline seen above. It is different, however, in that it restricts the function to be linear at the boundaries. This sets the last term on the right-hand side of (2) to 0. A natural polynomial regression spline of order p thus takes the form
If f (x) were modeled as such, (1) could then be re-written as
where and β = (β0, β1, β2, …, β1+k), and for m observations collected at (x1, x2, …, xm), it would be written as
The natural polynomial regression spline written in Equation (3) illustrates the concept behind splines, but the design matrix composed of the row vectors c (x1, tk), …, c (xm, tk) is unstable and thus rarely used. B-splines, which are computationally more stable, are preferred (Hastie and Tibshirani, 1990; Zhou and Shen, 2001). The functional form of a b-spline is more complex than that of the natural polynomial regression spline, but is readily available (deBoor, 1978). We write it as
where b(x, tk) is the design vector associated with a b-spline evaluated at x with k knots at tk = (t1, t2, …, tk), and α is a fixed set of parameters. For m observations collected at (x1, …, xm), the model would be written as
With this model, estimating f(x) becomes a problem of estimating the number of knots, k, the locations of those knots, tk, and α.
A variety of methods have been developed in estimating k and tk. Halpern (1973) approached this problem using Bayesian methods. Allowing knots to only be placed at the design points in the experiment, he considered all of the subsets of the design points. Halpern assigned prior probabilities to all of these subsets, and calculated the corresponding posterior probabilities of these subsets. His estimator of the function was based on these posterior probabilities. Denison, Mallick, and Smith (1998) placed priors on the number of knots, k, and their locations, tk. With these priors, they calculated a joint posterior distribution which included the variables k and tk and then sampled from this posterior distribution using reversible jump MCMC methods. They too restricted the knots to be located only at the design points of the experiment. DiMatteo, Genovese, and Kass (2001) proposed a method similar to that of Denison, et al. They did not restrict the knots to be located only at the design points of the experiment and they penalized models with unnecessarily large values of k by averaging over the fixed effect coefficients, α.
Friedman (1991) and Stone et al. (1997) try to solve the problem of knot selection using backward and/or forward knot selection. This process continues until the “best” model has been identified. Zhou and Shen (2001) used an alternative method in identifying the locations of the knots. They constructed an algorithm which favored adding knots in locations where several knots had already been added. Lindstrom (1999) used similar methods when selecting knot locations.
Splines have also been used to model curves that vary within and between subjects (Shi et al., 1996, Rice and Wu, 2001, Behseta et al., 2005, Crainiceanu et al., 2005). Brumback and Rice (1998) use splines to model nested functions which vary between subjects and between groups of subjects. Functional models which only account for variation between and within subjects take the general form
where Yi (xj) is the observation of the ith individual at xj, f (x) can be thought of as a population curve, and Gi (x) is a random curve specific to subject i. While these functions can be modeled in a variety of ways, we model these two functions as free-knot b-splines with k and k′ knots located at tk and tk′, respectively. Setting f(x) = b(x; tk)α and Gi(x) = b(x; tk′)γi, where γi are independent random vectors such that γi ~ N(0, Σγ), equation (4) becomes
Bigelow and Dunson (2007) consider a variation of this model and identify the location of the knots by sampling from the full posterior distribution. They lessen the computational burden of this sampling procedure by setting tk′ = tk and restricting Σγ to be diagonal. In this paper, we consider a more flexible model and explore the computational issues associated with it. To be more specific, we eliminate the restrictions imposed by Bigelow and Dunson and then try to identify the number and location of the knots by sampling from p(k, tk, k′, tk′|y). This posterior can not be sampled from exactly, however, because eliminating the diagonal restriction on Σγ makes the likelihood p (y|k, tk, k′, tk′) intractable. We thus sample from two approximations to this posterior; each approximation corresponds to a different approximation to the likelihood p (y|k, tk, k′, tk′). To assess the accuracy of these approximations, we compare the posterior distribution of knots in each case to the marginal distribution of knots when sampling from p(k, tk, k′, tk′, ψ, σ2|y), where ψ is a set of covariance parameters and σ2 is the within-subject variability. The marginal distribution of knots when sampling from this higher-dimensional posterior is exact as its corresponding likelihood is available in closed form. Sampling from this larger posterior, however, is less efficient because of the additional parameters that need to be sampled.
In Section 2, we discuss the linear model given in (5) in more detail. In Section 3, the likelihood of interest, p(y|k, tk, k′, tk′), is discussed as are the two methods we use to approximate this likelihood. Section 4 shows the algorithm used to sample from the resulting approximate posterior distributions, and Section 5 shows the algorithm we use to sample from p(k, tk, k′, tk′, ψ, σ2|y). Section 6 theoretically compares the approximations, and in Section 7, we describe a simulation study performed to explore the differences in these approximations. In Section 8, we apply the preferred approximation to a real data set.
Let Yi (xj) be the observed value of the ith curve (i = 1, …, n) at xj. We specify the following mixed model for functional data,
where b(xj, tq) is the design vector of a b-spline with q knots at tq, k is the number of fixed effect knots at locations tk, k′ is the number of random effect knots at tk′, α is a fixed set of parameters with dim (α) = k+2, γi are independent and identically distributed random vectors such that γi ~ N (0, Σγ) and dim (γi) = k′+2, εij are independent random variables with εij ~ N(0, σ2) for all pairs (i, j), and the random vectors γi are independent of all random variables εij. The appealing feature of this model is its flexibility; recall that no restriction is being placed on Σγ (other than |Σγ| > 0), and that the random and fixed effect knots are not set equal to one another. We do assume, however, that all individuals have the same random effect knots, and that k′ ≤ k. When the number of observations per subject is small, such assumptions are needed, as they limit the number of complicated models proposed in the Markov chain we consider.
We place prior distributions on k, k′, tk, tk′, α, Σγ, and σ2. Poisson priors are given to k and k′. Additionally, since there is no reason to favor knots at any particular location on the domain of f and G, flat priors are placed on both tk and tk′. These priors are written as
where tq(j) is the jth smallest knot in the vector tq, and (a, b) is the domain of f and G.
For Σγ and σ2, we place prior distributions on transformed values of these parameters. The transformations we consider create a set of unconstrained parameters which are preferable to work with since their sampling distributions can be well approximated with the normal density. For Σγ, we applied the modified Cholesky parameterization proposed by Pourhamadi (1999). Pourhamadi decomposes the inverse of the covariance matrix Σγ as , where
These parameters can be calculated directly from the elements of the covariance matrix using the functions g (·;·) and h (·;·), where for 2 ≤ j ≤ p , ϕj = (ϕj,1, ϕj,2, …, ϕj,j−1)′ = h (Σγ;j) = , Σγ,j is the minor matrix within Σγ composed of its first j − 1 columns and rows, and σj is the vector within Σγ composed of the first j – 1 elements of the jth column. To ease our notational burden, we will refer to these modified Cholesky parameters collectively as ψ, where , and we will replace the symbol Σγ with ψ (or, to illustrate that Σγ is a function of ψ, with Σγ(ψ)). All of the elements in the vector ψ are unconstrained, and a multivariate normal unit-information prior is assigned to them. The mean of the unit-information prior on ψ is the maximum likelihood estimate of the parameters, , and the variance is inversely proportional to approximately one unit of information on ψ. The prior can be written as
p(y|k, tk, k′, tk′, α, ψ, σ2) is the distribution of all the observed data in the study, n is the number of subjects in the study, and , , and are the maximum likelihood estimates of α, ψ, and σ, respectively. The exact formula for Iψ is given in the Appendix.
For σ2, a unit-information prior was placed on log (σ2). The prior can be written as
and mi is the number of observations for subject i. The exact formula for Ilog(σ2), i is given in the Appendix.
The prior placed on α is another unit-information prior and can be written as
The exact formula for Iα,i is given in the Appendix.
Kass and Wasserman (1994) and Pauler (1998) defend the use of unit-information priors, especially in the context of model selection. They show that when using such priors, the resulting Bayes factors can be well approximated with the Bayesian information criterion. Although centering unit-information priors at the maximum likelihood estimates has been suggested in Bayesian adaptive regression splines (Paciorek, 2006), the final results presented in this paper were not sensitive to the means selected for these three prior distributions.
Estimating the fixed and random effect functions now becomes a problem in estimating k, tk, k′, and tk′. We do this by sampling from the posterior distribution p (k, tk, k′, tk′|y). While this procedure is intuitively appealing, a problem arises since the likelihood p (y|k, tk, k′, tk′) can not be calculated analytically. We consider two approximations to this likelihood in the next section.
The likelihood of interest can be expressed as
xi = (x1i, x2i, …, xmi) is the vector containing the x–values corresponding to subject i, n is the total number of subjects (curves) in the study, and mi is the total number of observations for subject i.
The integral in (7) can not be calculated analytically. We thus explore two different approximations to p (y|k, tk, k′, tk′). The first simply plugs in the maximum likelihood estimate of ψ and σ2. The second approximates (7) using a Laplace approximation.
The first approximation that we consider estimates p (y|k, tk, k′, tk′) with
where (tk, tk′) and 2 (tk, tk′) are the maximum likelihood estimators of ψ and σ2 corresponding to the model with fixed effect knots at tk and random effect knots at tk′. The dependence of these maximum likelihood estimators on (tk, tk′) will be suppressed, and the estimators will now simply be denoted as and 2. We refer to this likelihood approximation as the “plugged-in” approximation and denote it as Plugged In (y|k, tk, k′, tk′). Although this approximation ignores the penalty from increasing the dimension of ψ, it is computationally inexpensive and is similar to the methods employed by Taplin (1993), Draper (1995), Raftery et al. (1996), and Dominici et al. (2002). They approximate marginal likelihoods by plugging in the maximum likelihood estimates of the nuisance parameters that can not be integrated out.
An alternative, but computationally more complex approach, is to integrate out ψ and σ2 using an approximation method. We use a Laplace approximation to estimate p (y|k, tk, k′, tk′). The approximate likelihood is given by
and dim(ψ) = (k′ + 3)(k′ + 2)/2. We chose the unconstrained parameterization for Σγ, ψ, and σ2, log(σ2), to make this Laplace approximation more accurate. The optimal parameterization is an area of future research. Calculating makes the computational expense of the Laplace approximation greater than that of the “plugged-in” approximation; the matrix of second derivatives takes time to compute and can be numerically unstable when dim(ψ) is large and the total number of subjects, n, is small. We still expect, however, that the Laplace approximation will be more accurate than the “plugged-in” approximation. A major question of interest addressed in this paper is whether the Laplace’s gain in accuracy is worth its computational cost.
The two likelihood approximations induce the following approximations to the posterior p(k, tk, k′, tk′|y). Let
To sample from these posterior approximations to p(k, tk, k′, tk′|y) (which we denote as (k, tk, k′, tk′|y)), we use reversible jump MCMC methods. Reversible jump MCMC methods (Green, 1995; DiMatteo et al., 2001) can be used when sampling from a distribution of a random variable θ and dim (θ). In this particular case, the dimensions of the vectors tk and tk′ vary with the values of k and k′.
Rather than sampling the fixed and random effect knots together in one MCMC iteration (which results in very low acceptance rates), we sample from the posterior (k, tk, k′, tk′|y) using a Gibbs Sampling type algorithm. In the algorithm, one value of the pair (k, tk) is sampled from (k, tk|k′, tk′, y) using the Metropolis-Hastings algorithm. Conditioned on this sampled pair of (k, tk), a pair (k′, tk′) is sampled from (k′, tk′|k, tk, y) using identical Metropolis-Hastings methods. This algorithm is given below, and will be referred to as Algorithm I.
The jump probabilities and details of this algorithm (including derivation of the acceptance weights) are given in the Appendix.
An alternative way of sampling from the posterior distribution of (k, tk, k′, tk′) is to sample from the joint posterior distribution which contains both the covariance parameters, (ψ and σ2), and (k, tk, k′, tk′). This posterior is written as p(k, tk, k′, tk′, ψ, σ2|y). Although sampling from this posterior is expected to be less efficient than sampling from the approximations to p(k, tk, k′, tk′|y) (ψ and σ2 have to be sampled in addition to (k, tk, k′, tk′)), the posterior sample will be exact since p(y|k, tk, k′, tk′, ψ, σ2) can be calculated (see (8)). The resulting posterior sample of knots will also allow us to evaluate the accuracy of the two approximations proposed in Section 4. The accuracy of the approximations will be judged by comparing the distribution of (k, tk, k′, tk′) when sampling from Laplace(k, tk, k′, tk′|y) and Plugged–In(k, tk, k′, tk′|y) to the marginal distribution of (k, tk, k′, tk′) when sampling from the “true” posterior p(k, tk, k′, tk′, ψ, σ2|y). The algorithm used to sample from p(k, tk, k′, tk′, ψ, σ2|y) is given below, and will be referred to as Algorithm II.
The jump probabilities and details of this algorithm are given in the Appendix.
Note that Algorithm II is nearly identical to Algorithm I with the exception of Step 3. In Step 3 of Algorithm II, new values of ψ and σ2 are proposed in addition to new values of k′ and tk′. To be more specific, the candidates of ψ and σ2 (conditioned on the candidates of k′ and tk′) are proposed from a multivariate normal distribution with covariance equaling . Calculating this estimated covariance matrix in Step 3 makes this algorithm’s computation time per iteration comparable to that of Algorithm I when (k, tk, k′, tk′| y) = Laplace (k, tk, k′, tk′|y). Although the computational cost of these two algorithms are similar, we expect Algorithm II to mix slower and have a lower acceptance rate, as it proposes higher dimensional moves in Step 3.
To sample from the posterior p(k, tk, k′, tk′|y), three options have been proposed: (1) sample from Plg (k, tk, k′, tk′|y) using Algorithm I, (2) sample from Lp (k, tk, k′, tk′|y) using Algorithm I, and (3) exactly sample from p(k, tk, k′, tk′, ψ, σ2|y) using Algorithm II. We refer to options (1), (2), and (3) as PLG, LP, and EX, respectively. In this section, we theoretically justify why LP is preferred to PLG and EX.
While LP will take longer to implement than PLG (as calculating Lp (k, tk, k′, tk′|y) will take longer than calculating Plg(k, tk, k′, tk′|y)), it is expected that PLG will result in unnecessarily large values of k′. We anticipate this positive bias in the number of random effect knots since the plugged-in approximation to p(y|k, tk, k′, tk′) does not average over the random effect covariance parameters, ψ; it simply plugs in the maximum likelihood estimate of ψ. The tendency of PLG to estimate more random effect knots than LP becomes clear after close examination of the ratio
calculated in step 3(a) of Algorithm I. In PLG, this ratio is calculated as , and in LP, this ratio is calculated as . In general, we expect
when . If this inequality holds, then it does follow that PLG will over estimate the number of random effect knots. In Theorem 1, it is proven that when is the true set of knots, the inequality given in (10) asymptotically holds.
Theorem 1. Let the true values of k, tk, k′ and tk′ be knew, tknew, , respectively. If the four conditions below are met, then as n (the total number of subjects) goes to ∞,
The proof of this theorem is given in the Appendix.
LP is not only preferred to PLG, but also to EX. The Markov chain resulting from LP converges to the stationary distribution more quickly than that corresponding to EX. This is the case because, in LP, transitions are only made in the space of the fixed and random effect knots. In EX, transitions are made in the space of the fixed knots, the random effect knots, and the covariance parameters. The acceptance probability of the chain corresponding to LP is thus expected to be larger than that corresponding to EX.
While the arguments given in this section suggest that LP is preferred to PLG and EX, it is still of interest to see how these options vary in practice. Of particular interest is to see the extent of the inequality given in (10). If the bias in the number of random effect knots fitted by PLG is only marginal, it might be the preferred option (since it is the easiest approximation to compute). To compare these three options, we conducted a simulation study. This is the subject of Section 7
The accuracies of the posterior distributions constructed using the two likelihood approximations were determined using simulated data sets. Five data sets were simulated from two models at two different sample sizes. Each model had the general form given in (5), but different values of k, tk, k′, tk′ and α. The specific values of these parameters are given in Table (1).
The population curve and five randomly sampled subject specific curves corresponding to each model are given in Figure 1. The simulation experiments were performed at two different values of n (n = 25 and n = 40, where n is the number of subjects). The subjects in all simulations each had 20 observations, 5 of which were randomly selected (mi = 5 for all i). With such a small sample size per subject, it was not feasible to smooth observations separately for each individual as has been done in previous work (Behseta et al., 2005).
For each simulated data set, we sampled from Plugged In (k, tk, k′, tk′| y) and Laplace(k, tk, k′, tk′|y) using Algorithm I, and p(k, tk, k′, tk′, ψ, σ2|y) using Algorithm II. We refer to the posterior sample from p(k, tk, k′, tk′, ψ, σ2|y) as the “truth” or the “true” posterior sample. The number of iterations for each chain of Plugged In(k, tk, k′, tk′|y) and Laplace(k, tk, k′, tk′|y) was at least 10,000, and for p(k, tk, k′, tk′, ψ, σ2|y) the chains were twice as long. All chains were run using R 2.5.0, and the average time it took to complete 1000 iterations at n = 25 was 18 minutes for Plugged In(k, tk, k′, tk′|y), and 54 minutes for Laplace(k, tk, k′, tk′|y) and p(k, tk, k′, tk′, ψ, σ2|y). At n = 40 the average time it took to complete 1000 iterations was 30 minutes for Plugged In(k, tk, k′, tk′| y), and 78 minutes for Laplace(k, tk, k′, tk′|y) and p(k, tk, k′, tk′, ψ, σ2|y). Convergence to a stationary distribution did not seem to be an issue with any of the chains run. The distributions of k and k′ corresponding to each approximation and p(k, tk, k′, tk′, ψ, σ2|y) are shown in Figures (2) – (5). The captions of each figure also give the average acceptance rates for the plugged-in, Laplace, and “true” chains. The acceptance rate of the Laplace chains are, in each case, higher than those for the “true” or plugged-in chains. Figure 6 gives the posterior distributions of tk and tk′ for data simulated from model 1 at n = 25. The accuracy of the Laplace and plugged-in as shown in Figure 6 is similar to the accuracy seen in data simulated from model 1 at n = 25 and from data simulated from model 2 at n = 25 and n = 40 (not shown).
The marginal distributions of k′ show that the Laplace method is more accurate than the plugged-in method. In each case, the marginal distributions of k′ for the Laplace method essentially matches the “true” marginal distributions of k′. The plugged-in estimator greatly overestimates the true number of random effect knots (see Figures 2 – 5). This is to be expected since both ψ and σ2 were not averaged over (averaging over the parameters would penalize vectors of ψ with large dimensions). The marginal distributions of tk′ in Figure 6 also show that the Laplace method is more accurate than the plugged-in method. The posterior distributions of tk′ corresponding to Laplace(k, tk, k′, tk′|y) and p(k, tk, k′, tk′, ψ, σ2|y) are essentially the same. The posterior distribution of tk′ corresponding to Plugged–In(k, tk, k′, tk′|y) is far from the true posterior distribution of tk′; the plugged-in method fits far too many random effect knots and thus can not correctly identify their true locations. Thus, in terms of accuracy, the Laplace approximation is definitely preferred to the plugged-in approximation.
The simulation results also suggest that sampling from Laplace(k, tk, k′, tk′|y) is preferred to sampling from the exact posterior p(k, tk, k′, tk′, ψ, σ2|y). The distribution of knots corresponding to both posteriors are very similar and the time per iteration is essentially the same. However, sampling from the exact posterior requires that the nuisance parameters ψ and σ2 be sampled jointly with k′ and tk′ in Step 3 of Algorithm II. The acceptance rate of Algorithm II is thus less than that of Algorithm I when sampling from Laplace(k, tk, k′, tk′|y). The lower acceptance rate of Algorithm II also affected the mixing of the chain. Figure 7 gives the trace plots of tk′ for the Laplace and true chains (for iterations 2000–3000) on one (representative) simulated data set. From the plots in Figure 7, it is clear that Algorithm I (applied to the Laplace approximation) explores the posterior distribution of tk′ more efficiently than Algorithm II does.
To gain further insight into how the Laplace approximation compares to the “true” posterior, we considered a real data set and sampled from Laplace(k, tk, k′, tk′|y) and p(k, tk, k′, tk′, ψ, σ2|y). The data set measures the protein content of milk collected from cows who were on a barley diet (Diggle, 1990). Twenty-five cows were in the experiment (n = 25), and milk was collected weekly from each cow for 19 weeks. Five observations from each cow were randomly selected and the reversible jump MCMC algorithms were run on the resulting data set. A plot of the data for all cows is given in Figure 8. Just as with the simulated data sets, 10, 000 iterations of Algorithm I were used to sample from Laplace(k, tk, k′, tk′|y) and 20,000 iterations of Algorithm II were used to sample from p(k, tk, k′, tk′, ψ, σ2|y). The posterior distributions of k and k′ corresponding to the Laplace(k, tk, k′, tk′|y) and p(k, tk, k′, tk′, ψ, σ2|y) are given in Figure 9. The posterior distributions of (k, tk, k′, tk′) corresponding to Laplace(k, tk, k′, tk′|y) once again matches the marginal distribution of (k, tk, k′, tk′) when exactly sampling from p(k, tk, k′, tk′, ψ, σ2|y).
In addition to examining the two posterior distributions of knots, we also examine the two corresponding posterior distributions of one randomly selected cow’s curve. Recall that the curve of one cow is the sum of the fixed curve and that cow’s random effect curve. To sample one cow’s curve from the chain corresponding to Laplace(k, tk, k′, tk′|y), values of ψ and σ2 were sampled using an independence sampler. Conditioned on these sampled values, a fixed effect curve was sampled, and conditioned on this sampled fixed effect curve, a random effect curve was sampled. The algorithm used to do this requires a slight addition to Algorithm I. This addition is noted below: This addition to Algorithm I slows it down only marginally.
To sample random effects from the chain corresponding to p(k, tk, k′, tk′, ψ, σ2|y), steps (2) and (3) given above were just added to Algorithm II. In other words, a fixed and random effect curve were drawn for every random effects covariance matrix drawn. Figure 10 shows the pointwise median curve for the selected cow drawn for each chain, along with pointwise 95% credible intervals. The estimated cow-specific curves and credible intervals for both methods are similar.
We have proposed a flexible approach to model multiple curves using free-knot splines. To identify the number and location of fixed and random effect knots in this flexible mixed-effects regression spline, sampling from Laplace(k, tk, k′, tk′|y) is highly preferred to sampling from Plugged In (k, tk, k′, tk′|y) and marginally preferred to sampling from p(k, tk, k′, tk′, ψ, σ2|y). The posterior sample corresponding to the Laplace approximation is more accurate than that of the plugged-in approximation. The disparity between the Laplace approximation and the plugged-in approximation is most pronounced in the different number of random effect knots fit to the simulated data sets. The Laplace method clearly seems to penalize models with unnecessarily large values of k′ more appropriately than the plugged-in method. In addition, as measured by our simulations, the posterior based on the Laplace approximation is much more efficient to sample from (and just as accurate) as the exact posterior sample from p(k, tk, k′, tk′, ψ, σ2|y).
We are currently investigating other approaches to the knot selection problem. Rather than sampling from the posterior p (k, tk, k′, tk′|y), we set tk = tk′ and then try to reduce the number of principal component curves associated with the random effects by sampling from the posterior p (k, tk, r|y) where r = the number of random effect principal components retained (Botts, 2005). This is similar in spirit to the methods employed by Shi et al. (1996), and James et al. (2000).
Behseta et al. (2005) propose an alternative hierarchical approach that, at the first level, fits curves separately for each subject, allowing each subject to have her own set of knots. However, their methods are only appropriate for large within-subject sample sizes.
The (j, k)th element in the matrix Iψ, can be calculated as
This section gives the details on Algorithm I. The details show how a move is made from a set of old knots to a new set of knots. Note that the constants which appear in this algorithm (.4 and .2) were selected to optimize its mixing and acceptance probability. The old set of knots will be denoted as .
The first step of the algorithm involves a step in the space of the fixed effect knots, Ωk. Note that Ωk is the sum of spaces Ω1, Ω2, … where Ωj = [a, b]j. Therefore Ωk can be written as . The first decision to be made is whether to give birth, relocate, or kill a fixed knot. These probabilities are labeled as bF, rF, and dF, respectively, and are calculated as
and rF (kold) = 1 – bF (kold) – dF (kold), where pk (kold) is the prior distribution assigned to k evaluated at kold.
If birth is chosen, a fixed effect knot, , is selected at random and a new knot is given birth to. The new knot, , is sampled from the distribution where is the normal distribution truncated at a and b with mean at and variance .2. If death is selected, then one of the fixed effect knots is uniformly selected and killed. When relocation is chosen, a fixed effect knot, , is uniformly selected and then relocated to a position where . The jump probabilities corresponding to each move is given in Table 2. Note: h (y|x) = ϕ (y; x, .2) (Φ (b; x, .2) – Φ (a; x, .2))−1 where ϕ (·; μ, τ2) is the normal density with mean μ and variance τ2, and Φ (·; μ, τ2) is the cumulative distribution function of a normal density with mean μ and variance τ2.
A move must then be made with the random effect knots. This is the second step in the algorithm. As with the fixed effect knots, one of three moves can be made in Ωk′, the space of random effect knots: birth, death, or relocation. The probabilities associated with each are denoted as bR, dR, and rR, respectively. They are calculated as
and , where is the prior distribution of k′ evaluated at . If death is selected, then one of the random effect knots is uniformly selected and then killed. If birth is selected, then one of the random effect knots is uniformly selected (call this ), and a knot is added to the set of random effect knots (call this knot ). Note that . If relocation is selected, then one of the random effect knots is uniformly selected, , and moved to another location (call this ). Again, . The jump probabilities associated with these moves are given in Table 3.
We show that the acceptance probabilities given in Section 4 are such that detailed balance is satisfied. In this case, it needs to be shown that
(Note: throughout the rest of this proof, tknew will be denoted as tkn, kold will be denoted as ko, and so on). Assume that both a fixed and random knot are being added. In this case,
Assume that the ratios within each min argument are < 1 (if this is not the case, the reciprocal of the ratios will be less than 1, and the proof can be done in the other direction), and recall that
With these four observations it is clear that the above expression reduces to
This argument shows detailed balance holds for the given acceptance ratios when a fixed and random effect knot are added. Similar arguments can be made to prove detailed balance when a fixed and/or random effect knot is deleted and/or relocated.
As mentioned in the main text of this paper, Algorithm II is only a slight variation of Algorithm I. In Algorithm II, candidate values of (k, tk) are sampled just as they are in Algorithm I. Conditioned on each sampled pair of knots, candidate values of k′, tk′, ψ and σ2 are sampled. These candidate values are denoted ψnew and and they are sampled from the multivariate normal distribution
Note that and 2 are the maximum likelihood estimates of ψ and 2 corresponding to the model with fixed effect knots at tkold and random effect knots at .
The jump probabilities corresponding to step 2 of Algorithm II are the same as those in step 2 of Algorithm I. The jump probabilities corresponding to step 3 of Algorithm II are given in Table 4. Note: MVN3(z, w) is the multivariate density, with mean (, log(2)) and covariance evaluated at the vector (z, w). In this case, and 2 are the maximum likelihood estimates calculated under the model with fixed effect knots at tkold and random effect knots at .
It can be shown, using methods similar to those given before, that these jump probabilities (along with the acceptance probabilities of Algorithm II given in Section 5) guarantee detailed balance.
Proof of Theorem 1
We first prove that the maximum likelihood estimators of α, Σγ, and σ2 are consistent with respect to n, the number of subjects in the study. To be more specific, we prove that , and 2 converge to their true values as n goes to infinity. Proving the consistency of these estimators is not straight-forward since the observed data for all n subjects, y1, y2, …, yn, are not i.i.d. (recall that they are independent yet not identically distributed since ).
Hooper (1993) and Demidenko (2004) show, however, that if one considers the design matrices of the ith individual (BFi, BRi) and the number of observations for the ith individual (mi) as random, independent of εi and γi, and part of the observed data, then the i.i.d. assumption of the data is satisfied. For the problem presented in this paper, such considerations make sense. The design matrices corresponding to individual i, BFi and BRi, can be thought of as random since the points in time at which the mi observations occur, xi, vary. If the number of observations recorded for all individuals, m1, m2, …, mn, are i.i.d. (as we assume in Condition (2)), and the times at which these observations occur, x1, x2, …, xn, are i.i.d. (as we assume in Condition (4)), then the sets are i.i.d. It follows that the maximum likelihood estimators of α, ψ, and σ2 converge to their true values as n, the number of subjects, approaches ∞.
We now arrive at the result in Theorem 1 by examining the behavior of log as n → ∞. Recall that , making the model with knots at nested within the model with knots at .
where , with
From this, we get that
Gelfand and Dey (1994) show that if the models are nested (which is guaranteed by Condition (1)) and if the maximum likelihood estimators exist, then Kn = O(1). Note that existence of the maximum likelihood estimators follows from Condition (3). Demidenko (2004) shows that if , then the maximum likelihood estimators of α, ψ, and σ2 exist. Since Kn is bounded, for sufficiently large values of n, Kn becomes negligible compared to , and we can say that
and it follows that
Carsten H. Botts, Department of Mathematics and Statistics, Williams College, Williamstown, Massachusetts 01267, USA, Email: ude.smailliw@sttobc.
Michael J. Daniels, Department of Epidemiology and Biostatistics, and Department of Statistics, University of Florida, Gainesville, Florida 32611, USA, Email: ude.lfu.tats@sleinadm.