
Biometrics. Author manuscript; available in PMC 2010 May 25. Published in final edited form as: Biometrics, published online 2009 April 13. doi: 10.1111/j.1541-0420.2009.01240.x. PMCID: PMC2875374; NIHMSID: NIHMS118524.


We propose a double-penalized likelihood approach for simultaneous model selection and estimation in semiparametric mixed models for longitudinal data. Two types of penalties are jointly imposed on the ordinary log-likelihood: the roughness penalty on the nonparametric baseline function and a nonconcave shrinkage penalty on the linear coefficients to achieve model sparsity. Compared to existing estimating-equation-based approaches, our procedure provides valid inference for data that are missing at random, and is more efficient if the specified model is correct. Another advantage of the new procedure is its easy computation for both regression components and variance parameters. We show that the double penalized problem can be conveniently reformulated into a linear mixed model framework, so that existing software can be directly used to implement our method. For the purpose of model inference, we derive both frequentist and Bayesian variance estimates for the estimated parametric and nonparametric components. Simulations are used to evaluate and compare the performance of our method to existing ones. We then apply the new method to a real data set from a lactation study.

Semiparametric mixed models (SPMMs; Diggle et al., 2002; Zhang et al., 1998) are a useful extension of linear mixed models (Laird and Ware, 1982; Verbeke and Molenberghs, 2000; Diggle et al., 2002) and provide a flexible framework for the analysis of longitudinal data. Many authors have studied semiparametric models for longitudinal data in various setups (e.g., Wang, 1998; He et al., 2002; Diggle et al., 2002; Ruppert et al., 2003; Fan and Li, 2004; Chen and Jin, 2006). An SPMM uses parametric fixed effects to represent covariate effects and a smooth function to model the time effect, modeling the within-subject correlation with random effects and stochastic processes. Zhang et al. (1998) adopted a penalized likelihood approach based on smoothing splines, which is computationally efficient and can be conveniently implemented in standard software.

In longitudinal studies, there are often a large number of covariates, but usually not all of them are predictive of the response. For example, in a longitudinal lactation study (Sowers et al., 1993), 115 pregnant women were initially enrolled and were scheduled to have bone mineral density (BMD) at lumbar spine measured at four postpartum time points within 18 months. Meanwhile, measurements of many covariates describing the participants’ physical characteristics, lactation practice and hormonal environment were also taken. One objective of the study is to investigate the pattern of BMD at lumbar spine for postpartum women and to identify, from those potential variables, the covariates that are associated with it. Preliminary analysis indicated that on average the BMD at lumbar spine initially declined and then gradually rebounded, and a parametric function may not be adequate to describe this pattern. This motivates the use of an SPMM to model the association of the postpartum BMD at lumbar spine with other covariates, and the need to conduct variable selection in SPMMs.

Many variable selection methods have been developed for linear regression models with independent data, such as best subset selection, stepwise selection and shrinkage methods. However, little work had been done for semiparametric models in longitudinal data settings until the recent work of Fan and Li (2004). In that pioneering paper, Fan and Li first proposed two procedures to initially estimate the regression coefficients: the difference-based estimator (DBE) and profile least squares. They then used the local polynomial regression technique to estimate the nonparametric component, and variable selection was achieved by imposing the SCAD penalty (Fan and Li, 2001) on the parametric linear covariate effects. As shown in Fan and Li (2004), their methods ignore the correlation in longitudinal data but are nevertheless effective within the class of working-independence estimators. It is well known that the estimating equation approach is robust to mis-specification of the correlation structure in the data when there are no missing data or the missing data mechanism is missing completely at random (MCAR). However, there are some missing data in the motivating lactation study (see Section 6 for more details). To guard against the possible problems associated with missing data, we propose in this paper a double penalized likelihood approach that explicitly takes the data correlation into account while conducting model selection and estimation. In particular, our procedure uses random effects to describe the subject-specific effects and uses Gaussian stochastic processes to model the extra temporal correlation among observations within subjects. Since our approach is likelihood based, it is robust to the missing at random (MAR) mechanism, and the resulting estimators are more efficient when the models for the mean and variance structures are correctly specified.
Moreover, our method is based on the smoothing spline framework for nonparametric component estimation, which allows us to reformulate the entire problem as a linear mixed effects model (LMM) and hence greatly facilitates the computation. We also show that the variance parameters can be jointly estimated with the smoothing parameters in a unified fashion.

To avoid terminology confusion, we would like to point out that the notion of “double penalty” has been used by various researchers in different contexts. For example, Lin and Zhang (1999) used a double penalized quasi-likelihood approach to make inference for generalized additive mixed models, where roughness penalties were imposed on additive nonparametric functions and a quadratic penalty was applied to the random effects. Zou and Hastie (2005) proposed the elastic net for linear model estimation by using a combined penalty of the form ${\lambda}_{1}{\sum}_{j}\mid {\beta}_{j}\mid +{\lambda}_{2}{\sum}_{j}{\beta}_{j}^{2}$, where *β*_{j} are the regression coefficients. Lu and Zhang (2006) and Lu (2006) proposed the functional smooth lasso for functional linear models.

The rest of the article is organized as follows. Section 2 first describes the SPMM framework and necessary notation, and then introduces the main method, the double-penalized likelihood method for SPMMs. Section 3 describes the computational algorithm for finding the double penalized likelihood estimators. Section 4 derives the variance estimates for the parametric and nonparametric components from both frequentist and Bayesian perspectives. In Section 5 we demonstrate the effectiveness of our method through simulation studies. We illustrate our method through an application to the data from the lactation study in Section 6. We conclude the article with a discussion in Section 7.

Suppose that in a longitudinal study there are *m* subjects, with the *i*th subject having *n*_{i} observations over time. We consider the following semiparametric mixed model for the response *y*_{ij} of subject *i* at time *t*_{ij}:

$${y}_{ij}=f({t}_{ij})+{\mathbf{x}}_{ij}^{T}\mathit{\beta}+{\mathbf{z}}_{ij}^{T}{\mathbf{b}}_{i}+{U}_{i}({t}_{ij})+{\epsilon}_{ij},$$

(1)

where *f*(*t*) is an arbitrary twice-differentiable smooth function, *β* is the *d* × 1 fixed effects vector for a potentially large number of covariates **x**_{ij} from which important covariates are to be selected, **b**_{i} is a vector of random effects with design vector **z**_{ij}, *U*_{i}(*t*) is a mean-zero Gaussian stochastic process modeling the extra serial correlation within subjects, and *ε*_{ij} is the measurement error.

We define some matrix notation for convenience. Define **Y**_{i} = (*y*_{i1}, …, *y*_{in_i})^{T} and stack **Y** = (**Y**_{1}^{T}, …, **Y**_{m}^{T})^{T}; define **X**, **Z**_{i} and the corresponding stacked matrices analogously. Let **f** denote the vector of values of *f*(·) at the *r* distinct time points, and let **N** be the incidence matrix mapping these distinct time points to the observations.

For fixed variance components **θ**, the log-likelihood of (*β*, **f**) given the data **Y** is, up to a constant,

$$\ell (\beta ,\mathbf{f};\mathbf{Y})=-\frac{1}{2}log\mid \mathbf{V}\mid -\frac{1}{2}{(\mathbf{Y}-\mathbf{X}\mathit{\beta}-\mathbf{Nf})}^{T}{\mathbf{V}}^{-1}(\mathbf{Y}-\mathbf{X}\mathit{\beta}-\mathbf{Nf}),$$

(2)

where **V** = diag{**V**_{1},…, **V**_{m}} and ${\mathbf{V}}_{i}={\mathbf{Z}}_{i}\mathbf{D}{\mathbf{Z}}_{i}^{T}+{\mathbf{\Gamma}}_{i}+{\sigma}^{2}{\mathbf{I}}_{{n}_{i}}$, with **D** the covariance matrix of the random effects, **Γ**_{i} the covariance matrix induced by the process *U*_{i}(*t*), and *σ*² the measurement error variance. To achieve both model selection and estimation in SPMM (1), we propose to maximize the double penalized log-likelihood (DPL) function:

$${\ell}_{dp}(\mathit{\beta},f(\xb7);\mathbf{Y})=\ell (\beta ,\mathbf{f};\mathbf{Y})-\frac{{\lambda}_{1}}{2}{\int}_{{T}_{1}}^{{T}_{2}}{\{{f}^{\u2033}(t)\}}^{2}dt-n\sum _{j=1}^{d}{p}_{{\lambda}_{2}}(\mid {\beta}_{j}\mid ),$$

(3)

where *λ*_{1} > 0 is a smoothing parameter controlling the balance between the goodness of fit and the roughness of the estimated *f*(*t*), *T*_{1} and *T*_{2} specify the range of *t*, and ${p}_{{\lambda}_{2}}(\cdot)$ is the SCAD penalty function of Fan and Li (2001) with tuning parameter *λ*_{2}.
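As a concrete reference, the SCAD penalty of Fan and Li (2001) and its derivative (which supplies the LQA weights ${p}_{{\lambda}_{2}}^{\prime}(\mid {\beta}_{j}\mid )/\mid {\beta}_{j}\mid$ used later) can be sketched as below. The function names are ours and this is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty p_lambda(|beta|) of Fan and Li (2001), elementwise."""
    b = np.abs(np.asarray(beta, dtype=float))
    return np.where(
        b <= lam,
        lam * b,                                      # linear near zero (lasso-like)
        np.where(
            b <= a * lam,                             # quadratic transition region
            -(b**2 - 2 * a * lam * b + lam**2) / (2 * (a - 1)),
            (a + 1) * lam**2 / 2,                     # constant: no shrinkage of large effects
        ),
    )

def scad_deriv(beta, lam, a=3.7):
    """First derivative p'_lambda(|beta|), used to build the LQA weight matrix."""
    b = np.abs(np.asarray(beta, dtype=float))
    return lam * np.where(b <= lam, 1.0, np.maximum(a * lam - b, 0.0) / ((a - 1) * lam))
```

The penalty is continuous at the knots *λ* and *aλ* and flattens out beyond *aλ*, which is what yields nearly unbiased estimates of large coefficients.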

In this section, we formulate a linear mixed model (LMM) representation for the SPMM, and describe the computational procedures for obtaining parameter estimates based on this LMM representation.

By the fact that the maximizer $\widehat{f}(t)$ of (3) is a cubic smoothing spline, together with Theorem 2.1 of Green and Silverman (1994), the DPL function (3) can be rewritten as

$${\ell}_{dp}(\mathit{\beta},f(\xb7);\mathbf{Y})=\ell (\beta ,\mathbf{f};\mathbf{Y})-\frac{{\lambda}_{1}}{2}{\mathbf{f}}^{T}\mathbf{Kf}-n\sum _{j=1}^{d}{p}_{{\lambda}_{2}}(\mid {\beta}_{j}\mid ),$$

(4)

where **K** is the non-negative definite smoothing matrix. Following Green (1987), we write **f** = **Tδ** + **Ba**, where **T** and **B** are known matrices determined by the distinct time points, and **δ** and **a** are the corresponding coefficient vectors. Treating **a** as random effects with $\mathbf{a}\sim N(\mathbf{0},\tau {\mathbf{I}}_{r})$, where *τ* = 1/*λ*_{1}, and writing **B**_{*} = **NB**, the DPL (4) becomes

$${\ell}_{dp}({\mathit{\beta}}_{\ast};\mathbf{Y})=-\frac{1}{2}{(\mathbf{Y}-{\mathbf{X}}_{\ast}{\mathit{\beta}}_{\ast})}^{T}{\mathbf{V}}_{\ast}^{-1}(\mathbf{Y}-{\mathbf{X}}_{\ast}{\mathit{\beta}}_{\ast})-n\sum _{j=1}^{d}{p}_{{\lambda}_{2}}(\mid {\beta}_{j}\mid ),$$

(5)

where
${\mathbf{V}}_{\ast}=\mathbf{V}+\tau {\mathbf{B}}_{\ast}{\mathbf{B}}_{\ast}^{T}$. The objective function (5) can be regarded as the penalized log-likelihood function of *β*_{*} from the following linear mixed model with the same SCAD penalty for ** β**:

$$\mathbf{Y}={\mathbf{X}}_{\ast}{\mathit{\beta}}_{\ast}+{\mathbf{B}}_{\ast}\mathbf{a}+{\epsilon}_{\ast},$$

(6)

where *β*_{*} = (*δ*^{T}, *β*^{T})^{T}, **X**_{*} = [**NT**, **X**], $\mathbf{a}\sim N(\mathbf{0},\tau {\mathbf{I}}_{r})$, and ${\mathit{\epsilon}}_{\ast}\sim N(\mathbf{0},\mathbf{V})$.

For fixed parameters *λ*_{1}, *λ*_{2} and **θ**, direct maximization of (5) is still difficult due to the singularity of the SCAD function at the origin. Following Fan and Li (2001, 2004), we adopt the iterative ridge regression approach via the local quadratic approximation (LQA). For a small threshold constant, any component of the current estimate ${\widehat{\mathit{\beta}}}_{\ast}^{0}$ whose magnitude falls below the threshold is set to zero; for the remaining components, the SCAD penalty is approximated by a quadratic function around ${\widehat{\mathit{\beta}}}_{\ast}^{0}$, so that maximizing (5) is approximated by

$$\underset{\mathit{\beta}}{max}{\ell}_{dp}({\widehat{\mathit{\beta}}}_{\ast}\mid {\widehat{\mathit{\beta}}}_{\ast}^{0})\approx -\frac{1}{2}{(\mathbf{Y}-{\mathbf{X}}_{\ast}{\widehat{\mathit{\beta}}}_{\ast})}^{T}{\mathbf{V}}_{\ast}^{-1}(\mathbf{Y}-{\mathbf{X}}_{\ast}{\widehat{\mathit{\beta}}}_{\ast})-\frac{1}{2}{\widehat{\mathit{\beta}}}_{\ast}^{T}{\mathbf{\sum}}_{{\lambda}_{2}}({\widehat{\mathit{\beta}}}_{\ast}^{0}){\widehat{\mathit{\beta}}}_{\ast},$$

where
${\mathbf{\Sigma}}_{{\lambda}_{2}}(\mathit{\beta})=\text{diag}\{0,0,{p}_{{\lambda}_{2}}^{\prime}(\mid {\beta}_{1}\mid )/\mid {\beta}_{1}\mid ,\dots ,{p}_{{\lambda}_{2}}^{\prime}(\mid {\beta}_{d}\mid )/\mid {\beta}_{d}\mid \}$. For fixed ${\mathit{\theta}}_{\ast}={(\tau ,{\mathit{\theta}}^{T})}^{T}$, iterating the resulting ridge regression until convergence yields

$${\widehat{\mathit{\beta}}}_{\ast}={\left\{{\mathbf{X}}_{\ast}^{T}{\mathbf{V}}_{\ast}^{-1}{\mathbf{X}}_{\ast}+n{\mathbf{\sum}}_{{\lambda}_{2}}({\widehat{\mathit{\beta}}}_{\ast}^{0})\right\}}^{-1}{\mathbf{X}}_{\ast}^{T}{\mathbf{V}}_{\ast}^{-1}\mathbf{Y},$$

(7)

where
${\mathbf{V}}_{\ast}^{-1}={\mathbf{V}}^{-1}-{\mathbf{V}}^{-1}{\mathbf{B}}_{\ast}{({\tau}^{-1}{\mathbf{I}}_{r}+{\mathbf{B}}_{\ast}^{T}{\mathbf{V}}^{-1}{\mathbf{B}}_{\ast})}^{-1}{\mathbf{B}}_{\ast}^{T}{\mathbf{V}}^{-1}$, which is computationally more efficient when *r* (the number of distinct time points) is much smaller than the total number of observations *n*.
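A quick numerical check of this inversion identity (a Woodbury-type formula), together with one ridge update of the form (7), might look as follows; the matrix sizes and the LQA weight matrix are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 30, 5
tau = 0.7

A = rng.standard_normal((n, n))
V = A @ A.T + n * np.eye(n)                # any SPD matrix serves for the identity check
B = rng.standard_normal((n, r))            # plays the role of B_*

V_star = V + tau * B @ B.T
Vinv = np.linalg.inv(V)

# Woodbury form: V_*^{-1} = V^{-1} - V^{-1} B (tau^{-1} I_r + B^T V^{-1} B)^{-1} B^T V^{-1}
inner = np.linalg.inv(np.eye(r) / tau + B.T @ Vinv @ B)
V_star_inv = Vinv - Vinv @ B @ inner @ B.T @ Vinv
assert np.allclose(V_star_inv, np.linalg.inv(V_star))

# One LQA ridge update of the form (7); Sigma holds illustrative LQA weights
X = rng.standard_normal((n, 4))
Y = rng.standard_normal(n)
Sigma = np.diag([0.0, 0.0, 0.3, 0.3])      # zeros for the spline coefficients delta
beta = np.linalg.solve(X.T @ V_star_inv @ X + n * Sigma, X.T @ V_star_inv @ Y)
```

Only the *r* × *r* inner matrix needs to be inverted at each iteration, which is the source of the computational saving when *r* ≪ *n*.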

Very recently, Zou and Li (2008) proposed the local linear approximation (LLA) algorithm to solve the SCAD problem. Applying the LLA algorithm to (5) leads to the following optimization problem:

$$\underset{\mathit{\beta}}{max}{\ell}_{dp}({\widehat{\mathit{\beta}}}_{\ast}\mid {\widehat{\mathit{\beta}}}_{\ast}^{0})\approx -\frac{1}{2}{(\mathbf{Y}-{\mathbf{X}}_{\ast}{\widehat{\mathit{\beta}}}_{\ast})}^{T}{\mathbf{V}}_{\ast}^{-1}(\mathbf{Y}-{\mathbf{X}}_{\ast}{\widehat{\mathit{\beta}}}_{\ast})-\sum _{j=1}^{d}{p}_{{\lambda}_{2}}^{\prime}(\mid {\widehat{\beta}}_{j}^{0}\mid )\mid {\beta}_{j}\mid ,$$

which provides an alternative way to solve the problem.

We now outline an iterative algorithm that alternately eliminates unimportant variables and updates parameter estimates. First we fit the full linear mixed model (6) in SAS by including all covariates in the model, and compute the initial values $({\widehat{\mathit{\beta}}}_{\ast [1]},{\widehat{\mathbf{a}}}_{[1]})$ and $({\widehat{\tau}}_{[1]},{\widehat{\mathit{\theta}}}_{[1]})$. For a given *λ*_{2}, we propose the following iterative variable selection algorithm:

*Step 1*. Initialize with *s* = 1 and $({\widehat{\mathit{\beta}}}_{\ast [1]},{\widehat{\mathbf{a}}}_{[1]},{\widehat{\tau}}_{[1]},{\widehat{\mathit{\theta}}}_{[1]})$.

*Step 2*. Compute ${\widehat{\mathit{\beta}}}_{\ast [s+1]}$ using the iterative ridge regression (7) based on the current values ${\widehat{\mathit{\beta}}}_{\ast [s]}$ and $({\widehat{\tau}}_{[s]},{\widehat{\mathit{\theta}}}_{[s]})$.

*Step 3*. Obtain ${\widehat{\tau}}_{[s+1]}$ and ${\widehat{\mathit{\theta}}}_{[s+1]}$ using REML based on the currently selected important variables.

*Step 4*. Let *s* = *s* + 1. Go to *Step 2* until the selected covariates converge to a stable set.

Now we describe the REML estimation procedure in *Step 3*. Denote by **X**_{[}_{s}_{]} the subset of important variables selected from **X** at the *s*th iteration. The REML log-likelihood of (*τ*, ** θ**) at this iteration is

$${\ell}_{R}^{[s]}(\tau ,\mathit{\theta};\mathbf{Y})=-\frac{1}{2}log\mid {\mathbf{V}}_{\ast}\mid -\frac{1}{2}log\mid {\mathbf{X}}_{\ast [s]}^{T}{\mathbf{V}}_{\ast}^{-1}{\mathbf{X}}_{\ast [s]}\mid -\frac{1}{2}{(\mathbf{Y}-{\mathbf{X}}_{\ast [s]}{\widehat{\mathit{\beta}}}_{\ast [s]})}^{T}{\mathbf{V}}_{\ast}^{-1}(\mathbf{Y}-{\mathbf{X}}_{\ast [s]}{\widehat{\mathit{\beta}}}_{\ast [s]}),$$

where **X**_{*[s]} = [**NT**, **X**_{[s]}], and
${\widehat{\mathit{\beta}}}_{\ast [s]}={({\mathbf{X}}_{\ast [s]}^{T}{\mathbf{V}}_{\ast}^{-1}{\mathbf{X}}_{\ast [s]})}^{-1}{\mathbf{X}}_{\ast [s]}^{T}{\mathbf{V}}_{\ast}^{-1}\mathbf{Y}$ is the MLE of *β*_{*} based on the selected important variables **X**_{[s]}. Differentiating
${\ell}_{R}^{[s]}(\tau ,\mathit{\theta};\mathbf{Y})$ with respect to *τ* and *θ*_{k} (the *k*th component of **θ**) and setting the derivatives to zero yields the estimating equations

$$-\frac{1}{2}\text{tr}(\mathbf{P}{\mathbf{B}}_{\ast}{\mathbf{B}}_{\ast}^{T})+\frac{1}{2}{(\mathbf{Y}-{\mathbf{X}}_{[s]}{\widehat{\mathit{\beta}}}_{[s]}-\mathbf{N}\widehat{\mathbf{f}})}^{T}{\mathbf{V}}^{-1}{\mathbf{B}}_{\ast}{\mathbf{B}}_{\ast}^{T}{\mathbf{V}}^{-1}(\mathbf{Y}-{\mathbf{X}}_{[s]}{\widehat{\mathit{\beta}}}_{[s]}-\mathbf{N}\widehat{\mathbf{f}})=0$$

(8)

$$-\frac{1}{2}\text{tr}(\mathbf{P}\frac{\partial \mathbf{V}}{\partial {\theta}_{k}})+\frac{1}{2}{(\mathbf{Y}-{\mathbf{X}}_{[s]}{\widehat{\mathit{\beta}}}_{[s]}-\mathbf{N}\widehat{\mathbf{f}})}^{T}{\mathbf{V}}^{-1}\frac{\partial \mathbf{V}}{\partial {\theta}_{k}}{\mathbf{V}}^{-1}(\mathbf{Y}-{\mathbf{X}}_{[s]}{\widehat{\mathit{\beta}}}_{[s]}-\mathbf{N}\widehat{\mathbf{f}})=0,$$

(9)

where
$\mathbf{P}={\mathbf{V}}_{\ast}^{-1}-{\mathbf{V}}_{\ast}^{-1}{\mathbf{X}}_{\ast [s]}{({\mathbf{X}}_{\ast [s]}^{T}{\mathbf{V}}_{\ast}^{-1}{\mathbf{X}}_{\ast [s]})}^{-1}{\mathbf{X}}_{\ast [s]}^{T}{\mathbf{V}}_{\ast}^{-1}$. Since **V**_{*} is no longer block-diagonal, direct inversion of **V**_{*} may be intractable. We can tackle this problem by using the identity **P** = **V**^{−1} − **V**^{−1}[**X**_{*}, **B**_{*}]**C**^{−1}[**X**_{*}, **B**_{*}]^{T}**V**^{−1}, where **C** is the coefficient matrix of the following system of equations, with **W** = **V**^{−1}:

$$\left[\begin{array}{cc}{\mathbf{X}}_{\ast}^{T}\mathbf{W}{\mathbf{X}}_{\ast}& {\mathbf{X}}_{\ast}^{T}\mathbf{W}{\mathbf{B}}_{\ast}\\ {\mathbf{B}}_{\ast}^{T}\mathbf{W}{\mathbf{X}}_{\ast}& {\mathbf{B}}_{\ast}^{T}\mathbf{W}{\mathbf{B}}_{\ast}+{\lambda}_{1}\mathbf{I}\end{array}\right]\phantom{\rule{0.38889em}{0ex}}\left[\begin{array}{c}{\mathit{\beta}}_{\ast}\\ \mathbf{a}\end{array}\right]=\left[\begin{array}{c}{\mathbf{X}}_{\ast}^{T}\mathbf{W}\mathbf{Y}\\ {\mathbf{B}}_{\ast}^{T}\mathbf{W}\mathbf{Y}\end{array}\right].$$

The Fisher scoring algorithm can then be used to solve (8) and (9) for *τ* and ** θ**.
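The equivalence underlying this block system can be verified numerically: with **W** = **V**^{−1} and *λ*_{1} = 1/*τ*, solving the block equations for *β*_{*} reproduces the generalized least squares estimator based on **V**_{*} = **V** + *τ***B**_{*}**B**_{*}^{T}. The SCAD penalty is omitted in this sketch and all matrices are random illustrative inputs.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, r = 25, 3, 4
tau = 0.5
lam1 = 1.0 / tau                           # lambda_1 = 1 / tau

A = rng.standard_normal((n, n))
V = A @ A.T + n * np.eye(n)
W = np.linalg.inv(V)                       # take W = V^{-1}
X = rng.standard_normal((n, p))            # plays the role of X_*
B = rng.standard_normal((n, r))            # plays the role of B_*
Y = rng.standard_normal(n)

# Henderson-type block system from the text
top = np.hstack([X.T @ W @ X, X.T @ W @ B])
bot = np.hstack([B.T @ W @ X, B.T @ W @ B + lam1 * np.eye(r)])
C = np.vstack([top, bot])
rhs = np.concatenate([X.T @ W @ Y, B.T @ W @ Y])
sol = np.linalg.solve(C, rhs)
beta_star, a_hat = sol[:p], sol[p:]

# Equivalent GLS estimator with V_* = V + tau * B B^T
V_star_inv = np.linalg.inv(V + tau * B @ B.T)
beta_gls = np.linalg.solve(X.T @ V_star_inv @ X, X.T @ V_star_inv @ Y)
assert np.allclose(beta_star, beta_gls)
```

Profiling the random coefficients **a** out of the block system yields exactly the GLS normal equations in **V**_{*}, which is why the two computations agree.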

The aforementioned algorithm is based on fixed SCAD tuning parameters. Fan and Li (2001) recommended *a* = 3.7 as it minimizes the Bayesian risk, and thus we set *a* = 3.7 in our implementation. We propose to tune *λ*_{2} using the Bayesian Information Criterion (BIC) (Schwarz, 1978). For a fixed *λ*_{2}, let **X**_{1} and *β*_{1} be the covariate matrix and coefficients, respectively, corresponding to the *q* variables selected by the iterative variable selection algorithm. Fit LMM (6) using the important variables to get $\widehat{\mathbf{Y}}=\mathbf{SY}$, where **S** is a smoother matrix with *q*_{1} = trace(**S**). Then BIC(*λ*_{2}) = $-2{\widehat{\ell}}_{1}+{q}_{1}\,log\,n$, where ${\widehat{\ell}}_{1}=-(n/2)\,log(2\pi )-(1/2)\,log\mid \widehat{\mathbf{V}}\mid -(1/2){(\mathbf{Y}-{\mathbf{X}}_{1}{\widehat{\mathit{\beta}}}_{1}-\mathbf{N}\widehat{\mathbf{f}})}^{T}{\widehat{\mathbf{V}}}^{-1}(\mathbf{Y}-{\mathbf{X}}_{1}{\widehat{\mathit{\beta}}}_{1}-\mathbf{N}\widehat{\mathbf{f}})$. Generalized cross-validation (GCV; Craven and Wahba, 1979) can also be used for tuning *λ*_{2}. Wang et al. (2007) compared BIC and GCV for selecting the SCAD tuning parameter and suggested that BIC leads to consistent model selection, whereas GCV tends to overfit. Model selection results in our simulation studies also favor BIC, and therefore we chose BIC for tuning *λ*_{2}.
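The BIC computation described above reduces to a one-line formula. The grid of (*λ*_{2}, fitted log-likelihood, effective degrees of freedom) triples below is hypothetical and only illustrates how the minimizing *λ*_{2} would be chosen.

```python
import numpy as np

def bic_lambda2(loglik, q1, n):
    """BIC(lambda_2) = -2 * loglik + q1 * log(n), used to tune the SCAD parameter."""
    return -2.0 * loglik + q1 * np.log(n)

# Hypothetical grid: (lambda2, fitted log-likelihood l1_hat, effective df q1)
fits = [(0.01, -120.0, 8.0), (0.05, -121.5, 5.2), (0.20, -130.0, 3.1)]
n = 200
best = min(fits, key=lambda f: bic_lambda2(f[1], f[2], n))
```

In this toy grid the middle value wins: the small *λ*_{2} pays too large a complexity penalty, while the large one sacrifices too much fit.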

We derive the frequentist and Bayesian variance formulas for $\widehat{\mathit{\beta}}$ and $\widehat{\mathbf{f}}$. The proposed variance estimates are evaluated via simulations. From the frequentist point of view, var(**Y**|*t*, **x**) = **V**. At convergence, the estimator (7) is linear in **Y**: we can write ${\widehat{\mathit{\beta}}}_{\ast}={({\widehat{\mathit{\delta}}}^{T},{\widehat{\mathit{\beta}}}^{T})}^{T}=\mathbf{QY}$ with $\mathbf{Q}={\{{\mathbf{X}}_{\ast}^{T}{\mathbf{V}}_{\ast}^{-1}{\mathbf{X}}_{\ast}+n{\mathbf{\Sigma}}_{{\lambda}_{2}}({\widehat{\mathit{\beta}}}_{\ast}^{0})\}}^{-1}{\mathbf{X}}_{\ast}^{T}{\mathbf{V}}_{\ast}^{-1}$. Partitioning $\mathbf{Q}={({\mathbf{Q}}_{1}^{T},{\mathbf{Q}}_{2}^{T})}^{T}$ so that $\widehat{\mathit{\delta}}={\mathbf{Q}}_{1}\mathbf{Y}$ and $\widehat{\mathit{\beta}}={\mathbf{Q}}_{2}\mathbf{Y}$, we obtain

$${\widehat{var}}_{F}(\widehat{\mathit{\beta}})={\mathbf{Q}}_{2}\widehat{var}(\mathbf{Y}){\mathbf{Q}}_{2}^{T}={\mathbf{Q}}_{2}\widehat{\mathbf{V}}{\mathbf{Q}}_{2}^{T}.$$

(10)
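As a sanity check on the sandwich form (10): when no shrinkage weight is active ($\mathbf{\Sigma}_{\lambda_2} = \mathbf{0}$) and **V**_{*} = **V**, the matrix **Q** reduces to the GLS projection, and **Q**₂**V̂Q**₂^{T} collapses to the usual GLS covariance. A numerical illustration with random inputs (the spline block is omitted for simplicity):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 20, 3
A = rng.standard_normal((n, n))
V = A @ A.T + n * np.eye(n)
X = rng.standard_normal((n, p))
Vinv = np.linalg.inv(V)

# With no active penalty and V_* = V, beta_hat = Q Y with the GLS projection Q
Q = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv)

# Sandwich variance (10): Q V Q^T collapses to the GLS covariance
var_F = Q @ V @ Q.T
assert np.allclose(var_F, np.linalg.inv(X.T @ Vinv @ X))
```

With an active SCAD weight the sandwich no longer collapses, which is precisely why (10) is reported in full.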

Denote $\widehat{\mathbf{a}}({\mathit{\beta}}_{\ast})={\mathbf{S}}_{a}\mathbf{Y}$, where ${\mathbf{S}}_{a}=\stackrel{\sim}{\mathbf{A}}(\mathbf{I}-{\mathbf{X}}_{\ast}\mathbf{Q})$ and
$\stackrel{\sim}{\mathbf{A}}={({\lambda}_{1}\mathbf{I}+{\mathbf{B}}_{\ast}^{T}{\mathbf{V}}^{-1}{\mathbf{B}}_{\ast})}^{-1}{\mathbf{B}}_{\ast}^{T}{\mathbf{V}}^{-1}$. Therefore $\widehat{\mathbf{f}}=\mathbf{T}\widehat{\mathit{\delta}}+\mathbf{B}\widehat{\mathbf{a}}=(\mathbf{T}{\mathbf{Q}}_{1}+\mathbf{B}{\mathbf{S}}_{a})\mathbf{Y}$, and its estimated frequentist variance is

$${\widehat{var}}_{F}(\widehat{\mathbf{f}}\mid t,\mathbf{x})={\widehat{\sigma}}^{2}(\mathbf{T}{\mathbf{Q}}_{1}+\mathbf{B}{\mathbf{S}}_{a}){(\mathbf{T}{\mathbf{Q}}_{1}+\mathbf{B}{\mathbf{S}}_{a})}^{T}.$$

(11)

From a Bayesian perspective, the double penalized likelihood structure indicates that *f*(*t*) has a prior of the form **f** = **Tδ** + **Ba** with $\mathbf{a}\sim N(\mathbf{0},\tau {\mathbf{I}}_{r})$ and a flat prior on **δ**. Writing $\widehat{\mathit{\gamma}}$ for the estimated fixed effects and ${\mathbf{b}}_{\ast}$ for the combined random effects with prior covariance matrix ${\mathbf{\Sigma}}_{b}$, standard mixed model calculations give

$$var\phantom{\rule{0.16667em}{0ex}}\left(\begin{array}{c}\widehat{\mathit{\gamma}}\\ {\widehat{\mathbf{b}}}_{\ast}-{\mathbf{b}}_{\ast}\end{array}\right)={\mathbf{C}}_{\ast}^{-1}={\left(\begin{array}{cc}{\mathcal{X}}_{\ast}^{T}{\mathbf{V}}^{-1}{\mathcal{X}}_{\ast}& {\mathcal{X}}_{\ast}^{T}{\mathbf{V}}^{-1}{\mathbf{Z}}_{\ast}\\ {\mathbf{Z}}_{\ast}^{T}{\mathbf{V}}^{-1}{\mathcal{X}}_{\ast}& {\mathbf{Z}}_{\ast}^{T}{\mathbf{V}}^{-1}{\mathbf{Z}}_{\ast}+{\mathbf{\sum}}_{b}^{-1}\end{array}\right)}^{-1},$$

where **C**_{*} is the corresponding coefficient matrix of Henderson’s mixed model equations. From this it is straightforward to construct the Bayesian variance-covariance matrices of $\widehat{\mathit{\beta}}$ and $\widehat{\mathbf{f}}$. Denote by ${\mathbf{A}}_{{\mathit{\beta}}_{1},{\mathit{\beta}}_{2}}$ and ${\mathbf{A}}_{\mathit{\delta},\mathbf{a}}$ the blocks of ${\mathbf{C}}_{\ast}^{-1}$ corresponding to **β** and to (**δ**, **a**), respectively. Then

$${var}_{B}(\widehat{\mathit{\beta}})={\mathbf{A}}_{{\mathit{\beta}}_{1},{\mathit{\beta}}_{2}},$$

(12)

$${var}_{B}(\widehat{\mathbf{f}})=[\mathbf{T},\mathbf{B}]{\mathbf{A}}_{\mathit{\delta},\mathbf{a}}{[\mathbf{T},\mathbf{B}]}^{T}.$$

(13)

These Bayesian variance estimates can be viewed as accounting for the biases in $\widehat{\mathit{\beta}}$ and $\widehat{\mathbf{f}}$ due to the imposed penalties (Wahba, 1983). We evaluate and compare the empirical performance of the frequentist and Bayesian variance estimates in the next section.

We conduct simulations to evaluate the performance of our DPL procedure and compare it with the SCAD and LASSO methods of Fan and Li (2004) in both full data and missing data cases. The SCAD and LASSO estimates are computed using Fan and Li’s (2004) method as implemented in their original Matlab programs. Following Fan and Li (2004), we set the bandwidth *h* for local polynomial regression equal to *h*_{0} times the interquartile range of the observed *t*_{ij}’s, and varied the value of *h*_{0}.

We consider three scenarios in the simulation: full data, missing at random (MAR) data and independent data. For the full data scenario, we simulate longitudinal data from the following semiparametric mixed model:

$${y}_{ij}=f({t}_{ij})+{\mathit{x}}_{i}^{T}\mathit{\beta}+{b}_{0i}+{U}_{i}({t}_{ij})+{\epsilon}_{ij},$$

(14)

where *i* = 1,…, *m* (*m* = 40 or 60), and *j* = 1,…, 10. We consider a staggered entry design: the subjects are equally divided into 5 groups, with each group entering at time point 1 through 5; each subject has 10 equally spaced measurements (5 time points in between). We choose *f*(*t*) = 4 sin(2*πt*/4) as the baseline function. By design, there are 50 knots for the smoothing spline fit, and the time *t* ranges from 0 to 4. The true regression coefficients are **β** = (.8, .8, 0, 0, .8, 0, 0, 0)^{T}.
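A sketch of this data-generating mechanism is given below. The entry-time coding, covariate distribution and variance values are our illustrative assumptions, and the stochastic process *U*_{i}(*t*) is omitted for brevity, since the text does not fix these details here.

```python
import numpy as np

rng = np.random.default_rng(3)
m = 40                                          # number of subjects
beta = np.array([.8, .8, 0, 0, .8, 0, 0, 0])    # true coefficients from the text

def f(t):
    return 4.0 * np.sin(2.0 * np.pi * t / 4.0)  # baseline curve from the text

rows = []
for i in range(m):
    entry = i % 5                               # staggered entry: 5 equal groups (coding assumed)
    t = entry * 0.1 + np.linspace(0.0, 3.6, 10) # 10 equally spaced times (spacing assumed)
    x = rng.standard_normal(8)                  # covariates x_i (distribution assumed)
    b0 = rng.normal(scale=1.0)                  # random intercept b_{0i} (variance assumed)
    eps = rng.normal(scale=0.5, size=10)        # measurement error; U_i(t) omitted in this sketch
    y = f(t) + x @ beta + b0 + eps
    rows.append(y)

Y = np.vstack(rows)                             # m x 10 response matrix
```

Each simulated data set then feeds the DPL, SCAD and LASSO fits on the same design.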

For the missing data scenario, we maintain *m* = 40 and simulate MAR data using the main model (14) and the drop-out model in Section 13.3 of Diggle et al. (2002). The *i*th (*i* = 1, 2, ···, 40) subject has no missing values in the first six observations ${y}_{i1},\dots ,{y}_{i6}$; dropout may then occur at each subsequent time point according to the drop-out model, which depends only on observed data and hence yields data missing at random.

Following Wang et al. (2007), we use the median relative model error (MRME) to evaluate the overall performance of a procedure for model selection and estimation in Table 1. By the simulation design described above, we define the overall model error (ME) of a procedure as

$$\text{ME}={(\widehat{\mathit{\beta}}-\mathit{\beta})}^{T}(\widehat{\mathit{\beta}}-\mathit{\beta})+{\int}_{0}^{T}{\{\widehat{f}(t)-f(t)\}}^{2}dt/T,$$

where [0, *T*] with *T* = 4 specifies the range of the variable *t*, and the second term is approximated by averaging $\{\widehat{f}(t)-f(t)\}^{2}$ over the design knots. The MRME is then the median of the ratios of the ME of a selected model to the common ME of the estimates from the full model fitted by the REML approach of Zhang et al. (1998). To present a more comprehensive picture, we also use other criteria in Table 1 for performance evaluation. The columns labeled “Corr.” and “Inc.” denote the average numbers of correct and incorrect zeros in the coefficient estimates over 100 simulation runs. The column labeled “Under-fit” corresponds to the proportion of runs excluding any nonzero coefficients. Similarly, we report the proportion of selecting the exact subset model in the column “Correct-fit” and the proportion of including all three important variables plus some noise variables in the column “Over-fit”.
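The overall model error above can be computed directly; `model_error` is our illustrative helper, with the integral term replaced by the grid average over the design knots as described.

```python
import numpy as np

def model_error(beta_hat, beta, fhat_grid, f_grid):
    """Overall ME: squared coefficient error plus the grid average of {fhat - f}^2."""
    beta_hat = np.asarray(beta_hat, dtype=float)
    beta = np.asarray(beta, dtype=float)
    coef_term = float((beta_hat - beta) @ (beta_hat - beta))
    curve_term = float(np.mean((np.asarray(fhat_grid, dtype=float)
                                - np.asarray(f_grid, dtype=float)) ** 2))
    return coef_term + curve_term
```

Dividing each such ME by the ME of the full-model REML fit, and taking the median across simulation runs, gives the MRME reported in Table 1.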

As can be seen from Table 1, DPL performs very well in terms of all evaluation criteria. The DPL outperforms SCAD and LASSO in terms of the overall MRME criterion, even for the cases of mis-specified correlation structures. Although DPL is occasionally slightly more likely than SCAD and LASSO to under-fit the true model by shrinking some important coefficients to zero, it gives a very competitive performance. In the case of data missing at random, the performance of SCAD and LASSO deteriorates as expected, particularly with respect to MRME due to the poor performance of $\widehat{f}(t)$ (Figure 2), while the performance of DPL is not affected. Although the DPL with a mis-specified correlation structure does not perform as well as the correctly specified DPL, it still gives satisfactory performance and outperforms SCAD and LASSO for a moderate sample size (*m* = 60). In the independent data scenario, although SCAD and LASSO have slightly better performance (as expected), the over-parameterized DPL still delivers a very competitive performance in terms of all evaluation criteria. The estimated variance of the random intercept for the over-parameterized DPL is 0.002, indicating that the DPL approach is flexible enough to produce rather accurate variance component estimates even with an over-parameterized correlation structure.

Table 2 summarizes the estimates of (*β*_{1}, *β*_{2}, *β*_{5}), their relative biases, empirical and model-based standard errors, and 95% coverage probabilities. Since SCAD and LASSO give similar numerical results, we only report DPL and SCAD results here. Compared with SCAD (and LASSO), our estimated *β*_{j}’s have smaller biases in all cases except the independent data scenario. The frequentist and Bayesian standard error formulas derived in Section 4 perform well in most cases: they are close to the empirical estimates, and the 95% coverage probabilities are around the nominal level. One observation is that the derived model-based standard errors tend to be slightly under-estimated; this under-estimation is alleviated when the sample size increases (*m* = 60).

The estimated baseline function $\widehat{f}(t)$ is also evaluated through visualization. We plot and compare the estimated *f*(*t*) and point-wise biases by DPL, LASSO and SCAD, for the full data (*m* = 40) and missing at random data scenarios. For our DPL estimate $\widehat{f}$, we also plot the point-wise empirical and model-based frequentist and Bayesian standard errors, and the coverage probability of 95% confidence intervals. Figure 1(b) shows that in the full data (*m* = 40) case, our approach yields smaller overall biases than the LASSO and SCAD. It can be seen from Figure 1(c) and Figure 1(d) that our standard error formulas for $\widehat{f}$ work well. The Bayesian standard errors are always larger, in that they account for the biases due to the imposed penalties.

Figure 1. Plots for estimated *f*(*t*) in the full data (*m* = 40) scenario based on 100 samples. Plots (a) and (b) show the averaged fit and point-wise bias by the DPL, SCAD and LASSO methods; plot (c) shows the DPL Bayesian and frequentist standard errors against empirical standard errors.

Figure 2 depicts the results for missing at random data. As shown in Figure 2(a) and Figure 2(b), SCAD and LASSO produced large biases due to subjects dropping out after the sixth time point. In contrast, our DPL method maintains consistent performance in estimating *f*(*t*) despite 30% missing data. In Figure 2(c), we notice a disparity between the empirical and the estimated standard errors for *t* ≥ 7, where the dropout starts occurring; this is due to the smaller effective sample size caused by the missing data.

In this section, we apply the proposed model selection and estimation procedure for SPMMs to the longitudinal data from the lactation study introduced in Section 1. A detailed description of the study can be found in Sowers et al. (1993). Briefly, the study originally enrolled 115 pregnant women with 0 or 1 parity who either had no intent to breast-feed their incoming babies or intended to breast-feed for at least 6 months. The study participants were then scheduled to have BMD at lumbar spine measured at 4 time points after the birth of their babies: 2 weeks, 6 months, 12 months and 18 months postpartum. However, due to various logistical reasons, the actual observation times deviate somewhat from the schedule, and several participants missed some scheduled visits. For each woman who made a scheduled visit, information about her physical activity, dietary calcium intake and lactation practice was obtained. At the same time, blood samples were drawn and assays were conducted to measure various serum hormones, including prolactin, parathyroid hormone (PTH) and parathyroid hormone-related peptide (PTHrP), traditionally thought to be related to calcium mobilization for lactating mothers. Since only about 56% of the PTHrP measurements were above the detection limit, PTHrP is dichotomized according to whether or not it was above the limit. One of the study objectives is to identify covariates from this pool of variables that are associated with the postpartum BMD at lumbar spine.

Preliminary analysis indicates that random intercept is sufficient to account for the correlation in the postpartum BMD at lumbar spine and that on average it declined in the first 4 months, then gradually rebounded to a level close to that at week 2 postpartum. In other words, a parametric function may not be appropriate to describe the pattern of postpartum BMD at lumbar spine. This motivates us to consider variable selection in the following SPMM

$${y}_{ij}=f({t}_{ij})+{\mathbf{x}}_{ij}^{T}\beta +{b}_{i}+{\epsilon}_{ij},$$

(15)

where *y*_{ij} is the postpartum BMD (g/cm²) at lumbar spine for the *i*th woman at time *t*_{ij}, **x**_{ij} is the vector of candidate covariates, *b*_{i} is a subject-specific random intercept, and *ε*_{ij} is the measurement error.

We apply the DPL procedure to model (15), and the following variables are selected: dietary calcium intake, menstruation status, prolactin, PTH, baseline age and infant’s birth weight. This finding is consistent with those in the epidemiology literature. The estimates of the corresponding regression coefficients as well as their estimated standard errors are presented in Table 4. It is seen from this table that, after adjusting for the time effect, all selected variables are positively associated with the postpartum BMD at lumbar spine except baseline age. The estimated nonparametric function *f*(*t*) and its 95% pointwise confidence bands given by the frequentist and Bayesian variance estimates are presented in Figure 3. We can see that the BMD at lumbar spine decreases in the first six months, starts increasing from the seventh month, and reaches a level comparable to the baseline at the end of the study. The frequentist and Bayesian methods give almost identical standard errors, as shown in Figure 3.

In this paper we have proposed a new double penalized likelihood (DPL) approach for selecting important parametric fixed effects in semiparametric mixed models (SPMMs) for longitudinal data. The DPL is equipped with two penalty terms: the roughness penalty for *f*(*t*) and a shrinkage penalty for the fixed covariate effects **β**. Maximizing the DPL leads to a parsimonious model and a smoothing spline fit for *f*(*t*) simultaneously.

The proposed method is for selecting important parametric fixed effects in an SPMM. In the future we hope to conduct variance component selection as well. We also plan to extend our DPL methodology to generalized semiparametric mixed models to incorporate other types of endpoints and more likelihood structures. We also hope to derive theoretical properties of the estimators. As pointed out by a referee, Zou and Li (2008) recently proposed the local linear approximation (LLA) algorithm and a one-step LLA approach for solving penalized log-likelihood problems, which potentially can be used in place of LQA to obtain computationally and statistically more efficient DPL estimators.

The research of D. Zhang was supported by National Institutes of Health grant R01 CA85848-08. The research of H. Zhang was supported in part by National Science Foundation grant DMS-0645293 and by National Institutes of Health grant R01 CA-085848. The authors thank the editor, the associate editor and two referees for their constructive comments that improved the article.

- Chen K, Jin ZZ. Partial linear regression models for clustered data. Journal of the American Statistical Association. 2006;101:195–204.
- Craven P, Wahba G. Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik. 1979;31:377–403.
- Diggle PJ, Heagerty P, Liang KY, Zeger SL. Analysis of longitudinal data. 2 Oxford University Press; Oxford, U. K: 2002.
- Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360.
- Fan J, Li R. New estimation and model selection procedures for semiparametric modeling in longitudinal data analysis. Journal of the American Statistical Association. 2004;99:710–723.
- Green PJ. Penalized likelihood for general semi-parametric regression models. International Statistical Review. 1987;55:245–260.
- Green PJ, Silverman BW. Nonparametric regression and generalized linear models. Chapman and Hall; London: 1994.
- Harville DA. Maximum likelihood approaches to variance component estimation and to related problems. Journal of the American Statistical Association. 1977;72:320–340.
- He X, Zhu Z, Fung WK. Estimation in a semiparametric model for longitudinal data with unspecified dependence structure. Biometrika. 2002;89:579–590.
- Henderson CR. Best linear unbiased estimation and prediction under a selection model. Biometrics. 1975;31:423–447. [PubMed]
- Kohn R, Ansley C, Tharm D. The performance of cross-validation and maximum likelihood estimators of spline smoothing parameter. Journal of the American Statistical Association. 1991;86:1042–1050.
- Laird NM, Ware JH. Random-effects models for longitudinal data. Biometrics. 1982;38:963–974. [PubMed]
- Lin X, Zhang D. Inference in generalized additive mixed models by using smoothing splines. Journal of the Royal Statistical Society, Ser B. 1999;61:381–400.
- Lu Y. Technical report, Department of Statistics. University of Wisconsin-Madison; 2006. Contributions to functional data analysis with biological applications.
- Lu Y, Zhang C. Technical Report 1126, Department of Statistics. University of Wisconsin-Madison; 2006. Spatially adaptive functional linear regression via functional smooth lasso.
- Ruppert D, Wand MP, Carroll RJ. Semiparametric regression. Cambridge University Press; Cambridge, New York: 2003.
- Schwarz G. Estimating the dimension of a model. Annals of Statistics. 1978;6:461–464.
- Sowers M, Corton G, Shapiro B, Jannausch M, Crutchfield M, Smith M, Randolph J, Hollis B. Changes in bone density with lactation. Journal of the American Medical Association. 1993;169:3130–3135. [PubMed]
- Speed T. Discussion of ‘BLUP is a good thing: the estimation of random effects’ by G. K. Robinson. Statistical Science. 1991;6:15–51.
- Verbeke G, Molenberghs G. Linear mixed models for longitudinal data. Springer-Verlag; New York: 2000.
- Wahba G. Bayesian ‘confidence intervals’ for the cross-validated smoothing spline. Journal of the Royal Statistical Society, Ser B. 1983;45:133–150.
- Wahba G. A comparison of gcv and gml for choosing the smoothing parameter in the generalized spline smoothing problem. Annals of Statistics. 1985;13:1378–1402.
- Wang H, Li R, Tsai CL. Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika. 2007;94:553–568. [PMC free article] [PubMed]
- Wang YD. Mixed effects smoothing spline analysis of variance. Journal of the Royal Statistical Society, Ser B. 1998;60:159–174.
- Zhang D, Lin X, Raz J, Sowers M. Semiparametric stochastic mixed models for longitudinal data. Journal of the American Statistical Association. 1998;93:710–719.
- Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Ser B. 2005;67:301–320.
- Zou H, Li R. One-step sparse estimates in nonconcave penalized likelihood models. (with discussion) Annals of Statistics. 2008;36:1509–1533. [PMC free article] [PubMed]
