The Cox proportional hazards regression model has become the traditional choice for modeling survival data in medical studies. To introduce flexibility into the Cox model, several smoothing methods may be applied, and approaches based on splines are the most frequently considered in this context. To better understand the effects that each continuous covariate has on the outcome, results can be expressed in terms of spline-based hazard ratio (HR) curves, taking a specific covariate value as reference. Despite the potential advantages of using spline smoothing methods in survival analysis, there is currently no analytical method in the R software to choose the optimal degrees of freedom in multivariable Cox models (with two or more nonlinear covariate effects). This paper describes an R package, called smoothHR, that allows the computation of pointwise estimates of the HRs, and their corresponding confidence limits, for continuous predictors introduced nonlinearly. In addition, the package provides functions for automatically choosing the degrees of freedom in multivariable Cox models. The package is available from the R homepage. We illustrate the use of the key functions of the smoothHR package using data from a study on breast cancer and data on acute coronary syndrome, from Galicia, Spain.
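A minimal sketch of an HR curve taken relative to a reference covariate value may help here. It is written in Python rather than R and does not use the smoothHR API. The function names, knots, and coefficients are hypothetical, and a cubic truncated power basis stands in for whatever spline basis the fitted model uses. The only relation it illustrates is HR(x) = exp(f(x) - f(x_ref)), where f is the smooth log-hazard term.

```python
import math

def truncated_power_basis(x, knots):
    """Cubic truncated power basis without intercept: x, x^2, x^3, (x - k)_+^3."""
    return [x, x ** 2, x ** 3] + [max(x - k, 0.0) ** 3 for k in knots]

def log_hazard(x, coefs, knots):
    """Smooth log-hazard contribution f(x) = sum_j coef_j * basis_j(x)."""
    return sum(c * b for c, b in zip(coefs, truncated_power_basis(x, knots)))

def hazard_ratio(x, x_ref, coefs, knots):
    """HR of covariate value x relative to the reference value x_ref."""
    return math.exp(log_hazard(x, coefs, knots) - log_hazard(x_ref, coefs, knots))

# Hypothetical knots and coefficients, for illustration only (not fitted values).
KNOTS = [40.0, 60.0]
COEFS = [0.02, -0.0005, 0.000004, 0.00001, -0.00002]

# HR curve over ages 30..80 with age 50 as the reference value.
hr_curve = [(age, hazard_ratio(age, 50.0, COEFS, KNOTS)) for age in range(30, 81, 10)]
```

By construction the curve equals 1 at the reference value, so changing the reference rescales the curve without changing its shape on the log scale.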
The assumption of proportional hazards (PH) fundamental to the Cox PH model sometimes may not hold in practice. In this paper, we propose a generalization of the Cox PH model in terms of the cumulative hazard function taking a form similar to the Cox PH model, with the extension that the baseline cumulative hazard function is raised to a power function. Our model allows for interaction between covariates and the baseline hazard and it also includes, for the two-sample problem, the case of two Weibull distributions and two extreme value distributions differing in both scale and shape parameters. The partial likelihood approach cannot be applied here to estimate the model parameters. We use the full likelihood approach via a cubic B-spline approximation for the baseline hazard to estimate the model parameters. A semi-automatic procedure for knot selection based on Akaike’s Information Criterion is developed. We illustrate the applicability of our approach using real-life data.
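To make the idea of a baseline cumulative hazard raised to a covariate-dependent power concrete, here is a small sketch. The exact parameterisation is an assumption, since the abstract does not give the formula. It takes H(t | x) = exp(beta * x) * H0(t) ** exp(gamma * x) with a single binary covariate and a Weibull baseline, and all parameter values are hypothetical.

```python
import math

def weibull_baseline(t, scale=2.0, shape=1.5):
    """Weibull cumulative hazard H0(t) = (t / scale) ** shape (illustrative values)."""
    return (t / scale) ** shape

def cumulative_hazard(t, x, beta, gamma, baseline=weibull_baseline):
    """Assumed form: H(t | x) = exp(beta * x) * H0(t) ** exp(gamma * x)."""
    return math.exp(beta * x) * baseline(t) ** math.exp(gamma * x)

def cumulative_hazard_ratio(t, beta, gamma):
    """Ratio of cumulative hazards for group x = 1 versus x = 0 at time t."""
    return cumulative_hazard(t, 1, beta, gamma) / cumulative_hazard(t, 0, beta, gamma)

# gamma = 0 recovers proportional hazards: the ratio is exp(beta) at every t.
ph_ratios = [cumulative_hazard_ratio(t, beta=0.5, gamma=0.0) for t in (1.0, 5.0)]

# gamma != 0 changes the Weibull shape in group 1, so the ratio varies with t.
non_ph_ratios = [cumulative_hazard_ratio(t, beta=0.5, gamma=0.3) for t in (1.0, 5.0)]
```

Under this assumed form, group 1 is again Weibull but with shape multiplied by exp(gamma), which is how the two-sample case with differing scale and shape arises.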
censored survival data analysis; crossing hazards; Frailty model; maximum likelihood; regression; spline function; Akaike information criterion; Weibull distribution; extreme value distribution
Maternal and fetal characteristics are important determinants of fetal growth potential, and should ideally be taken into consideration when evaluating fetal growth variation. We developed a model for individually customised growth charts for estimated fetal weight, which takes into account physiological maternal and fetal characteristics known at the start of pregnancy. We used fetal ultrasound data of 8,162 pregnant women participating in the Generation R Study, a prospective, population-based cohort study from early pregnancy onwards. A repeated measurements regression model was constructed, using backward selection procedures for identifying relevant maternal and fetal characteristics. The final model for estimating expected fetal weight included gestational age, fetal sex, parity, ethnicity, maternal age, height and weight. Using this model, we developed individually customised growth charts, and their corresponding standard deviations, for fetal weight from 18 weeks onwards. Of the total of 495 fetuses who were classified as small size for gestational age (<10th percentile) when fetal weight was evaluated using the normal population growth chart, 80 (16%) were in the normal range when individually customised growth charts were used. 550 fetuses were classified as small size for gestational age using individually customised growth charts, and 135 of them (25%) were classified as normal if the unadjusted reference chart was used. In conclusion, this is the first study using ultrasound measurements in a large population-based study to fit a model to construct individually customised growth charts, taking into account physiological maternal and fetal characteristics. These charts might be useful for use in epidemiological studies and in clinical practice.
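As a rough sketch of how a customised chart can be used to classify a fetus, the snippet below converts an observed weight into a percentile from a model-based expected weight and standard deviation. It assumes normality on the weight scale, which the abstract does not state, and the function names and example weights are hypothetical.

```python
from statistics import NormalDist

def customised_percentile(observed_weight, expected_weight, sd):
    """Percentile of the observed weight under an assumed normal reference."""
    z = (observed_weight - expected_weight) / sd
    return 100.0 * NormalDist().cdf(z)

def is_small_for_gestational_age(observed_weight, expected_weight, sd, cutoff=10.0):
    """Flag weights below the cutoff percentile (10th by default)."""
    return customised_percentile(observed_weight, expected_weight, sd) < cutoff

# Hypothetical example: expected weight 1800 g with SD 200 g.
flag_low = is_small_for_gestational_age(1500.0, 1800.0, 200.0)   # z = -1.5
flag_mid = is_small_for_gestational_age(1700.0, 1800.0, 200.0)   # z = -0.5
```

The same observed weight can fall on either side of the cutoff depending on the expected weight, which is why the customised and population charts reclassify some fetuses.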
Electronic supplementary material
The online version of this article (doi:10.1007/s10654-011-9629-7) contains supplementary material, which is available to authorized users.
Customised fetal growth curves; Ultrasound; Fetal weight; Biometry; Ethnicity; Maternal anthropometrics
In this work, we propose penalized spline based methods for functional mixed effects models with varying coefficients. We decompose longitudinal outcomes as a sum of several terms: a population mean function, covariates with time-varying coefficients, functional subject-specific random effects and residual measurement error processes. Using penalized splines, we propose nonparametric estimation of the population mean function, the varying coefficients, the random subject-specific curves, the associated covariance function, which represents between-subject variation, and the variance function of the residual measurement errors, which represents within-subject variation. The proposed methods offer flexible estimation of both the population-level and subject-level curves. In addition, decomposing the variability of the outcomes into between-subject and within-subject sources is useful for identifying the dominant variance component and therefore for modeling the covariance function optimally. We use a likelihood based method to select multiple smoothing parameters. Furthermore, we study the asymptotics of the baseline P-spline estimator with longitudinal data. We conduct simulation studies to investigate the performance of the proposed methods. The benefit of the between- and within-subject covariance decomposition is illustrated through an analysis of Berkeley growth data where we identified clearly distinct patterns of the between- and within-subject covariance functions of children's heights. We also apply the proposed methods to estimate the effect of anti-hypertensive treatment from the Framingham Heart Study data.
Multi-level functional data; Functional random effects; Semiparametric longitudinal data analysis
One of the major issues in expression profiling analysis is still how to define proper thresholds for determining differential expression while avoiding false positives. The problem is that the variance is inversely proportional to the log of signal intensities. To address this issue, we describe a model, expression variation (EV), based on the LMS method, which allows data normalization and the construction of confidence bands of gene expression by fitting cubic spline curves to the Box–Cox transformation. The confidence bands, fitted to the actual variance of the data, include the genes devoid of significant variation and allow EVs to be calculated from the confidence bandwidth. Each outlier is positioned according to the dispersion space (DS), and a P-value is calculated to determine EV. This model results in variance stabilization. Using two Affymetrix-generated datasets, the sets of differentially expressed genes selected using EV and other classical methods were compared. The analysis suggests that EV is more robust in variance stabilization and in selecting differential expression from both rare and strongly expressed genes.
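For readers unfamiliar with the LMS method, the sketch below shows the generic LMS transformation that turns an observation into a z-score from a Box–Cox power L, median M, and coefficient of variation S. It is the standard LMS formula rather than the EV model itself; in practice L, M, and S would be smooth curves of signal intensity, whereas here they are plain hypothetical numbers.

```python
import math

def lms_z_score(y, L, M, S):
    """Generic LMS z-score: ((y / M) ** L - 1) / (L * S), or log(y / M) / S when L = 0."""
    if y <= 0 or M <= 0 or S <= 0:
        raise ValueError("y, M and S must be positive")
    if abs(L) < 1e-12:
        # Limiting case of the Box-Cox transformation as L tends to 0.
        return math.log(y / M) / S
    return ((y / M) ** L - 1.0) / (L * S)

# Hypothetical values: an observation 10% above the median with S = 0.1.
z_linear = lms_z_score(110.0, L=1.0, M=100.0, S=0.1)
z_log = lms_z_score(110.0, L=0.0, M=100.0, S=0.1)
```

A confidence band at a chosen level then corresponds to a fixed range of z, mapped back to the original scale.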
We previously developed a flexible specification of the UNAIDS Estimation and Projection Package (EPP) that relied on splines to generate time-varying values for the force of infection parameter. Here, we test the feasibility of this approach for concentrated HIV/AIDS epidemics with very sparse data and compare two methods for making short-term future projections with the spline-based model.
Penalised B-splines are used to model the average infection risk over time within the EPP 2011 modelling framework, which includes antiretroviral treatment effects and CD4 cell count progression, and is fit to sentinel surveillance prevalence data with a Bayesian algorithm. We compare two approaches for future projections: (1) an informative prior related to equilibrium prevalence and (2) a random walk formulation.
The spline-based model produced plausible fits across a range of epidemics, which included 87 subpopulations from 14 countries with concentrated epidemics and 75 subpopulations from 33 countries with generalised epidemics. The equilibrium prior and random walk approaches to future projections yielded similar prevalence estimates, and both performed well in tests of out-of-sample predictive validity for prevalence. In contrast, in some cases the two approaches varied substantially in estimates of incidence, with the random walk formulation avoiding extreme changes in incidence.
A spline-based approach to allowing the force of infection parameter to vary over time within EPP 2011 is robust across a diverse array of epidemics, including concentrated ones with limited surveillance data. Future work on the EPP model should consider the impact that different modelling approaches have on estimates of HIV incidence.
HIV; Surveillance; Mathematical Model
Wind field analysis from synthetic aperture radar images allows the estimation of wind direction and speed based on image descriptors. In this paper, we propose a framework to automate wind direction retrieval based on wavelet decomposition associated with spectral processing. We extend existing undecimated wavelet transform approaches by including the à trous algorithm with a B3 spline scaling function, in addition to other wavelet bases such as Gabor and Mexican-hat. The purpose is to extract more reliable directional information when wind speed values range from 5 to 10 m s−1. Using C-band empirical models, associated with the estimated directional information, we calculate local wind speed values and compare our results with QuikSCAT scatterometer data. The proposed approach has potential application in the evaluation of oil spills and wind farms.
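The à trous algorithm with a B3 spline scaling function can be sketched in one dimension as follows. This is a simplified illustration rather than the authors' implementation: SAR images would need the separable 2-D version, and the periodic boundary handling and function names are assumptions made for brevity.

```python
# B3 spline smoothing kernel used by the a trous ("with holes") algorithm.
B3_KERNEL = [1 / 16, 4 / 16, 6 / 16, 4 / 16, 1 / 16]

def atrous_smooth(signal, level):
    """Smooth with the B3 kernel whose taps are spaced 2**level samples apart."""
    step = 2 ** level
    n = len(signal)
    return [
        sum(w * signal[(i + (k - 2) * step) % n] for k, w in enumerate(B3_KERNEL))
        for i in range(n)
    ]

def atrous_decompose(signal, levels):
    """Return (wavelet planes, final smooth approximation); nothing is decimated."""
    approx = [float(v) for v in signal]
    planes = []
    for level in range(levels):
        smoother = atrous_smooth(approx, level)
        # Each wavelet plane is the detail lost between successive smoothings.
        planes.append([a - s for a, s in zip(approx, smoother)])
        approx = smoother
    return planes, approx

def atrous_reconstruct(planes, approx):
    """Reconstruction is the sum of all wavelet planes and the final approximation."""
    return [a + sum(p[i] for p in planes) for i, a in enumerate(approx)]

signal = [0.0, 1.0, 4.0, 9.0, 16.0, 9.0, 4.0, 1.0]
planes, approx = atrous_decompose(signal, levels=3)
restored = atrous_reconstruct(planes, approx)
```

Because every plane keeps the full sample count, the transform is shift-invariant, which is one reason undecimated transforms suit the extraction of directional features.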
SAR; wind direction; FFT; CMOD4; wind speed
Nonparametric regression models are proposed in the framework of ecological inference for exploratory modeling of disease prevalence rates adjusted for variables, such as age, ethnicity/race, and socio-economic status. Ecological inference is needed when a response variable and covariate are not available at the subject level because only summary statistics are available for the reporting unit, for example, in the form of R × C tables. In this article, only the marginal counts are assumed available in the sample of R × C contingency tables for modeling the joint distribution of counts. A general form for the ecological regression model is proposed, whereby certain covariates are included as a varying coefficient regression model, whereas others are included as a functional linear model. The nonparametric regression curves are modeled as splines fit by penalized weighted least squares. A data-driven selection of the smoothing parameter is proposed using the pointwise maximum squared bias computed from averaging kernels (explained by O’Sullivan, 1986, Statistical Science 1, 502–517). Analytic expressions for bias and variance are provided that could be used to study the rates of convergence of the estimators. Instead, this article focuses on demonstrating the utility of the estimators in a study of disparity in health outcomes by ethnicity/race.
Ecological inference; Incomplete R × C tables; P-splines; Randomized response
Fully understanding the determinants and sequelae of fetal growth requires a continuous measure of birth weight adjusted for gestational age. Published United States reference data, however, provide estimates only of the median and lowest and highest 5th and 10th percentiles for birth weight at each gestational age. The purpose of our analysis was to create more continuous reference measures of birth weight for gestational age for use in epidemiologic analyses.
We used data from the most recent nationwide United States Natality datasets to generate multiple reference percentiles of birth weight at each completed week of gestation from 22 through 44 weeks. Gestational age was determined from last menstrual period. We analyzed data from 6,690,717 singleton infants with recorded birth weight and sex born to United States resident mothers in 1999 and 2000.
Birth weight rose with greater gestational age, with increasing slopes during the third trimester and a leveling off beyond 40 weeks. Boys had higher birth weights than girls, later born children higher weights than firstborns, and infants born to non-Hispanic white mothers higher birth weights than those born to non-Hispanic black mothers. These results correspond well with previously published estimates reporting limited percentiles.
Our method provides comprehensive reference values of birth weight at 22 through 44 completed weeks of gestation, derived from broadly based nationwide data. Other approaches require assumptions of normality or of a functional relationship between gestational age and birth weight, which may not be appropriate. These data should prove useful for researchers investigating the predictors and outcomes of altered fetal growth.
MeSH Headings: Birth weight; fetal weight; gestational age; premature birth; ultrasonography
We sought to investigate the effect of serum uric acid (SUA) levels on risk of cancer incidence in men and to flexibly determine the shape of this association by using a novel analytical approach.
A population-based cohort of 78,850 Austrian men who received 264,347 serial SUA measurements was prospectively followed-up for a median of 12.4 years. Data were collected between 1985 and 2003. Penalized splines (P-splines) in extended Cox-type additive hazard regression were used to flexibly model the association between SUA, as a time-dependent covariate, and risk of overall and site-specific cancer incidence and to calculate adjusted hazard ratios with their 95% confidence intervals.
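The penalty behind a common P-spline formulation can be sketched briefly. The snippet assumes the difference penalty of Eilers and Marx on adjacent B-spline coefficients; the abstract does not specify the penalty order, so order 2 is an assumption, and the function names are hypothetical.

```python
def differences(coefs, order):
    """Apply the difference operator `order` times to a coefficient sequence."""
    out = list(coefs)
    for _ in range(order):
        out = [b - a for a, b in zip(out, out[1:])]
    return out

def difference_penalty(coefs, lam, order=2):
    """P-spline roughness penalty: lam * sum of squared order-th differences."""
    return lam * sum(d * d for d in differences(coefs, order))

# Coefficients on a straight line incur no second-order penalty ...
linear_penalty = difference_penalty([1.0, 2.0, 3.0, 4.0, 5.0], lam=10.0)
# ... whereas an oscillating sequence is penalised heavily.
wiggly_penalty = difference_penalty([0.0, 1.0, 0.0, 1.0, 0.0], lam=1.0)
```

The smoothing parameter lam trades fidelity against smoothness; estimating it by restricted maximum likelihood treats the penalty as the variance structure of a random effect.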
During follow-up 5189 incident cancers were observed. Restricted maximum-likelihood optimizing P-spline models revealed a moderately J-shaped effect of SUA on risk of overall cancer incidence, with statistically significantly increased hazard ratios in the upper third of the SUA distribution. Increased SUA (≥8.00 mg/dL) further significantly increased risk for several site-specific malignancies, with P-spline analyses providing detailed insight about the shape of the association with these outcomes.
Our study is the first to demonstrate a dose–response association between SUA and cancer incidence in men, simultaneously reporting on the usefulness of a novel methodological framework in epidemiologic research.
Cancer incidence; Epidemiology; Extended Cox-type additive hazard regression; Men; Penalized splines; Risk factor; Serum uric acid
Sex differences in fetal growth have been reported, but how this happens remains to be described. It is unknown if fetal growth rates, a reflection of genetic and environmental factors, express sexually dimorphic sensitivity to the mother herself.
This analysis investigated homogeneity of male and female growth responses to maternal height and weight. The study sample included 3495 uncomplicated singleton pregnancies followed longitudinally. Analytic models regressed fetal and neonatal weight on tertiles of maternal height and weight, and modification by sex was investigated (n=1814 males, n=1681 females) with birth gestational age, maternal parity and smoking as covariates.
Sex modified the effects of maternal height and weight on fetal growth rates and birth weight. Among boys, the tallest maternal height influenced fetal weight growth prior to 18 gestational weeks of age (p=0.006), and pre-pregnancy maternal weight and BMI subsequently had an influence (p<0.001); this was not found among girls. Additionally, interaction terms between sex, maternal height, and maternal weight identified that males were more sensitive to maternal weight among shorter mothers (p=0.003), and more responsive to maternal height among lighter mothers (p≤0.03), compared to females. Likewise, neonatal birth weight dimorphism varied by maternal phenotype. A male advantage of 60 grams occurred among neonates of the shortest and lightest mothers (p=0.08), compared to 150 and 191 grams among short and heavy mothers, and tall and light weight mothers, respectively (p=0.01). Sex differences in response to maternal size are underappreciated sources of variation in fetal growth studies and may reflect differential growth strategies.
Maternal anthropometry; fetal growth rate; birth weight; sexual dimorphism; pregnancy
For functional neuroimaging studies that involve experimental stimuli measuring dose levels, e.g. of an anesthetic agent, typical statistical techniques include correlation analysis, analysis of variance or polynomial regression models. These standard approaches have limitations: correlation analysis only provides a crude estimate of the linear relationship between dose levels and brain activity; ANOVA is designed to accommodate a few specified dose levels; polynomial regression models have limited capacity to model varying patterns of association between dose levels and measured activity across the brain. These shortcomings prompt the need to develop methods that more effectively capture dose-dependent neural processing responses. We propose a class of mixed effects spline models that analyze the dose-dependent effect using either regression or smoothing splines. Our method offers flexible accommodation of different response patterns across various brain regions, controls for potential confounding factors, and accounts for subject variability in brain function. The estimates from the mixed effects spline model can be readily incorporated into secondary analyses, for instance, targeting spatial classifications of brain regions according to their modeled response profiles. The proposed spline models are also extended to incorporate interaction effects between the dose-dependent response function and other factors. We illustrate our proposed statistical methodology using data from a PET study of the effect of ethanol on brain function. A simulation study is conducted to compare the performance of the proposed mixed effects spline models and a polynomial regression model. Results show that the proposed spline models more accurately capture varying response patterns across voxels, especially at voxels with complex response shapes. 
Finally, the proposed spline models can be used in more general settings as a flexible modeling tool for investigating the effects of any continuous covariates on neural processing responses.
Regression splines; Smoothing splines; Dose-dependent effect; Mixed effects spline models; Continuous covariates
Semiparametric additive partial linear models, containing both linear and nonlinear additive components, are more flexible compared to linear models, and they are more efficient compared to general nonparametric regression models because they reduce the problem known as “curse of dimensionality”. In this paper, we propose a new estimation approach for these models, in which we use polynomial splines to approximate the additive nonparametric components and we derive the asymptotic normality for the resulting estimators of the parameters. We also develop a variable selection procedure to identify significant linear components using the smoothly clipped absolute deviation penalty (SCAD), and we show that the SCAD-based estimators of non-zero linear components have an oracle property. Simulations are performed to examine the performance of our approach as compared to several other variable selection methods such as the Bayesian Information Criterion and Least Absolute Shrinkage and Selection Operator (LASSO). The proposed approach is also applied to real data from a nutritional epidemiology study, in which we explore the relationship between plasma beta-carotene levels and personal characteristics (e.g., age, gender, body mass index (BMI), etc.) as well as dietary factors (e.g., alcohol consumption, smoking status, intake of cholesterol, etc.).
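The SCAD penalty applied to each linear coefficient can be written down directly. The sketch below uses the standard piecewise definition with the conventional default a = 3.7; the function name is hypothetical, and the choice of the tuning parameter lam is left to the caller.

```python
def scad_penalty(theta, lam, a=3.7):
    """Smoothly clipped absolute deviation penalty for one coefficient (a > 2)."""
    t = abs(theta)
    if t <= lam:
        # Small coefficients: same as the LASSO penalty.
        return lam * t
    if t <= a * lam:
        # Intermediate coefficients: quadratic transition.
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    # Large coefficients: constant penalty, hence no extra shrinkage.
    return lam * lam * (a + 1) / 2

small = scad_penalty(0.5, lam=1.0)
large = scad_penalty(10.0, lam=1.0)
```

The flat region for large coefficients is what distinguishes SCAD from LASSO: large effects are left nearly unbiased, which underlies the oracle property.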
BIC; LASSO; penalized likelihood; regression spline; SCAD
The accurate characterization of spike firing rates including the determination of when changes in activity occur is a fundamental issue in the analysis of neurophysiological data. Here we describe a state-space model for estimating the spike rate function that provides a maximum likelihood estimate of the spike rate, model goodness-of-fit assessments, as well as confidence intervals for the spike rate function and any other associated quantities of interest. Using simulated spike data, we first compare the performance of the state-space approach with that of Bayesian adaptive regression splines (BARS) and a simple cubic spline smoothing algorithm. We show that the state-space model is computationally efficient and comparable with other spline approaches. Our results suggest both a theoretically sound and practical approach for estimating spike rate functions that is applicable to a wide range of neurophysiological data.
Mathematical models for revealing the dynamics and interaction properties of biological systems play an important role in computational systems biology. The inference of model parameter values from time-course data can be considered as a "reverse engineering" process and is still one of the most challenging tasks. Many parameter estimation methods have been developed, but none of them is effective in all cases or outperforms all other approaches. Instead, each method has its own advantages and disadvantages. It is worthwhile to develop parameter estimation methods that are robust against noise, computationally efficient, and flexible enough to meet different constraints.
Two parameter estimation methods that combine spline theory with Linear Programming (LP) and Nonlinear Programming (NLP) are developed. These methods remove the need for ODE solvers during the identification process. Our analysis shows that the augmented cost function surfaces used in the two proposed methods are smoother, which can ease the search for optima and hence enhance the robustness and speed of the search algorithm. Moreover, the cores of our algorithms are LP and NLP based, which makes them flexible, so additional constraints can be embedded or removed easily. Eight systems biology models are used for testing the proposed approaches. Our results confirm that the proposed methods are both efficient and robust.
The proposed approaches have general application to identify unknown parameter values of a wide range of systems biology models.
Cox models with time-varying coefficients offer great flexibility in capturing the temporal dynamics of covariate effects on right censored failure times. Since not all covariate coefficients are time-varying, model selection for such models presents an additional challenge, which is to distinguish covariates with time-varying coefficients from those with time-independent coefficients. We propose an adaptive group lasso method that not only selects important variables but also selects between time-independent and time-varying specifications of their presence in the model. Each covariate effect is partitioned into a time-independent part and a time-varying part, the latter of which is characterized by a group of coefficients of basis splines without intercept. Model selection and estimation are carried out through a fast, iterative group shooting algorithm. Our approach is shown to have good properties in a simulation study that mimics realistic situations with up to 20 variables. A real example illustrates the utility of the method.
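The selection mechanism can be illustrated with the group soft-thresholding operator, a generic building block of group lasso algorithms. This is not the authors' group shooting algorithm; the function names, the adaptive weight argument, and the numerical values are hypothetical.

```python
import math

def group_soft_threshold(group, lam, weight=1.0):
    """Shrink a coefficient group towards zero; zero it entirely if its norm is small."""
    threshold = lam * weight          # adaptive lasso: weight differs per group
    norm = math.sqrt(sum(v * v for v in group))
    if norm <= threshold:
        return [0.0] * len(group)
    scale = 1.0 - threshold / norm
    return [scale * v for v in group]

def effect_type(constant_part, varying_group):
    """Classify a covariate from its time-independent part and its spline group."""
    if any(v != 0.0 for v in varying_group):
        return "time-varying"
    return "time-independent" if constant_part != 0.0 else "excluded"

# Hypothetical unpenalised estimates for one covariate.
varying = group_soft_threshold([0.3, -0.2, 0.1], lam=0.5)   # norm < 0.5, so zeroed
kept = group_soft_threshold([3.0, 4.0], lam=2.5)            # norm 5 shrinks to 2.5
label = effect_type(constant_part=0.8, varying_group=varying)
```

Because the whole spline group is removed or kept together, a covariate whose time-varying group is zeroed falls back to a constant effect, which is the distinction the method is designed to make.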
B-spline; Group lasso; Varying-coefficient
We consider Bayesian inference in semiparametric mixed models (SPMMs) for longitudinal data. SPMMs are a class of models that use a nonparametric function to model a time effect, a parametric function to model other covariate effects, and parametric or nonparametric random effects to account for the within-subject correlation. We model the nonparametric function using a Bayesian formulation of a cubic smoothing spline, and the random effect distribution using a normal distribution and alternatively a nonparametric Dirichlet process (DP) prior. When the random effect distribution is assumed to be normal, we propose a uniform shrinkage prior (USP) for the variance components and the smoothing parameter. When the random effect distribution is modeled nonparametrically, we use a DP prior with a normal base measure and propose a USP for the hyperparameters of the DP base measure. We argue that the commonly assumed DP prior implies a nonzero mean of the random effect distribution, even when a base measure with mean zero is specified. This implies weak identifiability for the fixed effects, and can therefore lead to biased estimators and poor inference for the regression coefficients and the spline estimator of the nonparametric function. We propose an adjustment using a postprocessing technique. We show that under mild conditions the posterior is proper under the proposed USP, a flat prior for the fixed effect parameters, and an improper prior for the residual variance. We illustrate the proposed approach using a longitudinal hormone dataset, and carry out extensive simulation studies to compare its finite sample performance with existing methods.
Dirichlet process prior; Identifiability; Postprocessing; Random effects; Smoothing spline; Uniform shrinkage prior; Variance components
In this paper we develop a new framework for path planning of flexible needles with bevel tips. Based on a stochastic model of needle steering, the probability density function for the needle tip pose is approximated as a Gaussian. The means and covariances are estimated using an error propagation algorithm which has second order accuracy. Then we adapt the path-of-probability (POP) algorithm to path planning of flexible needles with bevel tips. We demonstrate how our planning algorithm can be used for feedback control of flexible needles. We also derive a closed-form solution for the port placement problem for finding good insertion locations for flexible needles in the case when there are no obstacles. Furthermore, we propose a new method using reference splines with the POP algorithm to solve the path planning problem for flexible needles in more general cases that include obstacles.
flexible needles; path planning; stochastic model; path-of-probability algorithm; error propagation; port placement; feedback control
To determine the extent to which fetal weight during mid-pregnancy and fetal weight gain from mid-pregnancy to birth predict adiposity and blood pressure (BP) at age 3 years.
Among 438 children in the Project Viva cohort, we estimated fetal weight at 16–20 (median 18) weeks gestation using ultrasound biometry measures. We analyzed fetal weight gain as change in quartile of weight from the second trimester until birth, and we measured height, weight, subscapular and triceps skinfold thicknesses and BP at age 3.
Mean (SD) estimated weight at 16–20 weeks was 234 (30) grams and birth weight was 3518 (420) grams. In adjusted models, weight estimated during the second trimester and at birth were associated with higher BMI z-scores at age 3 years (0.32 units [95% C.I. 0.04, 0.60] and 0.53 units [95% C.I. 0.24, 0.81] for the highest v. lowest quartile of weight). Infants with more rapid fetal weight gain and those who remained large from mid-pregnancy to birth had higher BMI z-scores (0.85 units [95% C.I. 0.30, 1.39] and 0.63 units [95% C.I. 0.17, 1.09], respectively) at age 3 than infants who remained small during fetal life. We did not find associations between our main predictors and sum or ratio of subscapular and triceps skinfold thicknesses or systolic BP.
More rapid fetal weight gain and persistently high fetal weight during the second half of gestation predicted higher BMI z-score at age 3 years. The rate of fetal weight gain throughout pregnancy may be important for future risk of adiposity in childhood.
childhood blood pressure; cohort
For dairy producers, a reliable description of lactation curves is a valuable tool for management and selection. From a breeding and production viewpoint, milk yield persistency and total milk yield are important traits. Understanding the genetic drivers for the phenotypic variation of both these traits could provide a means for improving these traits in commercial production.
It has been shown that Natural Cubic Smoothing Splines (NCSS) can model the features of lactation curves with greater flexibility than the traditional parametric methods. NCSS were used to model the sire effect on the lactation curves of cows. The sire solutions for persistency and total milk yield were derived using NCSS and a whole-genome approach based on a hierarchical model was developed for a large association study using single nucleotide polymorphisms (SNP).
Estimated sire breeding values (EBV) for persistency and milk yield were calculated using NCSS. Persistency EBV were correlated with peak yield but not with total milk yield. Several SNP were found to be associated with both traits and these were used to identify candidate genes for further investigation.
NCSS can be used to estimate EBV for lactation persistency and total milk yield, which in turn can be used in whole-genome association studies.
Functional principal components (FPC) analysis is widely used to decompose and express functional observations. Curve estimates implicitly condition on basis functions and other quantities derived from FPC decompositions; however these objects are unknown in practice. In this article, we propose a method for obtaining correct curve estimates by accounting for uncertainty in FPC decompositions. Additionally, pointwise and simultaneous confidence intervals that account for both model- and decomposition-based variability are constructed. Standard mixed model representations of functional expansions are used to construct curve estimates and variances conditional on a specific decomposition. Iterated expectation and variance formulas combine model-based conditional estimates across the distribution of decompositions. A bootstrap procedure is implemented to understand the uncertainty in principal component decomposition quantities. Our method compares favorably to competing approaches in simulation studies that include both densely and sparsely observed functions. We apply our method to sparse observations of CD4 cell counts and to dense white-matter tract profiles. Code for the analyses and simulations is publicly available, and our method is implemented in the R package refund on CRAN.
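The iterated expectation and variance step can be sketched for a single point on a curve. The snippet assumes each bootstrap decomposition yields a conditional estimate and a conditional variance; it is not the refund implementation, and the function name and numbers are hypothetical.

```python
from statistics import fmean, pvariance

def combine_across_decompositions(conditional_means, conditional_variances):
    """Combine per-decomposition results with the laws of total expectation and variance.

    E[Y] = E[E[Y | D]] and Var[Y] = E[Var[Y | D]] + Var[E[Y | D]],
    where D ranges over the bootstrap decompositions.
    """
    estimate = fmean(conditional_means)
    model_based = fmean(conditional_variances)          # average within-decomposition variance
    decomposition_based = pvariance(conditional_means)  # spread of estimates across decompositions
    return estimate, model_based + decomposition_based

# Hypothetical results from three bootstrap decompositions at one time point.
estimate, total_variance = combine_across_decompositions([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

Ignoring the second term amounts to conditioning on one decomposition, which gives intervals that are too narrow when the decomposition itself is uncertain.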
Bootstrap; Functional principal components analysis; Iterated expectation and variance; Simultaneous bands
Kinetic analysis is used to extract metabolic information from dynamic positron emission tomography (PET) uptake data. The theory of indicator dilutions, developed in the seminal work of Meier and Zierler (1954), provides a probabilistic framework for representing PET tracer uptake data as a convolution between an arterial input function and a tissue residue. The residue is a scaled survival function associated with tracer residence in the tissue. Nonparametric inference for the residue, a deconvolution problem, provides a novel approach to kinetic analysis, critically one that is not reliant on specific compartmental modeling assumptions. A practical computational technique based on regularized cubic B-spline approximation of the residence time distribution is proposed. Nonparametric residue analysis allows specific parametric models to be evaluated formally; this evaluation needs to account properly for the increased flexibility of the nonparametric estimator. The methodology is illustrated using data from a series of cerebral studies with PET and fluorodeoxyglucose (FDG) in normal subjects. Comparisons are made between key functionals of the residue (tracer flux, flow, etc.) resulting from a parametric analysis (the standard two-compartment model of Phelps et al. 1979) and a nonparametric analysis. Strong statistical evidence against the compartment model is found. The differences relate primarily to the representation of the early temporal structure of the tracer residence, which is largely a function of the vascular supply network. There are convincing physiological arguments against the representations implied by the compartmental approach, but this is the first time that a rigorous statistical confirmation using PET data has been reported. The compartmental analysis produces suspect values for flow but, notably, the impact on the metabolic flux, though statistically significant, is limited to deviations on the order of 3%–4%. The general advantage of the nonparametric residue analysis is its ability to provide a valid kinetic quantitation in studies where there may be heterogeneity or other uncertainty about the accuracy of a compartmental model approximation of the tissue residue.
Deconvolution; Functional inference; Kinetic analysis; Regularization
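The convolution representation in the preceding abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the uniform time grid, the shape of the arterial input, the exponential survival function, and the names `tissue_curve`, `flow`, and `mean_residence`. The sketch shows only the forward model (input convolved with a scaled survival function); the nonparametric deconvolution and the B-spline regularization are not implemented.

```python
import math


def tissue_curve(input_fn, residue, dt):
    """Riemann-sum approximation of C_T(t_i) = integral of C_p(s) * R(t_i - s) ds
    on a uniform grid with spacing dt."""
    n = len(input_fn)
    return [
        dt * sum(input_fn[j] * residue[i - j] for j in range(i + 1))
        for i in range(n)
    ]


dt = 0.1
times = [i * dt for i in range(301)]

# Hypothetical arterial input function: rises from zero, then decays.
arterial_input = [t * math.exp(-t / 2.0) for t in times]

# Hypothetical residue: a scale factor times a survival function.
# An exponential survival function is chosen purely for simplicity.
flow = 0.5
mean_residence = 10.0
residue = [flow * math.exp(-t / mean_residence) for t in times]

uptake = tissue_curve(arterial_input, residue, dt)
```

Because the residue is a scaled survival function, it starts at the scale factor and never increases; any estimate of it would be expected to respect the same shape constraint.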
The semiparametric partially linear model allows flexible modeling of covariate effects on the response variable in regression. It combines the flexibility of nonparametric regression with the parsimony of linear regression. The most important assumption in existing estimation methods for this model is that it is known a priori which covariates have a linear effect and which do not. In applied work, however, this is rarely known in advance. We consider the problem of estimation in partially linear models without assuming a priori which covariates have linear effects. We propose a semiparametric regression pursuit method for identifying the covariates with a linear effect. Our proposed method is a penalized regression approach using a group minimax concave penalty. Under suitable conditions we show that the proposed approach is model-pursuit consistent, meaning that, with high probability, it correctly determines which covariates have a linear effect and which do not. The performance of the proposed method is evaluated using simulation studies, which support our theoretical results. A real data example is used to illustrate the application of the proposed method.
Group selection; Minimax concave penalty; Model-pursuit consistency; Penalized regression; Semiparametric models
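As a hedged illustration of the penalty named in the preceding abstract, the sketch below evaluates the standard minimax concave penalty (MCP) and applies it to the Euclidean norm of each coefficient group. The function names and the grouping are hypothetical. The sketch also omits any group-size weighting of the tuning parameter, which the paper may or may not use. One plausible use, assumed here and not stated in the abstract, is to penalize the group of nonlinear (spline) coefficients of each covariate, so that a group shrunk to zero marks a linear effect.

```python
import math


def mcp(t, lam, gamma):
    """Minimax concave penalty at |t| with tuning parameter lam > 0
    and concavity parameter gamma > 1."""
    t = abs(t)
    if t <= gamma * lam:
        return lam * t - t * t / (2.0 * gamma)
    # Beyond gamma * lam the penalty is flat, so large coefficients
    # are not shrunk further.
    return 0.5 * gamma * lam * lam


def group_mcp(groups, lam, gamma):
    """Sum of MCP values applied to the Euclidean norm of each group."""
    return sum(
        mcp(math.sqrt(sum(b * b for b in group)), lam, gamma)
        for group in groups
    )


# Hypothetical coefficient groups: the first is exactly zero,
# the second has norm 5.
penalty = group_mcp([[0.0, 0.0, 0.0], [3.0, 4.0]], lam=1.0, gamma=3.0)
```

The flat region of the penalty is what distinguishes MCP from a group lasso penalty, which keeps growing linearly with the group norm.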
We analyze the Agatston score of coronary artery calcium (CAC) from the Multi-Ethnic Study of Atherosclerosis (MESA) using a semiparametric zero-inflated modeling approach, where the observed CAC scores from this cohort consist of a high frequency of zeros and continuously distributed positive values. Both partially constrained and unconstrained models are considered to investigate the underlying biological processes of CAC development from zero to positive, and from small amounts to large amounts. In contrast to existing studies, a model selection procedure based on likelihood cross-validation is adopted to identify the optimal model, and this choice is justified by comparative Monte Carlo studies. A shrinkage version of the cubic regression spline is used for simultaneous model estimation and variable selection. When applying the proposed methods to the MESA data, we show that the two biological mechanisms, one influencing the initiation of CAC and the other the magnitude of CAC when it is positive, are better characterized by an unconstrained zero-inflated normal model. Our results are significantly different from those in published studies, and may provide further insights into the biological mechanisms underlying CAC development in humans. This highly flexible statistical framework can be applied to zero-inflated data analyses in other areas.
cardiovascular disease; coronary artery calcium; likelihood cross-validation; model selection; penalized spline; proportional constraint; shrinkage
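The two-part structure described in the preceding abstract can be sketched as a log-likelihood. Several details are assumptions for illustration only: that the normal component is placed on the log of the positive scores, that the probability of a positive score and the mean are supplied per subject, and the name `two_part_loglik`. Supplying the two parts separately, with no link between them, corresponds loosely to an unconstrained model; the spline terms and shrinkage penalty are not shown.

```python
import math


def two_part_loglik(y, p_pos, mu, sigma):
    """Log-likelihood of a zero-inflated model on the log scale.

    y     : observed scores (zeros and positive values)
    p_pos : per-subject probability that the score is positive
    mu    : per-subject mean of log(score) given a positive score
    sigma : common standard deviation of log(score) given a positive score
    """
    total = 0.0
    for yi, pi, mi in zip(y, p_pos, mu):
        if yi == 0:
            # Zero part: probability of no detectable calcium.
            total += math.log1p(-pi)
        else:
            # Positive part: probability of being positive times the
            # normal density of log(score).
            z = (math.log(yi) - mi) / sigma
            total += (
                math.log(pi)
                - math.log(sigma)
                - 0.5 * math.log(2.0 * math.pi)
                - 0.5 * z * z
            )
    return total


# Hypothetical data: one zero score and one positive score.
loglik = two_part_loglik([0.0, 1.0], [0.5, 0.5], [0.0, 0.0], 1.0)
```

In a likelihood cross-validation scheme, a function of this kind would be evaluated on held-out subjects using parameters fitted to the remaining data.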
We propose a semiparametric Bayesian method for handling measurement error in nutritional epidemiological data. Our goal is to estimate nonparametrically the form of the association between a disease and an exposure variable when the true values of the exposure are never observed. Motivated by nutritional epidemiological data, we consider the setting where a surrogate covariate is recorded in the primary data, and a calibration data set contains information on the surrogate variable and repeated measurements of an unbiased instrumental variable for the true exposure. We develop a flexible Bayesian method in which both the relationship between the disease and the exposure variable and the relationship between the surrogate and the true exposure are modeled semiparametrically. The two nonparametric functions are modeled simultaneously via B-splines. In addition, we model the distribution of the exposure variable as a Dirichlet process mixture of normal distributions, thus making its modeling essentially nonparametric and placing this work in the context of functional measurement error modeling. We apply our method to the NIH-AARP Diet and Health Study and examine its performance in a simulation study.
B-splines; Dirichlet process prior; Gibbs sampling; Measurement error; Metropolis-Hastings algorithm; Partly linear model
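Since the preceding abstract models its unknown functions via B-splines, a minimal sketch of a cubic B-spline basis evaluated with the Cox-de Boor recursion may help. The knot sequence, the evaluation point, and the function names are hypothetical; the Bayesian fitting, the Dirichlet process mixture, and the measurement error model are not implemented.

```python
def bspline_basis(i, k, knots, x):
    """Value at x of the i-th B-spline of degree k on the given knot
    sequence, computed with the Cox-de Boor recursion."""
    if k == 0:
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    left = 0.0
    right = 0.0
    d_left = knots[i + k] - knots[i]
    if d_left > 0:
        left = (x - knots[i]) / d_left * bspline_basis(i, k - 1, knots, x)
    d_right = knots[i + k + 1] - knots[i + 1]
    if d_right > 0:
        right = (
            (knots[i + k + 1] - x) / d_right
            * bspline_basis(i + 1, k - 1, knots, x)
        )
    return left + right


def design_row(x, knots, degree=3):
    """All basis functions evaluated at x: one row of a spline design matrix."""
    n_basis = len(knots) - degree - 1
    return [bspline_basis(i, degree, knots, x) for i in range(n_basis)]


# Hypothetical clamped knot sequence on [0, 1] with three interior knots.
knots = [0.0, 0.0, 0.0, 0.0, 0.25, 0.5, 0.75, 1.0, 1.0, 1.0, 1.0]
row = design_row(0.4, knots)
```

An unknown function is then represented as a linear combination of these basis functions, and the coefficients are the quantities that receive priors in a Bayesian fit.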