In a longitudinal study, suppose that the primary endpoint is the time to a specific event. This response variable, however, may be censored by an independent censoring variable or by the occurrence of one of several dependent competing events. For each study subject, a set of baseline covariates is collected. The question is how to construct a reliable prediction rule for the future subject’s profile of all competing risks of interest at a specific time point for risk-benefit decision makings. In this paper, we propose a two-stage procedure to make inferences about such subject-specific profiles. For the first step, we use a parametric model to obtain a univariate risk index score system. We then estimate consistently the average competing risks for subjects which have the same parametric index score via a nonparametric function estimation procedure. We illustrate this new proposal with the data from a randomized clinical trial for evaluating the efficacy of a treatment for prostate cancer. The primary endpoint for this study was the time to prostate cancer death, but had two types of dependent competing events, one from cardiovascular death and the other from death of other causes.
Local likelihood function; Nonparametric function estimation; Perturbation-resampling method; Risk index score
The complexity of semiparametric models poses new challenges to statistical inference and model selection that frequently arise from real applications. In this work, we propose new estimation and variable selection procedures for the semiparametric varying-coefficient partially linear model. We first study quantile regression estimates for the nonparametric varying-coefficient functions and the parametric regression coefficients. To achieve nice efficiency properties, we further develop a semiparametric composite quantile regression procedure. We establish the asymptotic normality of proposed estimators for both the parametric and nonparametric parts and show that the estimators achieve the best convergence rate. Moreover, we show that the proposed method is much more efficient than the least-squares-based method for many non-normal errors and that it only loses a small amount of efficiency for normal errors. In addition, it is shown that the loss in efficiency is at most 11.1% for estimating varying coefficient functions and is no greater than 13.6% for estimating parametric components. To achieve sparsity with high-dimensional covariates, we propose adaptive penalization methods for variable selection in the semiparametric varying-coefficient partially linear model and prove that the methods possess the oracle property. Extensive Monte Carlo simulation studies are conducted to examine the finite-sample performance of the proposed procedures. Finally, we apply the new methods to analyze the plasma beta-carotene level data.
Asymptotic relative efficiency; composite quantile regression; semiparametric varying-coefficient partially linear model; oracle properties; variable selection
Subgroup analysis arises in clinical trials research when we wish to estimate a treatment effect on a specific subgroup of the population distinguished by baseline characteristics. Many trial designs induce latent subgroups such that subgroup membership is observable in one arm of the trial and unidentified in the other. This occurs, for example, in oncology trials when a biopsy or dissection is performed only on subjects randomized to active treatment. We discuss a general framework to estimate a biological treatment effect on the latent subgroup of interest when the survival outcome is right-censored and can be appropriately modelled as a parametric function of covariate effects. Our framework builds on the application of instrumental variables methods to all-or-none treatment noncompliance. We derive a computational method to estimate model parameters via the EM algorithm and provide guidance on its implementation in standard software packages. The research is illustrated through an analysis of a seminal melanoma trial that proposed a new standard of care for the disease and involved a biopsy that is available only on patients in the treatment arm.
survival analysis; accelerated failure time model; treatment noncompliance; mixture model; EM algorithm
The primary goal of a randomized clinical trial is to make comparisons among two or more treatments. For example, in a two-arm trial with continuous response, the focus may be on the difference in treatment means; with more than two treatments, the comparison may be based on pairwise differences. With binary outcomes, pairwise odds-ratios or log-odds ratios may be used. In general, comparisons may be based on meaningful parameters in a relevant statistical model. Standard analyses for estimation and testing in this context typically are based on the data collected on response and treatment assignment only. In many trials, auxiliary baseline covariate information may also be available, and it is of interest to exploit these data to improve the efficiency of inferences. Taking a semiparametric theory perspective, we propose a broadly-applicable approach to adjustment for auxiliary covariates to achieve more efficient estimators and tests for treatment parameters in the analysis of randomized clinical trials. Simulations and applications demonstrate the performance of the methods.
Covariate adjustment; Hypothesis test; k-arm trial; Kruskal-Wallis test; Log-odds ratio; Longitudinal data; Semiparametric theory
Semiparametric linear transformation models have received much attention due to its high flexibility in modeling survival data. A useful estimating equation procedure was recently proposed by Chen et al. (2002) for linear transformation models to jointly estimate parametric and nonparametric terms. They showed that this procedure can yield a consistent and robust estimator. However, the problem of variable selection for linear transformation models is less studied, partially because a convenient loss function is not readily available under this context. In this paper, we propose a simple yet powerful approach to achieve both sparse and consistent estimation for linear transformation models. The main idea is to derive a profiled score from the estimating equation of Chen et al. (2002), construct a loss function based on the profile scored and its variance, and then minimize the loss subject to some shrinkage penalty. Under regularity conditions, we have shown that the resulting estimator is consistent for both model estimation and variable selection. Furthermore, the estimated parametric terms are asymptotically normal and can achieve higher efficiency than that yielded from the estimation equations. For computation, we suggest a one-step approximation algorithm which can take advantage of the LARS and build the entire solution path efficiently. Performance of the new procedure is illustrated through numerous simulations and real examples including one microarray data.
Censored survival data; Linear transformation models; LARS; Shrinkage; Variable selection
The hazard ratio provides a natural target for assessing a treatment effect with survival data, with the Cox proportional hazards model providing a widely used special case. In general, the hazard ratio is a function of time and provides a visual display of the temporal pattern of the treatment effect. A variety of nonproportional hazards models have been proposed in the literature. However, available methods for flexibly estimating a possibly time-dependent hazard ratio are limited. Here, we investigate a semiparametric model that allows a wide range of time-varying hazard ratio shapes. Point estimates as well as pointwise confidence intervals and simultaneous confidence bands of the hazard ratio function are established under this model. The average hazard ratio function is also studied to assess the cumulative treatment effect. We illustrate corresponding inference procedures using coronary heart disease data from the Women's Health Initiative estrogen plus progestin clinical trial.
Clinical trial; Empirical process; Gaussian process; Hazard ratio; Simultaneous inference; Survival analysis; Treatment–time interaction
We propose a double-penalized likelihood approach for simultaneous model selection and estimation in semiparametric mixed models for longitudinal data. Two types of penalties are jointly imposed on the ordinary log-likelihood: the roughness penalty on the nonparametric baseline function and a nonconcave shrinkage penalty on linear coefficients to achieve model sparsity. Compared to existing estimation equation based approaches, our procedure provides valid inference for data with missing at random, and will be more efficient if the specified model is correct. Another advantage of the new procedure is its easy computation for both regression components and variance parameters. We show that the double penalized problem can be conveniently reformulated into a linear mixed model framework, so that existing software can be directly used to implement our method. For the purpose of model inference, we derive both frequentist and Bayesian variance estimation for estimated parametric and nonparametric components. Simulation is used to evaluate and compare the performance of our method to the existing ones. We then apply the new method to a real data set from a lactation study.
Correlated data; Gaussian stochastic process; Linear mixed models; Smoothly clipped absolute deviation; Smoothing splines
We propose a family of regression models to adjust for nonrandom dropouts in the analysis of longitudinal outcomes with fully observed covariates. The approach conceptually focuses on generalized linear models with random effects. A novel formulation of a shared random effects model is presented and shown to provide a dropout selection parameter with a meaningful interpretation. The proposed semiparametric and parametric models are made part of a sensitivity analysis to delineate the range of inferences consistent with observed data. Concerns about model identifiability are addressed by fixing some model parameters to construct functional estimators that are used as the basis of a global sensitivity test for parameter contrasts. Our simulation studies demonstrate a large reduction of bias for the semiparametric model relatively to the parametric model at times where the dropout rate is high or the dropout model is misspecified. The methodology’s practical utility is illustrated in a data analysis.
Exponential family distribution; Functional estimators; Global sensitivity analysis; Informative dropout; Infimum/Supremum statistic; Nonparametric mixture; Uniform convergence; non-identifiable models
There has been great interest in developing nonlinear structural equation models and associated statistical inference procedures, including estimation and model selection methods. In this paper a general semiparametric structural equation model (SSEM) is developed in which the structural equation is composed of nonparametric functions of exogenous latent variables and fixed covariates on a set of latent endogenous variables. A basis representation is used to approximate these nonparametric functions in the structural equation and the Bayesian Lasso method coupled with a Markov Chain Monte Carlo (MCMC) algorithm is used for simultaneous estimation and model selection. The proposed method is illustrated using a simulation study and data from the Affective Dynamics and Individual Differences (ADID) study. Results demonstrate that our method can accurately estimate the unknown parameters and correctly identify the true underlying model.
Bayesian Lasso; Latent variable; Spline; Structural equation model
In a prospective cohort study, information on clinical parameters, tests and molecular markers is often collected. Such information is useful to predict patient prognosis and to select patients for targeted therapy. We propose a new graphical approach, the positive predictive value (PPV) curve, to quantify the predictive accuracy of prognostic markers measured on a continuous scale with censored failure time outcome. The proposed method highlights the need to consider both predictive values and the marker distribution in the population when evaluating a marker, and it provides a common scale for comparing different markers. We consider both semiparametric and nonparametric based estimating procedures. In addition, we provide asymptotic distribution theory and resampling based procedures for making statistical inference. We illustrate our approach with numerical studies and datasets from the Seattle Heart Failure Study.
Prognostic accuracy; Positive predictive value; Survival analysis
This article describes a class of heteroscedastic generalized linear regression models in which a subset of the regression parameters are rescaled nonparametrically, and develops efficient semiparametric inferences for the parametric components of the models. Such models provide a means to adapt for heterogeneity in the data due to varying exposures, varying levels of aggregation, and so on. The class of models considered includes generalized partially linear models and nonparametrically scaled link function models as special cases. We present an algorithm to estimate the scale function nonparametrically, and obtain asymptotic distribution theory for regression parameter estimates. In particular, we establish that the asymptotic covariance of the semiparametric estimator for the parametric part of the model achieves the semiparametric lower bound. We also describe bootstrap-based goodness-of-scale test. We illustrate the methodology with simulations, published data, and data from collaborative research on ultrasound safety.
Generalized linear regression; Heteroscedasticity; Nonparametric regression; Partially linear model; Semiparametric efficiency; Varying-coefficient model
To compare two samples of censored data, we propose a unified semiparametric inference for the parameter of interest when the model for one sample is parametric and that for the other is nonparametric. The parameter of interest may represent, for example, a comparison of means, or survival probabilities. The confidence interval derived from the semiparametric inference, which is based on the empirical likelihood principle, improves its counterpart constructed from the common estimating equation. The empirical likelihood ratio is shown to be asymptotically chi-squared. Simulation experiments illustrate that the method based on the empirical likelihood substantially outperforms the method based on the estimating equation. A real dataset is analysed.
Estimating equation; Confidence interval; Coverage; Kaplan-Meier estimation; Empirical likelihood ratio; Empirical likelihood function
Motivated by an analysis of a real data set in ecology, we consider a class of partially nonlinear models where both of a nonparametric component and a parametric component present. We develop two new estimation procedures to estimate the parameters in the parametric component. Consistency and asymptotic normality of the resulting estimators are established. We further propose an estimation procedure and a generalized F test procedure for the nonparametric component in the partially nonlinear models. Asymptotic properties of the newly proposed estimation procedure and the test statistic are derived. Finite sample performance of the proposed inference procedures are assessed by Monte Carlo simulation studies. An application in ecology is used to illustrate the proposed methods.
Local linear regression; partial linear models; profile least squares; semiparametric models
Clinical demand for individualized “adaptive” treatment policies in diverse fields has spawned development of clinical trial methodology for their experimental evaluation via multistage designs, building upon methods intended for the analysis of naturalistically observed strategies. Because often there is no need to parametrically smooth multistage trial data (in contrast to observational data for adaptive strategies), it is possible to establish direct connections among different methodological approaches. We show by algebraic proof that the maximum likelihood (ML) and optimal semiparametric (SP) estimators of the population mean of the outcome of a treatment policy and its standard error are equal under certain experimental conditions. This result is used to develop a unified and efficient approach to design and inference for multistage trials of policies that adapt treatment according to discrete responses. We derive a sample size formula expressed in terms of a parametric version of the optimal SP population variance. Nonparametric (sample-based) ML estimation performed well in simulation studies, in terms of achieved power, for scenarios most likely to occur in real studies, even though sample sizes were based on the parametric formula. ML outperformed the SP estimator; differences in achieved power predominately reflected differences in their estimates of the population mean (rather than estimated standard errors). Neither methodology could mitigate the potential for overestimated sample sizes when strong nonlinearity was purposely simulated for certain discrete outcomes; however, such departures from linearity may not be an issue for many clinical contexts that make evaluation of competitive treatment policies meaningful.
Adaptive treatment strategy; Efficient SP estimation; Maximum likelihood; Multi-stage design; Sample size formula
In this work, we provide a new class of frailty-based competing risks models for clustered failure times data. This class is based on expanding the competing risks model of Prentice et al. (1978, Biometrics 34, 541–554) to incorporate frailty variates, with the use of cause-specific proportional hazards frailty models for all the causes. Parametric and nonparametric maximum likelihood estimators are proposed. The main advantages of the proposed class of models, in contrast to the existing models, are: (1) the inclusion of covariates; (2) the flexible structure of the dependency among the various types of failure times within a cluster; and (3) the unspecified within-subject dependency structure. The proposed estimation procedures produce the most efficient parametric and semiparametric estimators and are easy to implement. Simulation studies show that the proposed methods perform very well in practical situations.
Competing risks; Frailty model; Multivariate survival analysis; Nonparametric maximum likelihood estimator
The two-stage design is popular in epidemiology studies and clinical trials due to its cost effectiveness. Typically, the first stage sample contains cheaper and possibly biased information, while the second stage validation sample consists of a subset of subjects with accurate and complete information. In this paper, we study estimation of a survival function with right-censored survival data from a two-stage design. A non-parametric estimator is derived by combining data from both stages. We also study its large sample properties and derive pointwise and simultaneous confidence intervals for the survival function. The proposed estimator effectively reduces the variance and finite-sample bias of the Kaplan–Meier estimator solely based on the second stage validation sample. Finally, we apply our method to a real data set from a medical device post-marketing surveillance study.
censoring; Kaplan–Meier estimator; martingale; Nelson–Aalen estimator; truncation
Linear mixed effects (LME) models are useful for longitudinal data/repeated measurements. We propose a new class of covariate-adjusted LME models for longitudinal data that nonparametrically adjusts for a normalizing covariate. The proposed approach involves fitting a parametric LME model to the data after adjusting for the nonparametric effects of a baseline confounding covariate. In particular, the effect of the observable covariate on the response and predictors of the LME model is modeled nonparametrically via smooth unknown functions. In addition to covariate-adjusted estimation of fixed/population parameters and random effects, an estimation procedure for the variance components is also developed. Numerical properties of the proposed estimators are investigated with simulation studies. The consistency and convergence rates of the proposed estimators are also established. An application to a longitudinal data set on calcium absorption, accounting for baseline distortion from body mass index, illustrates the proposed methodology.
Binning; Covariance structure; Covariate-adjusted regression (CAR); Longitudinal data; Mixed model; Multiplicative effect; Varying coefficient models
In the analysis of cluster data the regression coefficients are frequently assumed to be the same across all clusters. This hampers the ability to study the varying impacts of factors on each cluster. In this paper, a semiparametric model is introduced to account for varying impacts of factors over clusters by using cluster-level covariates. It achieves the parsimony of parametrization and allows the explorations of nonlinear interactions. The random effect in the semiparametric model accounts also for within cluster correlation. Local linear based estimation procedure is proposed for estimating functional coefficients, residual variance, and within cluster correlation matrix. The asymptotic properties of the proposed estimators are established and the method for constructing simultaneous confidence bands are proposed and studied. In addition, relevant hypothesis testing problems are addressed. Simulation studies are carried out to demonstrate the methodological power of the proposed methods in the finite sample. The proposed model and methods are used to analyse the second birth interval in Bangladesh, leading to some interesting findings.
Varying-coefficient models; local linear modelling; cluster level variable; cluster effect
Improving efficiency for regression coefficients and predicting trajectories of individuals are two important aspects in analysis of longitudinal data. Both involve estimation of the covariance function. Yet, challenges arise in estimating the covariance function of longitudinal data collected at irregular time points. A class of semiparametric models for the covariance function is proposed by imposing a parametric correlation structure while allowing a nonparametric variance function. A kernel estimator is developed for the estimation of the nonparametric variance function. Two methods, a quasi-likelihood approach and a minimum generalized variance method, are proposed for estimating parameters in the correlation structure. We introduce a semiparametric varying coefficient partially linear model for longitudinal data and propose an estimation procedure for model coefficients by using a profile weighted least squares approach. Sampling properties of the proposed estimation procedures are studied and asymptotic normality of the resulting estimators is established. Finite sample performance of the proposed procedures is assessed by Monte Carlo simulation studies. The proposed methodology is illustrated by an analysis of a real data example.
Kernel regression; local linear regression; profile weighted least squares; semiparametric varying coefficient model
In this article, the authors consider a semiparametric additive hazards regression model for right-censored data that allows some censoring indicators to be missing at random. They develop a class of estimating equations and use an inverse probability weighted approach to estimate the regression parameters. Nonparametric smoothing techniques are employed to estimate the probability of non-missingness and the conditional probability of an uncensored observation. The asymptotic properties of the resulting estimators are derived. Simulation studies show that the proposed estimators perform well. They motivate and illustrate their methods with data from a brain cancer clinical trial.
Additive hazards model; censoring; kernel smoother; missing at random; weighted estimating equation
In this article we study a semiparametric generalized partially linear model when the covariates are missing at random. We propose combining local linear regression with the local quasilikelihood technique and weighted estimating equation (WEE) to estimate the parameters and nonparameters when the missing probability is known or unknown. We establish normality of the estimators of the parameter and asymptotic expansion for the estimators of the nonparametric part. We apply the proposed models and methods to a study of the relation between virologic and immunologic responses in AIDS clinical trials, in which virologic response is classified into binary variables. We also give simulation results to illustrate our approach.
AIDS clinical trial; completely missing at random; local linear; local quasilikelihood; missing at random; nonignorable; penalized quasilikelihood; weighted estimating equation
Clinical trials that randomize subjects to decision algorithms, which adapt treatments over time according to individual response, have gained considerable interest as investigators seek designs that directly inform clinical decision making. We consider designs in which subjects are randomized sequentially at decision points, among adaptive treatment options under evaluation. We present a sequential method to estimate the comparative effects of the randomized adaptive treatments, which are formalized as adaptive treatment strategies. Our causal estimators are derived using Bayesian predictive inference. We use analytical and empirical calculations to compare the predictive estimators to (i) the ‘standard’ approach that allocates the sequentially obtained data to separate strategy-specific groups as would arise from randomizing subjects at baseline; (ii) the semi-parametric approach of marginal mean models that, under appropriate experimental conditions, provides the same sequential estimator of causal differences as the proposed approach. Simulation studies demonstrate that sequential causal inference offers substantial efficiency gains over the standard approach to comparing treatments, because the predictive estimators can take advantage of the monotone structure of shared data among adaptive strategies. We further demonstrate that the semi-parametric asymptotic variances, which are marginal ‘one-step’ estimators, may exhibit significant bias, in contrast to the predictive variances. We show that the conditions under which the sequential method is attractive relative to the other two approaches are those most likely to occur in real studies.
adaptive treatment strategy; sequential causal inference; multi-stage decision problems; sequential randomization; individualized treatments
We develop methods for competing risks analysis when individual event times are correlated within clusters. Clustering arises naturally in clinical genetic studies and other settings. We develop a nonparametric estimator of cumulative incidence, and obtain robust pointwise standard errors that account for within-cluster correlation. We modify the two-sample Gray and Pepe–Mori tests for correlated competing risks data, and propose a simple two-sample test of the difference in cumulative incidence at a landmark time. In simulation studies, our estimators are asymptotically unbiased, and the modified test statistics control the type I error. The power of the respective two-sample tests is differentially sensitive to the degree of correlation; the optimal test depends on the alternative hypothesis of interest and the within-cluster correlation. For purposes of illustration, we apply our methods to a family-based prospective cohort study of hereditary breast/ovarian cancer families. For women with BRCA1 mutations, we estimate the cumulative incidence of breast cancer in the presence of competing mortality from ovarian cancer, accounting for significant within-family correlation.
BRCA1 gene; Breast neoplasm; Competing risks; Correlated survival data; Counting processes; Robust variance
In this paper, we are concerned with how to select significant variables in semiparametric modeling. Variable selection for semiparametric regression models consists of two components: model selection for nonparametric components and select significant variables for parametric portion. Thus, it is much more challenging than that for parametric models such as linear models and generalized linear models because traditional variable selection procedures including stepwise regression and the best subset selection require model selection to nonparametric components for each submodel. This leads to very heavy computational burden. In this paper, we propose a class of variable selection procedures for semiparametric regression models using nonconcave penalized likelihood. The newly proposed procedures are distinguished from the traditional ones in that they delete insignificant variables and estimate the coefficients of significant variables simultaneously. This allows us to establish the sampling properties of the resulting estimate. We first establish the rate of convergence of the resulting estimate. With proper choices of penalty functions and regularization parameters, we then establish the asymptotic normality of the resulting estimate, and further demonstrate that the proposed procedures perform as well as an oracle procedure. Semiparametric generalized likelihood ratio test is proposed to select significant variables in the nonparametric component. We investigate the asymptotic behavior of the proposed test and demonstrate its limiting null distribution follows a chi-squared distribution, which is independent of the nuisance parameters. Extensive Monte Carlo simulation studies are conducted to examine the finite sample performance of the proposed variable selection procedures.
Nonconcave penalized likelihood; SCAD; efficient score; local linear regression; partially linear model; varying coefficient models
The analysis of longitudinal repeated measures data is frequently complicated by missing data due to informative dropout. We describe a mixture model for joint distribution for longitudinal repeated measures, where the dropout distribution may be continuous and the dependence between response and dropout is semiparametric. Specifically, we assume that responses follow a varying coefficient random effects model conditional on dropout time, where the regression coefficients depend on dropout time through unspecified nonparametric functions that are estimated using step functions when dropout time is discrete (e.g., for panel data) and using smoothing splines when dropout time is continuous. Inference under the proposed semiparametric model is hence more robust than the parametric conditional linear model. The unconditional distribution of the repeated measures is a mixture over the dropout distribution. We show that estimation in the semiparametric varying coefficient mixture model can proceed by fitting a parametric mixed effects model and can be carried out on standard software platforms such as SAS. The model is used to analyze data from a recent AIDS clinical trial and its performance is evaluated using simulations.
Clinical trials; Equivalence trial; Linear mixed model; Missing data; Nonignorable dropout; Pattern-mixture model; Pediatric AIDS; Selection bias; Smoothing splines