In a longitudinal study, suppose that the primary endpoint is the time to a specific event. This response variable, however, may be censored by an independent censoring variable or by the occurrence of one of several dependent competing events. For each study subject, a set of baseline covariates is collected. The question is how to construct a reliable prediction rule for a future subject’s profile of all competing risks of interest at a specific time point for risk-benefit decision making. In this paper, we propose a two-stage procedure to make inferences about such subject-specific profiles. For the first step, we use a parametric model to obtain a univariate risk index score system. We then consistently estimate the average competing risks for subjects who have the same parametric index score via a nonparametric function estimation procedure. We illustrate this new proposal with the data from a randomized clinical trial for evaluating the efficacy of a treatment for prostate cancer. The primary endpoint for this study was the time to prostate cancer death, but there were two types of dependent competing events, one from cardiovascular death and the other from death from other causes.
Local likelihood function; Nonparametric function estimation; Perturbation-resampling method; Risk index score
To evaluate the biological efficacy of a treatment in a randomized clinical trial, one needs to compare patients in the treatment arm who actually received treatment with the subgroup of patients in the control arm who would have received treatment had they been randomized into the treatment arm. In practice, subgroup membership in the control arm is usually unobservable. This paper develops a nonparametric inference procedure to compare subgroup probabilities with right-censored time-to-event data and unobservable subgroup membership in the control arm. We also present a procedure to estimate the onset and duration of treatment effect. The performance of our method is evaluated by simulation. An illustration is given using a randomized clinical trial for melanoma.
Biological efficacy; Censoring; Counting process; Martingale; Noncompliance; Survival probability
The complexity of semiparametric models poses new challenges to statistical inference and model selection that frequently arise from real applications. In this work, we propose new estimation and variable selection procedures for the semiparametric varying-coefficient partially linear model. We first study quantile regression estimates for the nonparametric varying-coefficient functions and the parametric regression coefficients. To improve efficiency, we further develop a semiparametric composite quantile regression procedure. We establish the asymptotic normality of the proposed estimators for both the parametric and nonparametric parts and show that the estimators achieve the best convergence rate. Moreover, we show that the proposed method is much more efficient than the least-squares-based method for many non-normal errors and that it only loses a small amount of efficiency for normal errors. In addition, it is shown that the loss in efficiency is at most 11.1% for estimating varying coefficient functions and is no greater than 13.6% for estimating parametric components. To achieve sparsity with high-dimensional covariates, we propose adaptive penalization methods for variable selection in the semiparametric varying-coefficient partially linear model and prove that the methods possess the oracle property. Extensive Monte Carlo simulation studies are conducted to examine the finite-sample performance of the proposed procedures. Finally, we apply the new methods to analyze the plasma beta-carotene level data.
Asymptotic relative efficiency; composite quantile regression; semiparametric varying-coefficient partially linear model; oracle properties; variable selection
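The composite quantile idea in the abstract above can be illustrated on a much simpler problem. The following is a minimal sketch, assuming an ordinary one-covariate linear model rather than the varying-coefficient partially linear model of the paper; the function names `check_loss` and `cqr_objective`, the equally spaced quantile levels k/(K+1), and the toy data are illustrative assumptions, not the authors' implementation. It only evaluates the objective that a composite quantile fit would minimize.

```python
def check_loss(u, tau):
    """Quantile check function: rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def cqr_objective(y, x, beta, intercepts):
    """Composite quantile loss for a one-covariate linear model (illustrative).

    A single slope `beta` is shared across K quantile levels
    tau_k = k / (K + 1); each level has its own intercept.
    """
    K = len(intercepts)
    total = 0.0
    for k, b_k in enumerate(intercepts, start=1):
        tau = k / (K + 1)
        total += sum(
            check_loss(yi - b_k - beta * xi, tau) for yi, xi in zip(y, x)
        )
    return total


# Hypothetical toy data lying exactly on y = 1 + x.
x_toy = [0.0, 1.0, 2.0]
y_toy = [1.0, 2.0, 3.0]
loss_value = cqr_objective(y_toy, x_toy, beta=1.0, intercepts=[0.5, 1.0, 1.5])
```

Sharing one slope across several quantile levels is what distinguishes the composite loss from fitting each quantile separately; in practice the objective would be minimized over the slope and all intercepts, for example by linear programming.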
This paper studies the generalized semiparametric regression model for longitudinal data where the covariate effects are constant for some and time-varying for others. Different link functions can be used to allow more flexible modelling of longitudinal data. The nonparametric components of the model are estimated using a local linear estimating equation and the parametric components are estimated through a profile estimating function. The method automatically adjusts for heterogeneity of sampling times, allowing the sampling strategy to depend on the past sampling history as well as possibly time-dependent covariates without specifically modelling such dependence. A K-fold cross-validation bandwidth selection is proposed as a working tool for locating an appropriate bandwidth. A criterion for selecting the link function is proposed to provide better fit of the data. Large sample properties of the proposed estimators are investigated. Large sample pointwise and simultaneous confidence intervals for the regression coefficients are constructed. Formal hypothesis testing procedures are proposed to check for the covariate effects and whether the effects are time-varying. A simulation study is conducted to examine the finite sample performances of the proposed estimation and hypothesis testing procedures. The methods are illustrated with a data example.
Asymptotics; Kernel smoothing; Link function; Sampling adjusted estimation; Testing time-varying effects; Weighted least squares
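K-fold cross-validation for bandwidth selection, as mentioned in the abstract above, can be sketched in a simplified setting. The example below assumes plain scalar regression with a Gaussian kernel and squared prediction error, not the longitudinal estimating-equation setting of the paper; `local_linear`, `kfold_cv_bandwidth`, the interleaved fold assignment, and the noise-free sine data are all hypothetical choices made for illustration.

```python
import math


def local_linear(x0, xs, ys, h):
    """Local linear fit at x0 using a Gaussian kernel with bandwidth h."""
    s0 = s1 = s2 = t0 = t1 = 0.0
    for x, y in zip(xs, ys):
        d = x - x0
        w = math.exp(-0.5 * (d / h) ** 2)
        s0 += w
        s1 += w * d
        s2 += w * d * d
        t0 += w * y
        t1 += w * d * y
    if s0 == 0.0:                      # all weights underflowed: global mean
        return sum(ys) / len(ys)
    denom = s0 * s2 - s1 * s1
    if denom <= 1e-12 * s0 * s2:       # degenerate design: local mean
        return t0 / s0
    # Intercept of the kernel-weighted least-squares line through (x - x0, y).
    return (s2 * t0 - s1 * t1) / denom


def kfold_cv_bandwidth(xs, ys, bandwidths, k=5):
    """Return the candidate bandwidth with the smallest K-fold CV error."""
    n = len(xs)
    folds = [list(range(i, n, k)) for i in range(k)]  # deterministic folds
    best_h, best_err = None, float("inf")
    for h in bandwidths:
        err = 0.0
        for fold in folds:
            held = set(fold)
            tr_x = [xs[i] for i in range(n) if i not in held]
            tr_y = [ys[i] for i in range(n) if i not in held]
            err += sum(
                (ys[i] - local_linear(xs[i], tr_x, tr_y, h)) ** 2 for i in fold
            )
        if err < best_err:
            best_h, best_err = h, err
    return best_h


# Hypothetical noise-free data on a regular grid.
grid = [i / 49 for i in range(50)]
signal = [math.sin(2 * math.pi * x) for x in grid]
candidates = [0.02, 0.1, 0.5]
best_h = kfold_cv_bandwidth(grid, signal, candidates)
```

With longitudinal data the folds would normally be formed by subject rather than by observation, so that all measurements from one subject are held out together.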
To estimate an overall treatment difference with data from a randomized comparative clinical study, baseline covariates are often utilized to increase the estimation precision. Using the standard analysis of covariance technique for making inferences about such an average treatment difference may not be appropriate, especially when the fitted model is nonlinear. On the other hand, the novel augmentation procedure recently studied, for example, by Zhang and others (2008. Improving efficiency of inferences in randomized clinical trials using auxiliary covariates. Biometrics 64, 707–715) is quite flexible. However, in general, it is not clear how to select covariates for augmentation effectively. An overly adjusted estimator may inflate the variance and in some cases be biased. Furthermore, the results from the standard inference procedure, which ignores the sampling variation from the variable selection process, may not be valid. In this paper, we first propose an estimation procedure, which augments the simple treatment contrast estimator directly with covariates. The new proposal is asymptotically equivalent to the aforementioned augmentation method. To select covariates, we utilize the standard lasso procedure. Furthermore, to make valid inference from the resulting lasso-type estimator, a cross validation method is used. The validity of the new proposal is justified theoretically and empirically. We illustrate the procedure extensively with a well-known primary biliary cirrhosis clinical trial data set.
ANCOVA; Cross validation; Efficiency augmentation; Mayo PBC data; Semi-parametric efficiency
This paper studies a semiparametric accelerated failure time mixture model for estimation of a biological treatment effect on a latent subgroup of interest with a time-to-event outcome in randomized clinical trials. Latency is induced because membership is observable in one arm of the trial and unidentified in the other. This method is useful in randomized clinical trials with all-or-none noncompliance when patients in the control arm have no access to active treatment and in, for example, oncology trials when a biopsy used to identify the latent subgroup is performed only on subjects randomized to active treatment. We derive a computational method to estimate model parameters by iterating between an expectation step and a weighted Buckley-James optimization step. The bootstrap method is used for variance estimation, and the performance of our method is corroborated in simulation. We illustrate our method through an analysis of a multicenter selective lymphadenectomy trial for melanoma.
All-or-none non-compliance; Buckley-James estimator; Clinical trials; Competing risks; EM algorithm; Non-proportional hazards model; Treatment efficacy
Subgroup analysis arises in clinical trials research when we wish to estimate a treatment effect on a specific subgroup of the population distinguished by baseline characteristics. Many trial designs induce latent subgroups such that subgroup membership is observable in one arm of the trial and unidentified in the other. This occurs, for example, in oncology trials when a biopsy or dissection is performed only on subjects randomized to active treatment. We discuss a general framework to estimate a biological treatment effect on the latent subgroup of interest when the survival outcome is right-censored and can be appropriately modelled as a parametric function of covariate effects. Our framework builds on the application of instrumental variables methods to all-or-none treatment noncompliance. We derive a computational method to estimate model parameters via the EM algorithm and provide guidance on its implementation in standard software packages. The research is illustrated through an analysis of a seminal melanoma trial that proposed a new standard of care for the disease and involved a biopsy that is available only on patients in the treatment arm.
survival analysis; accelerated failure time model; treatment noncompliance; mixture model; EM algorithm
The primary goal of a randomized clinical trial is to make comparisons among two or more treatments. For example, in a two-arm trial with continuous response, the focus may be on the difference in treatment means; with more than two treatments, the comparison may be based on pairwise differences. With binary outcomes, pairwise odds-ratios or log-odds ratios may be used. In general, comparisons may be based on meaningful parameters in a relevant statistical model. Standard analyses for estimation and testing in this context typically are based on the data collected on response and treatment assignment only. In many trials, auxiliary baseline covariate information may also be available, and it is of interest to exploit these data to improve the efficiency of inferences. Taking a semiparametric theory perspective, we propose a broadly-applicable approach to adjustment for auxiliary covariates to achieve more efficient estimators and tests for treatment parameters in the analysis of randomized clinical trials. Simulations and applications demonstrate the performance of the methods.
Covariate adjustment; Hypothesis test; k-arm trial; Kruskal-Wallis test; Log-odds ratio; Longitudinal data; Semiparametric theory
When comparing a new treatment with a control in a randomized clinical study, the treatment effect is generally assessed by evaluating a summary measure over a specific study population. The success of the trial heavily depends on the choice of such a population. In this paper, we show a systematic, effective way to identify a promising population, for which the new treatment is expected to have a desired benefit, utilizing the data from a current study involving similar comparator treatments. Specifically, using the existing data, we first create a parametric scoring system as a function of multiple baseline covariates to estimate subject-specific treatment differences. Based on this scoring system, we specify a desired level of treatment difference and obtain a subgroup of patients, defined as those whose estimated scores exceed this threshold. An empirically calibrated threshold-specific treatment difference curve across a range of score values is constructed. The subpopulation of patients satisfying any given level of treatment benefit can then be identified accordingly. To avoid bias due to overoptimism, we utilize a cross-training-evaluation method for implementing the above two-step procedure. We then show how to select the best scoring system among all competing models. Furthermore, for cases in which only a single pre-specified working model is involved, inference procedures are proposed for the average treatment difference over a range of score values using the entire data set, and are justified theoretically and numerically. Lastly, the proposals are illustrated with the data from two clinical trials in treating HIV and cardiovascular diseases.
Note that if we are not interested in designing a new study for comparing similar treatments, the new procedure can also be quite useful for the management of future patients, so that treatment may be targeted towards those who would receive nontrivial benefits to compensate for the risk or cost of the new treatment.
Cross-training-evaluation; Lasso procedure; Personalized medicine; Prediction; Ridge regression; Stratified medicine; Subgroup analysis; Variable selection
Semiparametric linear transformation models have received much attention due to their high flexibility in modeling survival data. A useful estimating equation procedure was recently proposed by Chen et al. (2002) for linear transformation models to jointly estimate parametric and nonparametric terms. They showed that this procedure can yield a consistent and robust estimator. However, the problem of variable selection for linear transformation models is less studied, partially because a convenient loss function is not readily available in this context. In this paper, we propose a simple yet powerful approach to achieve both sparse and consistent estimation for linear transformation models. The main idea is to derive a profiled score from the estimating equation of Chen et al. (2002), construct a loss function based on the profiled score and its variance, and then minimize the loss subject to some shrinkage penalty. Under regularity conditions, we have shown that the resulting estimator is consistent for both model estimation and variable selection. Furthermore, the estimated parametric terms are asymptotically normal and can achieve higher efficiency than that yielded from the estimating equations. For computation, we suggest a one-step approximation algorithm which can take advantage of the LARS and build the entire solution path efficiently. Performance of the new procedure is illustrated through numerous simulations and real examples including one microarray data set.
Censored survival data; Linear transformation models; LARS; Shrinkage; Variable selection
Many longitudinal studies involve relating an outcome process to a set of possibly time-varying covariates, giving rise to the usual regression models for longitudinal data. When the purpose of the study is to investigate the covariate effects when the experimental environment undergoes abrupt changes or to locate the periods with different levels of covariate effects, a simple and easy-to-interpret approach is to introduce change-points in regression coefficients. In this connection, we propose a semiparametric change-point regression model, in which the error process (stochastic component) is nonparametric and the baseline mean function (functional part) is completely unspecified, the observation times are allowed to be subject-specific, and the number, locations and magnitudes of change-points are unknown and need to be estimated. We further develop an estimation procedure which combines recent advances in semiparametric analysis based on counting process arguments with multiple change-points inference, and discuss its large sample properties, including consistency and asymptotic normality, under suitable regularity conditions. Simulation results show that the proposed methods work well under a variety of scenarios. An application to a real data set is also given.
Change-points; Counting process; Time-varying coefficient
The hazard ratio provides a natural target for assessing a treatment effect with survival data, with the Cox proportional hazards model providing a widely used special case. In general, the hazard ratio is a function of time and provides a visual display of the temporal pattern of the treatment effect. A variety of nonproportional hazards models have been proposed in the literature. However, available methods for flexibly estimating a possibly time-dependent hazard ratio are limited. Here, we investigate a semiparametric model that allows a wide range of time-varying hazard ratio shapes. Point estimates as well as pointwise confidence intervals and simultaneous confidence bands of the hazard ratio function are established under this model. The average hazard ratio function is also studied to assess the cumulative treatment effect. We illustrate corresponding inference procedures using coronary heart disease data from the Women's Health Initiative estrogen plus progestin clinical trial.
Clinical trial; Empirical process; Gaussian process; Hazard ratio; Simultaneous inference; Survival analysis; Treatment–time interaction
We propose a double-penalized likelihood approach for simultaneous model selection and estimation in semiparametric mixed models for longitudinal data. Two types of penalties are jointly imposed on the ordinary log-likelihood: the roughness penalty on the nonparametric baseline function and a nonconcave shrinkage penalty on linear coefficients to achieve model sparsity. Compared to existing estimating equation based approaches, our procedure provides valid inference for data missing at random, and will be more efficient if the specified model is correct. Another advantage of the new procedure is its easy computation for both regression components and variance parameters. We show that the double penalized problem can be conveniently reformulated into a linear mixed model framework, so that existing software can be directly used to implement our method. For the purpose of model inference, we derive both frequentist and Bayesian variance estimation for estimated parametric and nonparametric components. Simulation is used to evaluate and compare the performance of our method to the existing ones. We then apply the new method to a real data set from a lactation study.
Correlated data; Gaussian stochastic process; Linear mixed models; Smoothly clipped absolute deviation; Smoothing splines
We propose a family of regression models to adjust for nonrandom dropouts in the analysis of longitudinal outcomes with fully observed covariates. The approach conceptually focuses on generalized linear models with random effects. A novel formulation of a shared random effects model is presented and shown to provide a dropout selection parameter with a meaningful interpretation. The proposed semiparametric and parametric models are made part of a sensitivity analysis to delineate the range of inferences consistent with observed data. Concerns about model identifiability are addressed by fixing some model parameters to construct functional estimators that are used as the basis of a global sensitivity test for parameter contrasts. Our simulation studies demonstrate a large reduction of bias for the semiparametric model relative to the parametric model at times where the dropout rate is high or the dropout model is misspecified. The methodology’s practical utility is illustrated in a data analysis.
Exponential family distribution; Functional estimators; Global sensitivity analysis; Informative dropout; Infimum/Supremum statistic; Nonparametric mixture; Uniform convergence; non-identifiable models
There has been great interest in developing nonlinear structural equation models and associated statistical inference procedures, including estimation and model selection methods. In this paper a general semiparametric structural equation model (SSEM) is developed in which the structural equation is composed of nonparametric functions of exogenous latent variables and fixed covariates on a set of latent endogenous variables. A basis representation is used to approximate these nonparametric functions in the structural equation and the Bayesian Lasso method coupled with a Markov Chain Monte Carlo (MCMC) algorithm is used for simultaneous estimation and model selection. The proposed method is illustrated using a simulation study and data from the Affective Dynamics and Individual Differences (ADID) study. Results demonstrate that our method can accurately estimate the unknown parameters and correctly identify the true underlying model.
Bayesian Lasso; Latent variable; Spline; Structural equation model
In a prospective cohort study, information on clinical parameters, tests and molecular markers is often collected. Such information is useful to predict patient prognosis and to select patients for targeted therapy. We propose a new graphical approach, the positive predictive value (PPV) curve, to quantify the predictive accuracy of prognostic markers measured on a continuous scale with censored failure time outcome. The proposed method highlights the need to consider both predictive values and the marker distribution in the population when evaluating a marker, and it provides a common scale for comparing different markers. We consider both semiparametric and nonparametric based estimating procedures. In addition, we provide asymptotic distribution theory and resampling based procedures for making statistical inference. We illustrate our approach with numerical studies and datasets from the Seattle Heart Failure Study.
Prognostic accuracy; Positive predictive value; Survival analysis
Extensive baseline covariate information is routinely collected on
participants in randomized clinical trials, and it is well-recognized that a
proper covariate-adjusted analysis can improve the efficiency of inference on
the treatment effect. However, such covariate adjustment has engendered
considerable controversy, as post hoc selection of covariates
may involve subjectivity and lead to biased inference, while prior specification
of the adjustment may exclude important variables from consideration.
Accordingly, how to select covariates objectively to gain maximal efficiency is
of broad interest. We propose and study the use of modern variable selection
methods for this purpose in the context of a semiparametric framework, under
which variable selection in modeling the relationship between outcome and
covariates is separated from estimation of the treatment effect, circumventing
the potential for selection bias associated with standard analysis of covariance
methods. We demonstrate that such objective variable selection techniques
combined with this framework can identify key variables and lead to unbiased and
efficient inference on the treatment effect. A critical issue in finite samples
is validity of estimators of uncertainty, such as standard errors and confidence
intervals for the treatment effect. We propose an approach to estimating the
sampling variation of the estimated treatment effect and show its superior
performance relative to that of existing methods.
covariate adjustment; false selection rate control; oracle property; semiparametric treatment effect estimation; shrinkage methods; variable selection
We consider frailty models with additive semiparametric covariate effects
for clustered failure time data. We propose a doubly penalized partial
likelihood (DPPL) procedure to estimate the nonparametric functions using
smoothing splines. We show that the DPPL estimators can be obtained from
fitting an augmented working frailty model with parametric covariate effects,
with the nonparametric functions estimated as linear combinations of
fixed and random effects, and the smoothing parameters estimated as extra
variance components. This approach allows us to conveniently estimate all model
components within a unified frailty model framework. We evaluate the finite
sample performance of the proposed method via a simulation study, and apply the
method to analyze data from a study of sexually transmitted infections.
Doubly penalized partial likelihood; smoothing spline; Gaussian frailty; sexually transmitted disease; Smoothing parameter; Variance components
This article describes a class of heteroscedastic generalized linear regression models in which a subset of the regression parameters are rescaled nonparametrically, and develops efficient semiparametric inferences for the parametric components of the models. Such models provide a means to adapt for heterogeneity in the data due to varying exposures, varying levels of aggregation, and so on. The class of models considered includes generalized partially linear models and nonparametrically scaled link function models as special cases. We present an algorithm to estimate the scale function nonparametrically, and obtain asymptotic distribution theory for regression parameter estimates. In particular, we establish that the asymptotic covariance of the semiparametric estimator for the parametric part of the model achieves the semiparametric lower bound. We also describe a bootstrap-based goodness-of-scale test. We illustrate the methodology with simulations, published data, and data from collaborative research on ultrasound safety.
Generalized linear regression; Heteroscedasticity; Nonparametric regression; Partially linear model; Semiparametric efficiency; Varying-coefficient model
Motivated by an analysis of a real data set in ecology, we consider a class of partially nonlinear models in which both a nonparametric component and a parametric component are present. We develop two new estimation procedures to estimate the parameters in the parametric component. Consistency and asymptotic normality of the resulting estimators are established. We further propose an estimation procedure and a generalized F test procedure for the nonparametric component in the partially nonlinear models. Asymptotic properties of the newly proposed estimation procedure and the test statistic are derived. Finite sample performance of the proposed inference procedures is assessed by Monte Carlo simulation studies. An application in ecology is used to illustrate the proposed methods.
Local linear regression; partial linear models; profile least squares; semiparametric models
To compare two samples of censored data, we propose a unified semiparametric inference for the parameter of interest when the model for one sample is parametric and that for the other is nonparametric. The parameter of interest may represent, for example, a comparison of means or of survival probabilities. The confidence interval derived from the semiparametric inference, which is based on the empirical likelihood principle, improves on its counterpart constructed from the common estimating equation. The empirical likelihood ratio is shown to be asymptotically chi-squared. Simulation experiments illustrate that the method based on the empirical likelihood substantially outperforms the method based on the estimating equation. A real dataset is analysed.
Estimating equation; Confidence interval; Coverage; Kaplan-Meier estimation; Empirical likelihood ratio; Empirical likelihood function
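Kaplan-Meier estimation, listed among the keywords above, is the standard nonparametric estimate of a survival function from right-censored data. The following is a minimal sketch of the product-limit formula only; it does not reproduce the empirical likelihood inference of the paper, and the function name `kaplan_meier` and the six-subject toy sample are hypothetical.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct observed event time.

    times  : observed follow-up times
    events : 1 if the event was observed, 0 if right-censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    exits = Counter(times)        # each subject leaves the risk set at its time
    at_risk = len(times)
    surv = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d > 0:
            # Product-limit step: multiply by the conditional survival at t.
            surv *= 1.0 - d / at_risk
            curve.append((t, surv))
        at_risk -= exits[t]       # remove events and censored subjects alike
    return curve


# Hypothetical toy sample: six subjects, two of them censored.
km_curve = kaplan_meier([2, 3, 3, 5, 8, 9], [1, 1, 0, 1, 0, 1])
```

Subjects censored at an event time are counted as still at risk at that time, which is the usual convention for tied event and censoring times.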
Clinical demand for individualized “adaptive” treatment policies in diverse fields has spawned development of clinical trial methodology for their experimental evaluation via multistage designs, building upon methods intended for the analysis of naturalistically observed strategies. Because often there is no need to parametrically smooth multistage trial data (in contrast to observational data for adaptive strategies), it is possible to establish direct connections among different methodological approaches. We show by algebraic proof that the maximum likelihood (ML) and optimal semiparametric (SP) estimators of the population mean of the outcome of a treatment policy and its standard error are equal under certain experimental conditions. This result is used to develop a unified and efficient approach to design and inference for multistage trials of policies that adapt treatment according to discrete responses. We derive a sample size formula expressed in terms of a parametric version of the optimal SP population variance. Nonparametric (sample-based) ML estimation performed well in simulation studies, in terms of achieved power, for scenarios most likely to occur in real studies, even though sample sizes were based on the parametric formula. ML outperformed the SP estimator; differences in achieved power predominately reflected differences in their estimates of the population mean (rather than estimated standard errors). Neither methodology could mitigate the potential for overestimated sample sizes when strong nonlinearity was purposely simulated for certain discrete outcomes; however, such departures from linearity may not be an issue for many clinical contexts that make evaluation of competitive treatment policies meaningful.
Adaptive treatment strategy; Efficient SP estimation; Maximum likelihood; Multi-stage design; Sample size formula
In this work, we provide a new class of frailty-based competing risks models for clustered failure time data. This class is based on expanding the competing risks model of Prentice et al. (1978, Biometrics 34, 541–554) to incorporate frailty variates, with the use of cause-specific proportional hazards frailty models for all the causes. Parametric and nonparametric maximum likelihood estimators are proposed. The main advantages of the proposed class of models, in contrast to the existing models, are: (1) the inclusion of covariates; (2) the flexible structure of the dependency among the various types of failure times within a cluster; and (3) the unspecified within-subject dependency structure. The proposed estimation procedures produce the most efficient parametric and semiparametric estimators and are easy to implement. Simulation studies show that the proposed methods perform very well in practical situations.
Competing risks; Frailty model; Multivariate survival analysis; Nonparametric maximum likelihood estimator
Linear mixed effects (LME) models are useful for longitudinal data/repeated measurements. We propose a new class of covariate-adjusted LME models for longitudinal data that nonparametrically adjusts for a normalizing covariate. The proposed approach involves fitting a parametric LME model to the data after adjusting for the nonparametric effects of a baseline confounding covariate. In particular, the effect of the observable covariate on the response and predictors of the LME model is modeled nonparametrically via smooth unknown functions. In addition to covariate-adjusted estimation of fixed/population parameters and random effects, an estimation procedure for the variance components is also developed. Numerical properties of the proposed estimators are investigated with simulation studies. The consistency and convergence rates of the proposed estimators are also established. An application to a longitudinal data set on calcium absorption, accounting for baseline distortion from body mass index, illustrates the proposed methodology.
Binning; Covariance structure; Covariate-adjusted regression (CAR); Longitudinal data; Mixed model; Multiplicative effect; Varying coefficient models
In the analysis of cluster data the regression coefficients are frequently assumed to be the same across all clusters. This hampers the ability to study the varying impacts of factors on each cluster. In this paper, a semiparametric model is introduced to account for varying impacts of factors over clusters by using cluster-level covariates. It achieves parsimony of parametrization and allows the exploration of nonlinear interactions. The random effect in the semiparametric model also accounts for within-cluster correlation. A local linear based estimation procedure is proposed for estimating functional coefficients, residual variance, and within-cluster correlation matrix. The asymptotic properties of the proposed estimators are established, and a method for constructing simultaneous confidence bands is proposed and studied. In addition, relevant hypothesis testing problems are addressed. Simulation studies are carried out to demonstrate the methodological power of the proposed methods in finite samples. The proposed model and methods are used to analyse the second birth interval in Bangladesh, leading to some interesting findings.
Varying-coefficient models; local linear modelling; cluster level variable; cluster effect