The performance of a biomarker predicting clinical outcome is often evaluated in a large prospective study. Due to the high costs associated with bioassays, investigators need to select a subset of all available patients for biomarker assessment. We consider an outcome- and auxiliary-dependent subsampling (OADS) scheme, in which the probability of selecting a patient into the subset depends on the patient’s clinical outcome and an auxiliary variable. We propose a semiparametric empirical likelihood method to estimate the association between the biomarker and the clinical outcome. Asymptotic properties of the estimator are given. Simulation studies show that the proposed method outperforms alternative methods.
Auxiliary variable; Biomarker; Outcome- and auxiliary-dependent subsampling; Population-based studies; Semiparametric empirical likelihood
How to take advantage of available auxiliary covariate information when the primary covariate of interest is not measured is a frequently encountered question in biomedical studies. In this paper, we consider regression analysis of multivariate failure times in which the primary covariate is assessed only in a validation set, while a continuous auxiliary covariate for it is available for all subjects in the study cohort. Under the framework of the marginal hazard model, we propose to estimate the induced relative risk function in the nonvalidation set through a kernel smoothing method and then obtain an estimated pseudo-partial likelihood function. The proposed estimated pseudo-partial likelihood estimator is shown to be consistent and asymptotically normal. We also give an estimator of the marginal cumulative baseline hazard function. Simulations are conducted to evaluate the finite-sample performance of our proposed estimator. The proposed method is illustrated by analyzing heart disease data from the Studies of Left Ventricular Dysfunction (SOLVD).
Multivariate Failure Times; Auxiliary Covariate; Pseudo-Partial Likelihood; Kernel Smoothing; Validation Sample
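The kernel-smoothing step above can be illustrated in miniature. The following is a hedged sketch, not the authors' implementation: it assumes a scalar auxiliary covariate W observed for everyone, the primary covariate X observed only on a validation set, a Gaussian kernel, and a fixed bandwidth, and it estimates the induced relative risk E[exp(βX) | W = w] by a Nadaraya–Watson average over the validation set.

```python
import numpy as np

def nw_induced_risk(beta, x_val, w_val, w_query, bandwidth):
    """Nadaraya-Watson estimate of E[exp(beta * X) | W = w] using
    validation-set pairs (x_val, w_val), evaluated at each w_query."""
    # Gaussian kernel weights between query points and validation points
    u = (w_query[:, None] - w_val[None, :]) / bandwidth
    k = np.exp(-0.5 * u**2)
    num = k @ np.exp(beta * x_val)   # kernel-weighted sum of relative risks
    den = k.sum(axis=1)              # normalizing kernel mass
    return num / den

# Illustrative simulated data (an assumption, not the SOLVD data)
rng = np.random.default_rng(0)
w = rng.normal(size=500)
x = w + rng.normal(scale=0.1, size=500)   # auxiliary W closely tracks X
est = nw_induced_risk(0.5, x, w, np.array([0.0]), bandwidth=0.2)
```

In the estimated pseudo-partial likelihood, such smoothed values would stand in for exp(βX) for nonvalidation subjects; bandwidth selection is a separate issue not addressed here.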
In many biomedical studies, it is common that, due to budget constraints, the primary covariate is collected only in a randomly selected subset of the full study cohort. Often, an inexpensive auxiliary covariate for the primary exposure variable is readily available for all cohort subjects. Valid statistical methods that make use of the auxiliary information to improve study efficiency need to be developed. To this end, we develop an estimated partial likelihood approach for correlated failure time data with auxiliary information. We assume a marginal hazard model with a common baseline hazard function. The asymptotic properties of the proposed estimators are developed. The proof of the asymptotic results is nontrivial since the moments used in the estimating equations are not martingale-based and classical martingale theory is not sufficient; instead, our proofs rely on modern empirical process theory. The proposed estimator is evaluated through simulation studies and is shown to have increased efficiency compared to existing methods. The proposed methods are illustrated with a data set from the Framingham study.
Marginal hazard model; Correlated failure time; Validation set; Auxiliary covariate
The primary goal of a randomized clinical trial is to make comparisons among two or more treatments. For example, in a two-arm trial with continuous response, the focus may be on the difference in treatment means; with more than two treatments, the comparison may be based on pairwise differences. With binary outcomes, pairwise odds ratios or log-odds ratios may be used. In general, comparisons may be based on meaningful parameters in a relevant statistical model. Standard analyses for estimation and testing in this context are typically based on the data collected on response and treatment assignment only. In many trials, auxiliary baseline covariate information may also be available, and it is of interest to exploit these data to improve the efficiency of inferences. Taking a semiparametric theory perspective, we propose a broadly applicable approach to adjustment for auxiliary covariates to achieve more efficient estimators and tests for treatment parameters in the analysis of randomized clinical trials. Simulations and applications demonstrate the performance of the methods.
Covariate adjustment; Hypothesis test; k-arm trial; Kruskal-Wallis test; Log-odds ratio; Longitudinal data; Semiparametric theory
Outcome-dependent sampling (ODS) study designs are commonly implemented with rare diseases or when prospective studies are infeasible. In longitudinal data settings, when a repeatedly measured binary response is rare, an ODS design can be highly efficient for maximizing statistical information subject to resource limitations that prohibit covariate ascertainment for all observations. This manuscript details an ODS design in which individual observations are sampled with probabilities determined by an inexpensive, time-varying auxiliary variable that is related to, but not equal to, the response. With the goal of validly estimating marginal model parameters based on the resulting biased sample, we propose a semiparametric, sequential offsetted logistic regressions (SOLR) approach. The SOLR strategy first estimates the relationship between the auxiliary variable and the response and covariate data using an offsetted logistic regression analysis, where the offset adjusts for the biased design. Results from the auxiliary variable model are then combined with the known or estimated sampling probabilities to formulate a second offset that corrects for the biased design in the ultimate target model relating the longitudinal binary response to covariates. Because the target model offset is estimated with SOLR, we detail asymptotic standard error estimates that account for uncertainty associated with the auxiliary variable model. Motivated by an analysis of the BioCycle Study (Gaskins et al., Effect of daily fiber intake on reproductive function: the BioCycle Study. American Journal of Clinical Nutrition 2009; 90(4): 1061–1069), which aims to describe the relationship between reproductive health (determined by luteinizing hormone levels) and fiber consumption, we examine properties of SOLR estimators and compare them with other common approaches.
outcome-dependent sampling; biased sampling; study design; generalized estimating equations; longitudinal data analysis; binary data
As biological studies become more expensive to conduct, statistical methods that take advantage of existing auxiliary information about an expensive exposure variable are desirable in practice. Such methods should improve study efficiency and increase statistical power for a given number of assays. In this paper, we consider an inference procedure for multivariate failure time data with auxiliary covariate information. We propose an estimated pseudo-partial likelihood estimator under the marginal hazard model framework and develop the asymptotic properties of the proposed estimator. We conduct simulation studies to evaluate the performance of the proposed method in practical situations and demonstrate the proposed method with a data set from the Studies of Left Ventricular Dysfunction (SOLVD, 1991).
Auxiliary covariate; Marginal hazard model; Multivariate data; Pseudo-partial likelihood; Validation sample
In this article, we propose and explore a multivariate logistic regression model for analyzing multiple binary outcomes with incomplete covariate data where auxiliary information is available. The auxiliary data are extraneous to the regression model of interest but predictive of the covariate with missing data. Horton and Laird (2001) describe how the auxiliary information can be incorporated into a regression model for a single binary outcome with missing covariates, thereby improving the efficiency of the regression estimators. We consider extending the method of Horton and Laird (2001) to the case of a multivariate logistic regression model for multiple correlated outcomes with missing covariates and completely observed auxiliary information. We demonstrate that, in the case of moderate to strong associations among the multiple outcomes, one can achieve considerable gains in efficiency from estimators in a multivariate model as compared to the marginal estimators of the same parameters.
Asymptotic relative efficiency; Auxiliary information; Incomplete data; Logistic regression model; Missing covariates; Multiple outcomes
Outcome-dependent sampling (ODS) has been widely used in biomedical studies because it is a cost-effective way to improve study efficiency. However, in the setting of a continuous outcome, the representation of the exposure variable has been limited to the framework of linear models, due to challenges in both theory and computation. Partial linear models (PLM) are a powerful inference tool for nonparametrically modeling the relation between an outcome and the exposure variable. In this article, we consider a partial linear model for data from an ODS design. We propose a semiparametric maximum likelihood method to make inferences with a PLM. We develop the asymptotic properties and conduct simulation studies to show that the proposed ODS estimator can produce a more efficient estimate than that from a traditional simple random sampling design with the same sample size. Using this newly developed method, we were able to explore an open question in epidemiology: whether in utero exposure to background levels of PCBs is associated with children’s intellectual impairment. Our model provides further insights into the relation between low-level PCB exposure and children’s cognitive function. The results shed new light on a body of inconsistent epidemiologic findings.
Cost-effective designs; Empirical likelihood; Outcome dependent sampling; Partial linear model; Polychlorinated biphenyls; P-spline
In this article we study a semiparametric additive risks model (McKeague and Sasieni, 1994) for two-stage design survival data, where accurate information is available only for second-stage subjects, a subset of the first-stage study. We derive two-stage estimators by combining data from both stages. Large-sample inferences are developed. As a by-product, we also obtain asymptotic properties of the single-stage estimators of McKeague and Sasieni (1994) when the semiparametric additive risks model is misspecified. The proposed two-stage estimators are shown to be asymptotically more efficient than the estimators based on the second-stage data alone. They also demonstrate smaller bias and variance in finite samples. The developed methods are illustrated using small intestine cancer data from the SEER (Surveillance, Epidemiology, and End Results) Program.
Censored data; correlation; efficiency; measurement errors; missing covariates
In cancer research, it is important to evaluate the performance of a biomarker (e.g., molecular, genetic, or imaging) that correlates with patients’ prognosis or predicts their response to a treatment in a large prospective study. Due to overall budget constraints and the high cost associated with bioassays, investigators often have to select a subset of all registered patients for biomarker assessment. To detect a potentially moderate association between the biomarker and the outcome, investigators need to decide how to select a subset of fixed size so that study efficiency can be enhanced. We show that, instead of drawing a simple random sample from the study cohort, greater efficiency can be achieved by allowing the selection probability to depend on the outcome and an auxiliary variable; we refer to such a sampling scheme as outcome- and auxiliary-dependent subsampling (OADS). This paper is motivated by the need to analyze data from a lung cancer biomarker study that adopted the OADS design to assess EGFR mutations as a predictive biomarker for whether a subject responds better to EGFR inhibitor drugs. We propose an estimated maximum likelihood method that accommodates the OADS design and utilizes all observed information, especially that contained in the likelihood score of EGFR mutations (an auxiliary variable for EGFR mutations), which is available for all patients. We derive the asymptotic properties of the proposed estimator and evaluate its finite-sample properties via simulation. We illustrate the proposed method with a data example.
Auxiliary Variable; Biomarker; Estimated Likelihood Method; Kernel Smoother; Outcome- and Auxiliary-Dependent Subsampling
We study the problem of estimation and inference on the average treatment effect in a smoking cessation trial where an outcome and some auxiliary information were measured longitudinally, and both were subject to missing values. Dynamic generalized linear mixed effects models linking the outcome, the auxiliary information, and the covariates are proposed. The maximum likelihood approach is applied to the estimation and inference on the model parameters. The average treatment effect is estimated by the G-computation approach, and the sensitivity of the treatment effect estimate to the nonignorable missing data mechanisms is investigated through the local sensitivity analysis approach. The proposed approach can handle missing data that form arbitrary missing patterns over time. We applied the proposed method to the analysis of the smoking cessation trial.
Causal effect; Potential outcomes; Robust estimator; Surrogate outcome
Outcome-dependent sampling designs have been shown to be a cost effective way to enhance study efficiency. We show that the outcome-dependent sampling design with a continuous outcome can be viewed as an extension of the two-stage case-control designs to the continuous-outcome case. We further show that the two-stage outcome-dependent sampling has a natural link with the missing-data and biased-sampling framework. Through the use of semiparametric inference and missing-data techniques, we show that a certain semiparametric maximum likelihood estimator is computationally convenient and achieves the semiparametric efficient information bound. We demonstrate this both theoretically and through simulation.
Biased sampling; Empirical process; Maximum likelihood estimation; Missing data; Outcome-dependent; Profile likelihood; Two-stage
The outcome-dependent sampling (ODS) design, which allows the observation of the exposure variable to depend on the outcome, has been shown to be cost-efficient. In this article, we propose a new statistical inference method, an estimated penalized likelihood method, for a partial linear model in the setting of a 2-stage ODS with a continuous outcome. We develop the asymptotic properties and conduct simulation studies to demonstrate the performance of the proposed estimator. A real environmental study data set is used to illustrate the proposed method.
Biased sampling; Partial linear model; P-spline; Validation sample; 2-stage
The outcome-dependent sampling scheme has been gaining attention in both the statistical literature and applied fields. Epidemiological and environmental researchers have been using it to select observations for more powerful and cost-effective studies. Motivated by a study of the effect of in utero exposure to polychlorinated biphenyls on children’s IQ at age 7, in which the effect of an important confounding variable is nonlinear, we consider a semiparametric regression model for data from an outcome-dependent sampling scheme where the relationship between the response and covariates is only partially parameterized. We propose a penalized-spline maximum likelihood estimator (PSMLE) for inference on both the parametric and nonparametric components and develop its asymptotic properties. Through simulation studies and an analysis of the IQ study, we compare the proposed estimator with several competing estimators. Practical considerations of implementing those estimators are discussed.
Outcome dependent sampling; Estimated likelihood; Semiparametric method; Penalized spline
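A penalized-spline fit of the kind underlying the PSMLE's nonparametric component can be sketched with a B-spline basis and a second-order difference penalty on the coefficients (the Eilers–Marx P-spline). This is an illustrative least-squares version, not the maximum likelihood estimator for ODS data; the knot placement, penalty order, smoothing parameter, and simulated sine-curve data are all assumptions.

```python
import numpy as np
from scipy.interpolate import BSpline

def pspline_fit(x, y, n_knots=20, degree=3, lam=1.0):
    """Eilers-Marx P-spline: cubic B-spline basis on equally spaced
    knots, ridge penalty on second differences of the coefficients."""
    knots = np.linspace(x.min(), x.max(), n_knots)
    t = np.r_[[knots[0]] * degree, knots, [knots[-1]] * degree]  # clamped knot vector
    B = BSpline.design_matrix(x, t, degree).toarray()
    n_basis = B.shape[1]
    D = np.diff(np.eye(n_basis), n=2, axis=0)    # second-difference matrix
    coef = np.linalg.solve(B.T @ B + lam * (D.T @ D), B.T @ y)
    return B @ coef

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 2 * np.pi, 300))
y = np.sin(x) + rng.normal(scale=0.1, size=300)
fit = pspline_fit(x, y, lam=0.1)
```

In the penalized likelihood methods above, the same basis-plus-penalty construction enters the (estimated) likelihood rather than a least-squares criterion, and the smoothing parameter is chosen data-adaptively.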
Recent results for case-control sampling suggest that when the covariate distribution is constrained by gene-environment independence, semiparametric estimation exploiting such independence yields a great deal of efficiency gain. We consider the efficient estimation of the treatment-biomarker interaction in two-phase sampling nested within randomized clinical trials, incorporating the independence between a randomized treatment and the baseline markers. We develop a Newton–Raphson algorithm based on the profile likelihood to compute the semiparametric maximum likelihood estimate (SPMLE). Our algorithm accommodates both continuous phase-one outcomes and continuous phase-two biomarkers. The profile information matrix is computed explicitly via numerical differentiation. In certain situations where computing the SPMLE is slow, we propose a maximum estimated likelihood estimator (MELE), which is also capable of incorporating the covariate independence. This estimated likelihood approach uses a one-step empirical covariate distribution and is thus straightforward to maximize. It offers a closed-form variance estimate with limited increase in variance relative to the fully efficient SPMLE. Our results suggest that exploiting the covariate independence in two-phase sampling increases efficiency substantially, particularly for estimating treatment-biomarker interactions.
case-only estimator; estimated likelihood; gene-environment independence; Newton–Raphson algorithm; profile likelihood; treatment-biomarker interactions
In this article, we study the estimation of the mean response and regression coefficients in semiparametric regression problems when the response variable is subject to nonrandom missingness. When the missingness is independent of the response conditional on high-dimensional auxiliary information, the parametric approach may misspecify the relationship between covariates and response, while the nonparametric approach is infeasible because of the curse of dimensionality. To overcome this, we study a model-based approach to condense the auxiliary information and estimate the parameters of interest nonparametrically on the condensed covariate space. Our estimators possess the double-robustness property; i.e., they are consistent whenever the model for the response given auxiliary covariates or the model for the missingness given auxiliary covariates is correct. We conduct a number of simulations to compare the numerical performance of our estimators with that of other existing estimators in the missing data literature, including the propensity score approach and the inverse probability weighted estimating equation. A real data set is used to illustrate our approach.
Auxiliary covariate; High-dimensional data; Kernel estimation; Missing at random; Semiparametric regression
In colorectal polyp prevention trials, estimation of the rate of recurrence of adenomas at the end of the trial may be complicated by dependent censoring, that is, time to follow-up colonoscopy and dropout may be dependent on time to recurrence. Assuming that the auxiliary variables capture the dependence between recurrence and censoring times, we propose to fit two working models with the auxiliary variables as covariates to define risk groups and then extend an existing weighted logistic regression method for independent censoring to each risk group to accommodate potential dependent censoring. In a simulation study, we show that the proposed method results in both a gain in efficiency and reduction in bias for estimating the recurrence rate. We illustrate the methodology by analyzing a recurrent adenoma dataset from a colorectal polyp prevention trial.
We consider statistical inference on a regression model in which some covariables are measured with errors together with an auxiliary variable. The proposed estimation of the regression coefficients is based on a set of estimating equations. This new method alleviates some drawbacks of previously proposed estimators, including the requirement of undersmoothing the regressor functions over the auxiliary variable and the restriction on other covariables, which can be observed exactly, among others. The large-sample properties of the proposed estimator are established. We further propose a jackknife estimation, which consists of deleting one estimating equation (instead of one observation) at a time. We show that the jackknife estimator of the regression coefficients and the estimating-equations-based estimator are asymptotically equivalent. Simulations show that the jackknife estimator has smaller bias when the sample size is small or moderate. In addition, the jackknife estimation also provides a consistent estimator of the asymptotic covariance matrix, which is robust to heteroscedasticity. We illustrate these methods by applying them to a real data set from marketing science.
Linear regression model; noised variable; measurement error; auxiliary variable; estimating equation; jackknife estimation; asymptotic normality
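The delete-one-equation jackknife can be illustrated on ordinary linear estimating equations, where each observation contributes one equation; the heteroscedastic simulation below is an assumption for illustration, not the marketing data set.

```python
import numpy as np

def ee_solve(X, y):
    """Root of the linear estimating equations sum_i x_i (y_i - x_i'beta) = 0."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def jackknife(X, y):
    """Delete one estimating equation (one row's contribution) at a time;
    return the full-sample estimate and the jackknife covariance."""
    n, p = X.shape
    beta_full = ee_solve(X, y)
    reps = np.empty((n, p))
    for i in range(n):
        mask = np.arange(n) != i          # drop equation i
        reps[i] = ee_solve(X[mask], y[mask])
    centered = reps - reps.mean(axis=0)
    cov = (n - 1) / n * centered.T @ centered   # jackknife covariance
    return beta_full, cov

rng = np.random.default_rng(4)
n = 400
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=1 + 0.5 * np.abs(x), size=n)  # heteroscedastic errors
X = np.column_stack([np.ones(n), x])
beta, cov = jackknife(X, y)
```

In this simple case, deleting one equation coincides with deleting one observation; in the measurement-error setting of the abstract, each estimating equation aggregates kernel-smoothed terms over the auxiliary variable, which is precisely why the equation-level jackknife differs from the observation-level one. The jackknife covariance remains valid under the heteroscedasticity built into the simulation.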
The nested case-control (NCC) design is a popular sampling method in large epidemiologic studies because of its cost-effectiveness for investigating the temporal relationship of diseases with environmental exposures or biological precursors. Thomas’ maximum partial likelihood estimator is commonly used to estimate the regression parameters in Cox’s model for NCC data. In this paper, we consider the situation in which failure/censoring information and some crude covariates are available for the entire cohort in addition to the NCC data, and propose an improved estimator that is asymptotically more efficient than Thomas’ estimator. We adopt a projection approach that, heretofore, has only been employed in situations of random validation sampling, and show that it can be well adapted to NCC designs, where the sampling scheme is a dynamic process and the selection of controls is not independent. Under certain conditions, consistency and asymptotic normality of the proposed estimator are established, and a consistent variance estimator is also developed. Furthermore, a simplified approximate estimator is proposed when the disease is rare. Extensive simulations are conducted to evaluate the finite-sample performance of our proposed estimators and to compare their efficiency with Thomas’ estimator and other competing estimators. Moreover, sensitivity analyses are conducted to demonstrate the behavior of the proposed estimator when model assumptions are violated, and we find that the biases are reasonably small in realistic situations. We further demonstrate the proposed method with data from studies on Wilms’ tumor.
Counting process; Cox proportional hazards model; Martingale; Risk set sampling; Survival analysis
We study a mixed-effects model in which the response and the main covariate are linked by position. While the covariate corresponding to the observed response is not directly observable, there exists a latent covariate process that represents the underlying positional features of the covariate. When the positional features and the underlying distributions are parametric, the expectation-maximization (EM) algorithm is the most commonly used procedure. Without the parametric assumptions, however, the practical feasibility of a semiparametric EM algorithm and the corresponding inference procedures remain to be investigated. In this paper, we propose a semiparametric approach and identify the conditions under which the semiparametric estimators share the same asymptotic properties as the unachievable estimators that use the true values of the latent covariate; that is, the oracle property is achieved. We propose a Monte Carlo graphical evaluation tool to assess the adequacy of the sample size for achieving the oracle property. The semiparametric approach is then applied to data from a colon carcinogenesis study on the effects of cell DNA damage on the expression level of the oncogene bcl-2. The graphical evaluation shows that, with a moderate number of subunits, the numerical performance of the semiparametric estimator is very close to its asymptotic limit, indicating that a more complex EM-based implementation would at most achieve minimal improvement and is thus unnecessary.
Carcinogenesis; Consistency; Generalized estimating equation; Local linear smoothing; Mixed-effects model
Intermediate outcome variables can often be used as auxiliary variables for the true outcome of interest in randomized clinical trials. For many cancers, time to recurrence is an informative marker in predicting a patient’s overall survival outcome, and could provide auxiliary information for the analysis of survival times.
To investigate whether models linking recurrence and death combined with a multiple imputation procedure for censored observations can result in efficiency gains in the estimation of treatment effects, and be used to shorten trial lengths.
Recurrence and death times are modeled using data from 12 trials in colorectal cancer. Multiple imputation is used as a strategy for handling missing values arising from censoring. The imputation procedure uses a cure model for time to recurrence and a time-dependent Weibull proportional hazards model for time to death. Recurrence times are imputed, and then death times are imputed conditionally on recurrence times. To illustrate these methods, trials are artificially censored 2 years after the last accrual, the imputation procedure is implemented, and a log-rank test and Cox model are used to analyze and compare these new data with the original data.
The results show modest but consistent gains in efficiency from using the auxiliary information in recurrence times. Comparison of the analyses shows the treatment-effect estimates and log-rank test results from the 2-year censored imputed data to lie between the estimates from the original data and those from the artificially censored data, indicating that the procedure was able to recover some of the information lost due to censoring.
The models used are all fully parametric, requiring distributional assumptions of the data.
The proposed models may be useful for improving the efficiency of treatment-effect estimation in cancer trials and for shortening trial length.
Auxiliary Variables; Colon Cancer; Cure Models; Multiple Imputation; Surrogate Endpoints
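The imputation-for-censoring step can be illustrated in its simplest form: drawing an event time from an assumed parametric survival model conditional on exceeding the censoring time. The sketch below uses a plain Weibull rather than the cure model and time-dependent Weibull proportional hazards model described above, and assumes the shape and scale are known.

```python
import numpy as np

def impute_censored_weibull(c, shape, scale, rng):
    """Draw T | T > c under Weibull(shape, scale) by inverting the
    conditional survival function: S(t) = exp(-(t/scale)**shape), so
    T = scale * (-log(U * S(c)))**(1/shape) with U ~ Uniform(0, 1)."""
    u = rng.random(len(c))
    s_c = np.exp(-(c / scale) ** shape)      # survival at the censoring time
    return scale * (-np.log(u * s_c)) ** (1.0 / shape)

rng = np.random.default_rng(6)
c = np.ones(100_000)                         # everyone censored at t = 1
imp = impute_censored_weibull(c, shape=1.0, scale=1.0, rng=rng)
# shape = 1 is the exponential case; memorylessness gives E[T | T > 1] = 2
```

In a full multiple-imputation analysis, such conditional draws would be repeated across several imputed data sets, with the model parameters themselves redrawn each time to propagate estimation uncertainty.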
Improving efficiency for regression coefficients and predicting trajectories of individuals are two important aspects in analysis of longitudinal data. Both involve estimation of the covariance function. Yet, challenges arise in estimating the covariance function of longitudinal data collected at irregular time points. A class of semiparametric models for the covariance function is proposed by imposing a parametric correlation structure while allowing a nonparametric variance function. A kernel estimator is developed for the estimation of the nonparametric variance function. Two methods, a quasi-likelihood approach and a minimum generalized variance method, are proposed for estimating parameters in the correlation structure. We introduce a semiparametric varying coefficient partially linear model for longitudinal data and propose an estimation procedure for model coefficients by using a profile weighted least squares approach. Sampling properties of the proposed estimation procedures are studied and asymptotic normality of the resulting estimators is established. Finite sample performance of the proposed procedures is assessed by Monte Carlo simulation studies. The proposed methodology is illustrated by an analysis of a real data example.
Kernel regression; local linear regression; profile weighted least squares; semiparametric varying coefficient model
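The kernel estimator of the nonparametric variance function can be sketched as a Nadaraya–Watson smooth of squared residuals over observation times. This is a minimal version with a Gaussian kernel and fixed bandwidth, assuming residuals from an already-fitted mean model; it ignores within-subject correlation, which the quasi-likelihood step in the abstract would handle.

```python
import numpy as np

def kernel_variance(t_obs, resid, t_grid, h):
    """Nadaraya-Watson estimate of the variance function sigma^2(t)
    from squared residuals observed at (possibly irregular) times t_obs."""
    u = (t_grid[:, None] - t_obs[None, :]) / h
    k = np.exp(-0.5 * u**2)                  # Gaussian kernel weights
    return (k @ resid**2) / k.sum(axis=1)

# Illustrative data: true variance function sigma^2(t) = 0.5 + t
rng = np.random.default_rng(5)
t = rng.uniform(0, 1, 20_000)
resid = rng.normal(scale=np.sqrt(0.5 + t))
v_hat = kernel_variance(t, resid, np.array([0.5]), h=0.05)
```

The estimated variance function would then be combined with a parametric correlation structure to form the working covariance used in the profile weighted least squares step.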
We consider nonparametric regression of a scalar outcome on a covariate when the outcome is missing at random (MAR) given the covariate and other observed auxiliary variables. We propose a class of augmented inverse probability weighted (AIPW) kernel estimating equations for nonparametric regression under MAR. We show that AIPW kernel estimators are consistent when the probability that the outcome is observed, that is, the selection probability, is either known by design or estimated under a correctly specified model. In addition, we show that a specific AIPW kernel estimator in our class, which employs the fitted values from a model for the conditional mean of the outcome given covariates and auxiliaries, is double-robust: it remains consistent if this model is correctly specified even if the selection probabilities are modeled or specified incorrectly. Furthermore, when both models happen to be right, this double-robust estimator attains the smallest possible asymptotic variance of all AIPW kernel estimators and maximally extracts the information in the auxiliary variables. We also describe a simple correction to the AIPW kernel estimating equations that, while preserving double-robustness, ensures an efficiency improvement over nonaugmented IPW estimation when the selection model is correctly specified, regardless of the validity of the second model used in the augmentation term. We perform simulations to evaluate the finite-sample performance of the proposed estimators, and apply the methods to the analysis of the AIDS Costs and Services Utilization Survey data. Technical proofs are available online.
Asymptotics; Augmented kernel estimating equations; Double robustness; Efficiency; Inverse probability weighted kernel estimating equations; Kernel smoothing
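The double-robustness property of AIPW estimation is easiest to see for a marginal mean with known selection probabilities, a parametric stand-in for the kernel-regression setting above. The simulation below is an assumption for illustration: the AIPW estimator combines inverse-probability weighting with an outcome-regression augmentation and stays consistent, while the complete-case mean is biased.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
x = rng.normal(size=n)
y = 2.0 + x + rng.normal(size=n)             # full-data mean of y is 2.0
pi = 1.0 / (1.0 + np.exp(-(0.5 + x)))        # selection probability (MAR given x)
r = (rng.random(n) < pi).astype(float)       # r = 1: outcome observed

m_hat = 2.0 + x                              # outcome-regression fit (correct here)
y_obs = np.where(r == 1, y, 0.0)             # unobserved outcomes are never used
# AIPW: IPW term plus augmentation that subtracts the weighted model residual
aipw = np.mean(r * y_obs / pi - (r - pi) / pi * m_hat)
naive = y_obs[r == 1].mean()                 # complete-case mean, biased upward
```

Replacing m_hat with a misspecified fit leaves the estimator consistent as long as pi is correct, and vice versa; the kernel version in the abstract applies the same construction locally at each covariate value.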
When novel scientific questions arise after longitudinal binary data have been collected, the subsequent selection of subjects from the cohort for whom further detailed assessment will be undertaken is often necessary in order to efficiently collect new information. Key examples of additional data collection include retrospective questionnaire data, novel data linkage, or evaluation of stored biological specimens. In such cases, all data required for the new analyses are available except for the new target predictor or exposure. We propose a class of longitudinal outcome-dependent sampling schemes and detail a design-corrected conditional maximum likelihood analysis for highly efficient estimation of time-varying and time-invariant covariate coefficients when resource limitations prohibit exposure ascertainment on all participants. Additionally, we detail an important study planning phase that exploits available cohort data to proactively examine the feasibility of any proposed substudy and to inform decisions regarding the most desirable study design. The proposed designs and associated analyses are discussed in the context of a study that seeks to examine the modifying effect of an interleukin-10 cytokine single nucleotide polymorphism on asthma symptom regression in adolescents participating in the Childhood Asthma Management Program Continuation Study. Using this example, we assume that all data necessary to conduct the study are available except subject-specific genotype data. We also assume that these data would be ascertained by analyzing stored blood samples, the cost of which limits the sample size.
Outcome dependent sampling; epidemiological study design; binary data; longitudinal data analysis; marginal models; marginalized models; time-dependent covariates
We propose a general class of semiparametric transformation models with random effects to formulate the effects of possibly time-dependent covariates on clustered or correlated failure times. This class encompasses all commonly used transformation models, including proportional hazards and proportional odds models, and it accommodates a variety of random-effects distributions, particularly Gaussian distributions. We show that the nonparametric maximum likelihood estimators of the model parameters are consistent, asymptotically normal and asymptotically efficient. We develop the corresponding likelihood-based inference procedures. Simulation studies demonstrate that the proposed methods perform well in practical situations. An illustration with a well-known diabetic retinopathy study is provided.
Correlated failure times; frailty model; nonparametric maximum likelihood estimation; proportional hazards; semiparametric efficiency; survival analysis