Two-stage design is a well-known cost-effective way for conducting biomedical studies when the exposure variable is expensive or difficult to measure. Recent research development further allowed one or both stages of the two-stage design to be outcome dependent on a continuous outcome variable. This outcome-dependent sampling feature enables further efficiency gain in parameter estimation and overall cost reduction of the study (e.g. Wang, X. and Zhou, H., 2010. Design and inference for cancer biomarker study with an outcome and auxiliary-dependent subsampling. Biometrics
66, 502–511; Zhou, H., Song, R., Wu, Y. and Qin, J., 2011. Statistical inference for a two-stage outcome-dependent sampling design with a continuous outcome. Biometrics
67, 194–202). In this paper, we develop a semiparametric mixed effect regression model for data from a two-stage design where the second-stage data are sampled with an outcome-auxiliary-dependent sample (OADS) scheme. Our method allows the cluster- or center-effects of the study subjects to be accounted for. We propose an estimated likelihood function to estimate the regression parameters. Simulation study indicates that greater study efficiency gains can be achieved under the proposed two-stage OADS design with center-effects when compared with other alternative sampling schemes. We illustrate the proposed method by analyzing a dataset from the Collaborative Perinatal Project.
Center effect; Mixed model; Outcome-auxiliary-dependent sampling; Validation sample
In cancer research, it is important to evaluate the performance of a biomarker (e.g. molecular, genetic, or imaging) that correlates patients’ prognosis or predicts patients’ response to a treatment in large prospective study. Due to overall budget constraint and high cost associated with bioassays, investigators often have to select a subset from all registered patients for biomarker assessment. To detect a potentially moderate association between the biomarker and the outcome, investigators need to decide how to select the subset of a fixed size such that the study efficiency can be enhanced. We show that, instead of drawing a simple random sample from the study cohort, greater efficiency can be achieved by allowing the selection probability to depend on the outcome and an auxiliary variable; we refer to such a sampling scheme as outcome and auxiliary-dependent subsampling (OADS). This paper is motivated by the need to analyze data from a lung cancer biomarker study that adopts the OADS design to assess EGFR mutations as a predictive biomarker for whether a subject responds to a greater extent to EGFR inhibitor drugs. We propose an estimated maximum likelihood method that accommodates the OADS design and utilizes all observed information, especially those contained in the likelihood score of EGFR mutations (an auxiliary variable of EGFR mutations) that is available to all patients. We derive the asymptotic properties of the proposed estimator and evaluate its finite sample properties via simulation. We illustrate the proposed method with a data example.
Auxiliary Variable; Biomarker; Estimated Likelihood Method; Kernel Smoother; Outcome and Auxiliary-Dependent Subsampling
Outcome-dependent sampling (ODS) study designs are commonly implemented with rare diseases or when prospective studies are infeasible. In longitudinal data settings, when a repeatedly measured binary response is rare, an ODS design can be highly efficient for maximizing statistical information subject to resource limitations that prohibit covariate ascertainment of all observations. This manuscript details an ODS design where individual observations are sampled with probabilities determined by an inexpensive, time-varying auxiliary variable that is related but is not equal to the response. With the goal of validly estimating marginal model parameters based on the resulting biased sample, we propose a semi-parametric, sequential offsetted logistic regressions (SOLR) approach. The SOLR strategy first estimates the relationship between the auxiliary variable and the response and covariate data by using an offsetted logistic regression analysis where the offset is used to adjust for the biased design. Results from the auxiliary variable model are then combined with the known or estimated sampling probabilities to formulate a second offset that is used to correct for the biased design in the ultimate target model relating the longitudinal binary response to covariates. Because the target model offset is estimated with SOLR, we detail asymptotic standard error estimates that account for uncertainty associated with the auxiliary variable model. Motivated by an analysis of the BioCycle Study (Gaskins et al., Effect of daily fiber intake on reproductive function: the BioCycle Study. American Journal of Clinical Nutrition 2009; 90(4): 1061–1069) that aims to describe the relationship between reproductive health (determined by luteinizing hormone levels) and fiber consumption, we examine properties of SOLR estimators and compare them with other common approaches.
outcome-dependent sampling; bias sampling; study design; generalized estimating equations; longitudinal data analysis; binary data
The two-stage case-control design has been widely used in epidemiology studies for its cost-effectiveness and improvement of the study efficiency (White, 1982; Breslow and Cain, 1988). The evolution of modern biomedical studies has called for cost-effective designs with a continuous outcome and exposure variables. In this paper, we propose a new two-stage outcome-dependent sampling scheme with a continuous outcome variable, where both the first-stage data and the second-stage data are from outcome-dependent sampling schemes. We develop a semiparametric empirical likelihood estimation for inference about the regression parameters in the proposed design. Simulation studies were conducted to investigate the small sample behavior of the proposed estimator. We demonstrate that, for a given statistical power, the proposed design will require a substantially smaller sample size than the alternative designs. The proposed method is illustrated with an environmental health study conducted at National Institute of Health.
Biased sampling; Empirical likelihood; Outcome dependent; Sample size; Two-stage design
The current goal of initial antiretroviral (ARV) therapy is suppression of plasma human immunodeficiency virus (HIV)-1 RNA levels to below 200 copies per milliliter. A proportion of HIV-infected patients who initiate antiretroviral therapy in clinical practice or antiretroviral clinical trials either fail to suppress HIV-1 RNA or have HIV-1 RNA levels rebound on therapy. Frequently, these patients have sustained CD4 cell counts responses and limited or no clinical symptoms and, therefore, have potentially limited indications for altering therapy which they may be tolerating well despite increased viral replication. On the other hand, increased viral replication on therapy leads to selection of resistance mutations to the antiretroviral agents comprising their therapy and potentially cross-resistance to other agents in the same class decreasing the likelihood of response to subsequent antiretroviral therapy. The optimal time to switch antiretroviral therapy to ensure sustained virologic suppression and prevent clinical events in patients who have rebound in their HIV-1 RNA, yet are stable, is not known. Randomized clinical trials to compare early versus delayed switching have been difficult to design and more difficult to enroll. In some clinical trials, such as the AIDS Clinical Trials Group (ACTG) Study A5095, patients randomized to initial antiretroviral treatment combinations, who fail to suppress HIV-1 RNA or have a rebound of HIV-1 RNA on therapy are allowed to switch from the initial ARV regimen to a new regimen, based on clinician and patient decisions. We delineate a statistical framework to estimate the effect of early versus late regimen change using data from ACTG A5095 in the context of two-stage designs.
In causal inference, a large class of doubly robust estimators are derived through semiparametric theory with applications to missing data problems. This class of estimators is motivated through geometric arguments and relies on large samples for good performance. By now, several authors have noted that a doubly robust estimator may be suboptimal when the outcome model is misspecified even if it is semiparametric efficient when the outcome regression model is correctly specified. Through auxiliary variables, two-stage designs, and within the contextual backdrop of our scientific problem and clinical study, we propose improved doubly robust, locally efficient estimators of a population mean and average causal effect for early versus delayed switching to second-line ARV treatment regimens. Our analysis of the ACTG A5095 data further demonstrates how methods that use auxiliary variables can improve over methods that ignore them. Using the methods developed here, we conclude that patients who switch within 8 weeks of virologic failure have better clinical outcomes, on average, than patients who delay switching to a new second-line ARV regimen after failing on the initial regimen. Ordinary statistical methods fail to find such differences. This article has online supplementary material.
Causal inference; Double robustness; Longitudinal data analysis; Missing data; Rubin causal model; Semiparametric efficient estimation
The performance of a biomarker predicting clinical outcome is often evaluated in a large prospective study. Due to high costs associated with bioassay, investigators need to select a subset from all available patients for biomarker assessment. We consider an outcome- and auxiliary-dependent subsampling (OADS) scheme, in which the probability of selecting a patient into the subset depends on the patient’s clinical outcome and an auxiliary variable. We proposed a semiparametric empirical likelihood method to estimate the association between biomarker and clinical outcome. Asymptotic properties of the estimator are given. Simulation study shows that the proposed method outperforms alternative methods.
Auxiliary variable; Biomarker; Outcome- and auxiliary-dependent subsampling; Population-based studies; Semiparametric empirical likelihood
How to take advantage of the available auxiliary covariate information when the primary covariate of interest is not measured is a frequently encountered question in biomedical study. In this paper, we consider the multivariate failure times regression analysis in which the primary covariate is assessed only in a validation set but a continuous auxiliary covariate for it is available for all subjects in the study cohort. Under the frame of marginal hazard model, we propose to estimate the induced relative risk function in the nonvalidation set through kernel smoothing method and then obtain an estimated pseudo-partial likelihood function. The proposed estimated pseudo-partial likelihood estimator is shown to be consistent and asymptotically normal. We also give an estimator of the marginal cumulative baseline hazard function. Simulations are conducted to evaluate the finite sample performance of our proposed estimator. The proposed method is illustrated by analyzing a heart disease data from Studies of Left Ventricular Dysfunction (SOLVD).
Multivariate Failure Times; Auxiliary Covariate; Pseudo-Partial Likelihood; Kernel Smoothing; Validation Sample
In many biomedical studies, it is common that due to budget constraints, the primary covariate is only collected in a randomly selected subset from the full study cohort. Often, there is an inexpensive auxiliary covariate for the primary exposure variable that is readily available for all the cohort subjects. Valid statistical methods that make use of the auxiliary information to improve study efficiency need to be developed. To this end, we develop an estimated partial likelihood approach for correlated failure time data with auxiliary information. We assume a marginal hazard model with common baseline hazard function. The asymptotic properties for the proposed estimators are developed. The proof of the asymptotic results for the proposed estimators is nontrivial since the moments used in estimating equation are not martingale-based and the classical martingale theory is not sufficient. Instead, our proofs rely on modern empirical theory. The proposed estimator is evaluated through simulation studies and is shown to have increased efficiency compared to existing methods. The proposed methods are illustrated with a data set from the Framingham study.
Marginal hazard model; Correlated failure time; Validation set; Auxiliary covariate
We consider nonparametric regression of a scalar outcome on a covariate when the outcome is missing at random (MAR) given the covariate and other observed auxiliary variables. We propose a class of augmented inverse probability weighted (AIPW) kernel estimating equations for nonparametric regression under MAR. We show that AIPW kernel estimators are consistent when the probability that the outcome is observed, that is, the selection probability, is either known by design or estimated under a correctly specified model. In addition, we show that a specific AIPW kernel estimator in our class that employs the fitted values from a model for the conditional mean of the outcome given covariates and auxiliaries is double-robust, that is, it remains consistent if this model is correctly specified even if the selection probabilities are modeled or specified incorrectly. Furthermore, when both models happen to be right, this double-robust estimator attains the smallest possible asymptotic variance of all AIPW kernel estimators and maximally extracts the information in the auxiliary variables. We also describe a simple correction to the AIPW kernel estimating equations that while preserving double-robustness it ensures efficiency improvement over nonaugmented IPW estimation when the selection model is correctly specified regardless of the validity of the second model used in the augmentation term. We perform simulations to evaluate the finite sample performance of the proposed estimators, and apply the methods to the analysis of the AIDS Costs and Services Utilization Survey data. Technical proofs are available online.
Asymptotics; Augmented kernel estimating equations; Double robustness; Efficiency; Inverse probability weighted kernel estimating equations; Kernel smoothing
Multi-phased designs and biased sampling designs are two of the well recognized approaches to enhance study efficiency. In this paper, we propose a new and cost-effective sampling design, the two-phase probability dependent sampling design (PDS), for studies with a continuous outcome. This design will enable investigators to make efficient use of resources by targeting more informative subjects for sampling. We develop a new semiparametric empirical likelihood inference method to take advantage of data obtained through a PDS design. Simulation study results indicate that the proposed sampling scheme, coupled with the proposed estimator, is more efficient and more powerful than the existing outcome dependent sampling design and the simple random sampling design with the same sample size. We illustrate the proposed method with a real data set from an environmental epidemiologic study.
Empirical likelihood; Missing data; Semiparametric; Probability sample
In many randomized clinical trials, the primary response variable, for example, the survival time, is not observed directly after the patients enroll in the study but rather observed after some period of time (lag time). It is often the case that such a response variable is missing for some patients due to censoring that occurs when the study ends before the patient’s response is observed or when the patients drop out of the study. It is often assumed that censoring occurs at random which is referred to as noninformative censoring; however, in many cases such an assumption may not be reasonable. If the missing data are not analyzed properly, the estimator or test for the treatment effect may be biased. In this paper, we use semiparametric theory to derive a class of consistent and asymptotically normal estimators for the treatment effect parameter which are applicable when the response variable is right censored. The baseline auxiliary covariates and post-treatment auxiliary covariates, which may be time-dependent, are also considered in our semiparametric model. These auxiliary covariates are used to derive estimators that both account for informative censoring and are more efficient then the estimators which do not consider the auxiliary covariates.
Informative censoring; Influence function; Logrank test; Nuisance tangent space; Proportional hazards model; Regular and asymptotically linear estimators
The primary goal of a randomized clinical trial is to make comparisons among two or more treatments. For example, in a two-arm trial with continuous response, the focus may be on the difference in treatment means; with more than two treatments, the comparison may be based on pairwise differences. With binary outcomes, pairwise odds-ratios or log-odds ratios may be used. In general, comparisons may be based on meaningful parameters in a relevant statistical model. Standard analyses for estimation and testing in this context typically are based on the data collected on response and treatment assignment only. In many trials, auxiliary baseline covariate information may also be available, and it is of interest to exploit these data to improve the efficiency of inferences. Taking a semiparametric theory perspective, we propose a broadly-applicable approach to adjustment for auxiliary covariates to achieve more efficient estimators and tests for treatment parameters in the analysis of randomized clinical trials. Simulations and applications demonstrate the performance of the methods.
Covariate adjustment; Hypothesis test; k-arm trial; Kruskal-Wallis test; Log-odds ratio; Longitudinal data; Semiparametric theory
In molecular epidemiology studies biospecimen data are collected, often with the purpose of evaluating the synergistic role between a biomarker and another feature on an outcome. Typically, biomarker data are collected on only a proportion of subjects eligible for study, leading to a missing data problem. Missing data methods, however, are not customarily incorporated into analyses. Instead, complete-case (CC) analyses are performed, which can result in biased and inefficient estimates.
Through simulations, we characterized the performance of CC methods when interaction effects are estimated. We also investigated whether standard multiple imputation (MI) could improve estimation over CC methods when the data are not missing at random (NMAR) and auxiliary information may or may not exist.
CC analyses were shown to result in considerable bias and efficiency loss. While MI reduced bias and increased efficiency over CC methods under specific conditions, it too resulted in biased estimates depending on the strength of the auxiliary data available and the nature of the missingness. In particular, CC performed better than MI when extreme values of the covariate were more likely to be missing, while MI outperformed CC when missingness of the covariate related to both the covariate and outcome. MI always improved performance when strong auxiliary data were available. In a real study, MI estimates of interaction effects were attenuated relative to those from a CC approach.
Our findings suggest the importance of incorporating missing data methods into the analysis. If the data are MAR, standard MI is a reasonable method. Auxiliary variables may make this assumption more reasonable even if the data are NMAR. Under NMAR we emphasize caution when using standard MI and recommend it over CC only when strong auxiliary data are available. MI, with the missing data mechanism specified, is an alternative when the data are NMAR. In all cases, it is recommended to take advantage of MI's ability to account for the uncertainty of these assumptions.
Outcome-dependent sampling (ODS) has been widely used in biomedical studies because it is a cost effective way to improve study efficiency. However, in the setting of a continuous outcome, the representation of the exposure variable has been limited to the framework of linear models, due to the challenge in terms of both theory and computation. Partial linear models (PLM) are a powerful inference tool to nonparametrically model the relation between an outcome and the exposure variable. In this article, we consider a case study of a partial linear model for data from an ODS design. We propose a semiparametric maximum likelihood method to make inferences with a PLM. We develop the asymptotic properties and conduct simulation studies to show that the proposed ODS estimator can produce a more efficient estimate than that from a traditional simple random sampling design with the same sample size. Using this newly developed method, we were able to explore an open question in epidemiology: whether in utero exposure to background levels of PCBs is associated with children’s intellectual impairment. Our model provides further insights into the relation between low-level PCB exposure and children’s cognitive function. The results shed new light on a body of inconsistent epidemiologic findings.
Cost-effective designs; Empirical likelihood; Outcome dependent sampling; Partial linear model; Polychlorinated biphenyls; P-spline
In this paper we study the Buckley-James estimator of accelerated failure time models with auxiliary covariates. Instead of postulating distributional assumptions on the auxiliary covariates, we use a local polynomial approximation method to accommodate them into the Buckley-James estimating equations. The regression parameters are obtained iteratively by minimizing a consecutive distance of the estimates. Asymptotic properties of the proposed estimator are investigated. Simulation studies show that the efficiency gain of using auxiliary information is remarkable when compared to just using the validation sample. The method is applied to the PBC data from the Mayo Clinic trial in primary biliary cirrhosis as an illustration.
To estimate an overall treatment difference with data from a randomized comparative clinical study, baseline covariates are often utilized to increase the estimation precision. Using the standard analysis of covariance technique for making inferences about such an average treatment difference may not be appropriate, especially when the fitted model is nonlinear. On the other hand, the novel augmentation procedure recently studied, for example, by Zhang and others (2008. Improving efficiency of inferences in randomized clinical trials using auxiliary covariates. Biometrics
64, 707–715) is quite flexible. However, in general, it is not clear how to select covariates for augmentation effectively. An overly adjusted estimator may inflate the variance and in some cases be biased. Furthermore, the results from the standard inference procedure by ignoring the sampling variation from the variable selection process may not be valid. In this paper, we first propose an estimation procedure, which augments the simple treatment contrast estimator directly with covariates. The new proposal is asymptotically equivalent to the aforementioned augmentation method. To select covariates, we utilize the standard lasso procedure. Furthermore, to make valid inference from the resulting lasso-type estimator, a cross validation method is used. The validity of the new proposal is justified theoretically and empirically. We illustrate the procedure extensively with a well-known primary biliary cirrhosis clinical trial data set.
ANCOVA; Cross validation; Efficiency augmentation; Mayo PBC data; Semi-parametric efficiency
As biological studies become more expensive to conduct, statistical methods that take advantage of existing auxiliary information about an expensive exposure variable are desirable in practice. Such methods should improve the study efficiency and increase the statistical power for a given number of assays. In this paper, we consider an inference procedure for multivariate failure time with auxiliary covariate information. We propose an estimated pseudo-partial likelihood estimator under the marginal hazard model framework and develop the asymptotic properties for the proposed estimator. We conduct simulation studies to evaluate the performance of the proposed method in practical situations and demonstrate the proposed method with a data set from the Studies of Left Ventricular Dysfunction (SOLVD,1991).
Auxiliary covariate; Marginal hazard model; Multivariate data; Pseudo-partial likelihood; Validation sample
In this article we study a semiparametric additive risks model (McKeague and Sasieni (1994)) for two-stage design survival data where accurate information is available only on second stage subjects, a subset of the first stage study. We derive two-stage estimators by combining data from both stages. Large sample inferences are developed. As a by-product, we also obtain asymptotic properties of the single stage estimators of McKeague and Sasieni (1994) when the semiparametric additive risks model is misspecified. The proposed two-stage estimators are shown to be asymptotically more efficient than the second stage estimators. They also demonstrate smaller bias and variance for finite samples. The developed methods are illustrated using small intestine cancer data from the SEER (Surveillance, Epidemiology, and End Results) Program.
Censored data; correlation; efficiency; measurement errors; missing covariates
Summary. It is widely believed that risks of many complex diseases are determined by genetic susceptibilities, environmental exposures, and their interaction. Chatterjee and Carroll (2005, Biometrika 92, 399–418) developed an efficient retrospective maximum-likelihood method for analysis of case–control studies that exploits an assumption of gene–environment independence and leaves the distribution of the environmental covariates to be completely nonparametric. Spinka, Carroll, and Chatterjee (2005, Genetic Epidemiology 29, 108–127) extended this approach to studies where certain types of genetic information, such as haplotype phases, may be missing on some subjects. We further extend this approach to situations when some of the environmental exposures are measured with error. Using a polychotomous logistic regression model, we allow disease status to have K + 1 levels. We propose use of a pseudolikelihood and a related EM algorithm for parameter estimation. We prove consistency and derive the resulting asymptotic covariance matrix of parameter estimates when the variance of the measurement error is known and when it is estimated using replications. Inferences with measurement error corrections are complicated by the fact that the Wald test often behaves poorly in the presence of large amounts of measurement error. The likelihood-ratio (LR) techniques are known to be a good alternative. However, the LR tests are not technically correct in this setting because the likelihood function is based on an incorrect model, i.e., a prospective model in a retrospective sampling scheme. We corrected standard asymptotic results to account for the fact that the LR test is based on a likelihood-type function. The performance of the proposed method is illustrated using simulation studies emphasizing the case when genetic information is in the form of haplotypes and missing data arises from haplotype-phase ambiguity. An application of our method is illustrated using a population-based case–control study of the association between calcium intake and the risk of colorectal adenoma.
EM algorithm; Errors in variables; Gene-environment independence; Gene-environment interactions; Likelihood-ratio tests in misspecified models; Inferences in measurement error models; Profile likelihood; Semiparametric methods
Due to early colonoscopy for some participants, interval-censored observations can be introduced into the data of a colorectal polyp prevention trial. The censoring could be dependent of risk of recurrence if the reasons of having early colonoscopy are associated with recurrence. This can complicate estimation of the recurrence rate.
We propose to use midpoint imputation to convert interval-censored data problems to right censored data problems. To adjust for potential dependent censoring, we use information from auxiliary variables to define risk groups to perform the weighted Kaplan-Meier estimation to the midpoint imputed data. The risk groups are defined using two risk scores derived from two working proportional hazards models with the auxiliary variables as the covariates. One is for the recurrence time and the other is for the censoring time. The method described here is explored by simulation and illustrated with an example from a colorectal polyp prevention trial.
We first show that midpoint imputation under an assumption of independent censoring will produce an unbiased estimate of recurrence rate at the end of the trial, which is often the main interest of a colorectal polyp prevention trial, and then show in simulations that the weighted Kaplan-Meier method using the information from auxiliary variables based on the midpoint imputed data can improve efficiency in a situation with independent censoring and reduce bias in a situation with dependent censoring compared to the conventional methods, while estimating the recurrence rate at the end of the trial.
The research in this paper uses midpoint imputation to handle interval-censored observations and then uses the information from auxiliary variables to adjust for dependent censoring by incorporating them into the weighted Kaplan-Meier estimation. This approach can handle a situation with multiple auxiliary variables by deriving two risk scores from two working PH models. Although the idea of this approach might appear simple, the results do show that the weighted Kaplan-Meier approach can gain efficiency and reduce bias due to dependent censoring.
We consider statistical inference on a regression model in which some covariables are measured with errors together with an auxiliary variable. The proposed estimation for the regression coefficients is based on some estimating equations. This new method alleates some drawbacks of previously proposed estimations. This includes the requirment of undersmoothing the regressor functions over the auxiliary variable, the restriction on other covariables which can be observed exactly, among others. The large sample properties of the proposed estimator are established. We further propose a jackknife estimation, which consists of deleting one estimating equation (instead of one obervation) at a time. We show that the jackknife estimator of the regression coefficients and the estimating equations based estimator are asymptotically equivalent. Simulations show that the jackknife estimator has smaller biases when sample size is small or moderate. In addition, the jackknife estimation can also provide a consistent estimator of the asymptotic covariance matrix, which is robust to the heteroscedasticity. We illustrate these methods by applying them to a real data set from marketing science.
Linear regression model; noised variable; measurement error; auxiliary variable; estimating equation; jackknife estimation; asymptotic normality
Missing data are common in longitudinal studies due to drop-out, loss to follow-up, and death. Likelihood-based mixed effects models for longitudinal data give valid estimates when the data are ignorably missing; that is, the parameters for the missing data process are distinct from those of the main model for the outcome, and the data are missing at random (MAR). These assumptions, however, are not testable without further information. In some studies, there is additional information available in the form of an auxiliary variable known to be correlated with the missing outcome of interest. Availability of such auxiliary information provides us with an opportunity to test the MAR assumption. If the MAR assumption is violated, such information can be utilized to reduce or eliminate bias when the missing data process depends on the unobserved outcome through the auxiliary information. We compare two methods of utilizing the auxiliary information: joint modeling of the outcome of interest and the auxiliary variable, and multiple imputation (MI). Simulation studies are performed to examine the two methods. The likelihood-based joint modeling approach is consistent and most efficient when correctly specified. However, mis-specification of the joint distribution can lead to biased results. MI is slightly less efficient than a correct joint modeling approach but more robust to model mis-specification when all the variables affecting the missing data mechanism and the missing outcome are included in the imputation model. An example is presented from a dementia screening study.
auxiliary variable MAR (A-MAR); joint modeling; linear mixed effects model; missing data; MNAR; multiple imputation (MI)
The case-cohort study involves two-phase sampling: simple random sampling from an infinite super-population at phase one and stratified random sampling from a finite cohort at phase two. Standard analyses of case-cohort data involve solution of inverse probability weighted (IPW) estimating equations, with weights determined by the known phase two sampling fractions. The variance of parameter estimates in (semi)parametric models, including the Cox model, is the sum of two terms: (i) the model based variance of the usual estimates that would be calculated if full data were available for the entire cohort; and (ii) the design based variance from IPW estimation of the unknown cohort total of the efficient influence function (IF) contributions. This second variance component may be reduced by adjusting the sampling weights, either by calibration to known cohort totals of auxiliary variables correlated with the IF contributions or by their estimation using these same auxiliary variables. Both adjustment methods are implemented in the R
survey package. We derive the limit laws of coefficients estimated using adjusted weights. The asymptotic results suggest practical methods for construction of auxiliary variables that are evaluated by simulation of case-cohort samples from the National Wilms Tumor Study and by log-linear modeling of case-cohort data from the Atherosclerosis Risk in Communities Study. Although not semiparametric efficient, estimators based on adjusted weights may come close to achieving full efficiency within the class of augmented IPW estimators.
Calibration; Case-cohort; Estimation; Log-linear model; Semiparametric
The Canadian Study of Health and Aging (CSHA) employed a prevalent cohort design to study survival after onset of dementia, where patients with dementia were sampled and the onset time of dementia was determined retrospectively. The prevalent cohort sampling scheme favors individuals who survive longer. Thus, the observed survival times are subject to length bias. In recent years, there has been a rising interest in developing estimation procedures for prevalent cohort survival data that not only account for length bias but also actually exploit the incidence distribution of the disease to improve efficiency. This article considers semiparametric estimation of the Cox model for the time from dementia onset to death under a stationarity assumption with respect to the disease incidence. Under the stationarity condition, the semiparametric maximum likelihood estimation is expected to be fully efficient yet difficult to perform for statistical practitioners, as the likelihood depends on the baseline hazard function in a complicated way. Moreover, the asymptotic properties of the semiparametric maximum likelihood estimator are not well-studied. Motivated by the composite likelihood method (Besag 1974), we develop a composite partial likelihood method that retains the simplicity of the popular partial likelihood estimator and can be easily performed using standard statistical software. When applied to the CSHA data, the proposed method estimates a significant difference in survival between the vascular dementia group and the possible Alzheimer’s disease group, while the partial likelihood method for left-truncated and right-censored data yields a greater standard error and a 95% confidence interval covering 0, thus highlighting the practical value of employing a more efficient methodology. To check the assumption of stable disease for the CSHA data, we also present new graphical and numerical tests in the article. The R code used to obtain the maximum composite partial likelihood estimator for the CSHA data is available in the online Supplementary Material, posted on the journal web site.
Backward and forward recurrence time; Cross-sectional sampling; Random truncation; Renewal processes
Outcome-dependent sampling designs have been shown to be a cost effective way to enhance study efficiency. We show that the outcome-dependent sampling design with a continuous outcome can be viewed as an extension of the two-stage case-control designs to the continuous-outcome case. We further show that the two-stage outcome-dependent sampling has a natural link with the missing-data and biased-sampling framework. Through the use of semiparametric inference and missing-data techniques, we show that a certain semiparametric maximum likelihood estimator is computationally convenient and achieves the semiparametric efficient information bound. We demonstrate this both theoretically and through simulation.
Biased sampling; Empirical process; Maximum likelihood estimation; Missing data; Outcome-dependent; Profile likelihood; Two-stage