Outcome-dependent sampling (ODS) has been widely used in biomedical studies because it is a cost effective way to improve study efficiency. However, in the setting of a continuous outcome, the representation of the exposure variable has been limited to the framework of linear models, due to the challenge in terms of both theory and computation. Partial linear models (PLM) are a powerful inference tool to nonparametrically model the relation between an outcome and the exposure variable. In this article, we consider a case study of a partial linear model for data from an ODS design. We propose a semiparametric maximum likelihood method to make inferences with a PLM. We develop the asymptotic properties and conduct simulation studies to show that the proposed ODS estimator can produce a more efficient estimate than that from a traditional simple random sampling design with the same sample size. Using this newly developed method, we were able to explore an open question in epidemiology: whether in utero exposure to background levels of PCBs is associated with children’s intellectual impairment. Our model provides further insights into the relation between low-level PCB exposure and children’s cognitive function. The results shed new light on a body of inconsistent epidemiologic findings.
Cost-effective designs; Empirical likelihood; Outcome dependent sampling; Partial linear model; Polychlorinated biphenyls; P-spline
Two-stage design is a well-known cost-effective way for conducting biomedical studies when the exposure variable is expensive or difficult to measure. Recent research development further allowed one or both stages of the two-stage design to be outcome dependent on a continuous outcome variable. This outcome-dependent sampling feature enables further efficiency gain in parameter estimation and overall cost reduction of the study (e.g. Wang, X. and Zhou, H., 2010. Design and inference for cancer biomarker study with an outcome and auxiliary-dependent subsampling. Biometrics
66, 502–511; Zhou, H., Song, R., Wu, Y. and Qin, J., 2011. Statistical inference for a two-stage outcome-dependent sampling design with a continuous outcome. Biometrics
67, 194–202). In this paper, we develop a semiparametric mixed effect regression model for data from a two-stage design where the second-stage data are sampled with an outcome-auxiliary-dependent sample (OADS) scheme. Our method allows the cluster- or center-effects of the study subjects to be accounted for. We propose an estimated likelihood function to estimate the regression parameters. Simulation study indicates that greater study efficiency gains can be achieved under the proposed two-stage OADS design with center-effects when compared with other alternative sampling schemes. We illustrate the proposed method by analyzing a dataset from the Collaborative Perinatal Project.
Center effect; Mixed model; Outcome-auxiliary-dependent sampling; Validation sample
Two-stage design has long been recognized to be a cost-effective way for conducting biomedical studies. In many trials, auxiliary covariate information may also be available, and it is of interest to exploit these auxiliary data to improve the efficiency of inferences. In this paper, we propose a 2-stage design with continuous outcome where the second-stage data is sampled with an “outcome-auxiliary-dependent sampling” (OADS) scheme. We propose an estimator which is the maximizer for an estimated likelihood function. We show that the proposed estimator is consistent and asymptotically normally distributed. The simulation study indicates that greater study efficiency gains can be achieved under the proposed 2-stage OADS design by utilizing the auxiliary covariate information when compared with other alternative sampling schemes. We illustrate the proposed method by analyzing a data set from an environmental epidemiologic study.
Auxiliary covariate; Kernel smoothing; Outcome-auxiliary-dependent sampling; 2-stage sampling design
The outcome dependent sampling scheme has been gaining attention in both the statistical literature and applied fields. Epidemiological and environmental researchers have been using it to select the observations for more powerful and cost-effective studies. Motivated by a study of the effect of in utero exposure to polychlorinated biphenyls on children’s IQ at age 7, in which the effect of an important confounding variable is nonlinear, we consider a semi-parametric regression model for data from an outcome-dependent sampling scheme where the relationship between the response and covariates is only partially parameterized. We propose a penalized spline maximum likelihood estimation (PSMLE) for inference on both the parametric and the nonparametric components and develop their asymptotic properties. Through simulation studies and an analysis of the IQ study, we compare the proposed estimator with several competing estimators. Practical considerations of implementing those estimators are discussed.
Outcome dependent sampling; Estimated likelihood; Semiparametric method; Penalized spline
The two-stage case-control design has been widely used in epidemiology studies for its cost-effectiveness and improvement of the study efficiency (White, 1982; Breslow and Cain, 1988). The evolution of modern biomedical studies has called for cost-effective designs with a continuous outcome and exposure variables. In this paper, we propose a new two-stage outcome-dependent sampling scheme with a continuous outcome variable, where both the first-stage data and the second-stage data are from outcome-dependent sampling schemes. We develop a semiparametric empirical likelihood estimation for inference about the regression parameters in the proposed design. Simulation studies were conducted to investigate the small sample behavior of the proposed estimator. We demonstrate that, for a given statistical power, the proposed design will require a substantially smaller sample size than the alternative designs. The proposed method is illustrated with an environmental health study conducted at National Institute of Health.
Biased sampling; Empirical likelihood; Outcome dependent; Sample size; Two-stage design
Regularization methods are characterized by loss functions measuring data fits and penalty terms constraining model parameters. The commonly used quadratic loss is not suitable for classification with binary responses, whereas the loglikelihood function is not readily applicable to models where the exact distribution of observations is unknown or not fully specified. We introduce the penalized Bregman divergence by replacing the negative loglikelihood in the conventional penalized likelihood with Bregman divergence, which encompasses many commonly used loss functions in the regression analysis, classification procedures and machine learning literature. We investigate new statistical properties of the resulting class of estimators with the number pn of parameters either diverging with the sample size n or even nearly comparable with n, and develop statistical inference tools. It is shown that the resulting penalized estimator, combined with appropriate penalties, achieves the same oracle property as the penalized likelihood estimator, but asymptotically does not rely on the complete specification of the underlying distribution. Furthermore, the choice of loss function in the penalized classifiers has an asymptotically relatively negligible impact on classification performance. We illustrate the proposed method for quasilikelihood regression and binary classification with simulation evaluation and real-data application.
Consistency; Divergence minimization; Exponential family; Loss function; Optimal Bayes rule; Oracle property; Quasilikelihood
Multi-phased designs and biased sampling designs are two of the well recognized approaches to enhance study efficiency. In this paper, we propose a new and cost-effective sampling design, the two-phase probability dependent sampling design (PDS), for studies with a continuous outcome. This design will enable investigators to make efficient use of resources by targeting more informative subjects for sampling. We develop a new semiparametric empirical likelihood inference method to take advantage of data obtained through a PDS design. Simulation study results indicate that the proposed sampling scheme, coupled with the proposed estimator, is more efficient and more powerful than the existing outcome dependent sampling design and the simple random sampling design with the same sample size. We illustrate the proposed method with a real data set from an environmental epidemiologic study.
Empirical likelihood; Missing data; Semiparametric; Probability sample
Multilevel functional data is collected in many biomedical studies. For example, in a study of the effect of Nimodipine on patients with subarachnoid hemorrhage (SAH), patients underwent multiple 4-hour treatment cycles. Within each treatment cycle, subjects’ vital signs were reported every 10 minutes. This data has a natural multilevel structure with treatment cycles nested within subjects and measurements nested within cycles. Most literature on nonparametric analysis of such multilevel functional data focus on conditional approaches using functional mixed effects models. However, parameters obtained from the conditional models do not have direct interpretations as population average effects. When population effects are of interest, we may employ marginal regression models. In this work, we propose marginal approaches to fit multilevel functional data through penalized spline generalized estimating equation (penalized spline GEE). The procedure is effective for modeling multilevel correlated generalized outcomes as well as continuous outcomes without suffering from numerical difficulties. We provide a variance estimator robust to misspecification of correlation structure. We investigate the large sample properties of the penalized spline GEE estimator with multilevel continuous data and show that the asymptotics falls into two categories. In the small knots scenario, the estimated mean function is asymptotically efficient when the true correlation function is used and the asymptotic bias does not depend on the working correlation matrix. In the large knots scenario, both the asymptotic bias and variance depend on the working correlation. We propose a new method to select the smoothing parameter for penalized spline GEE based on an estimate of the asymptotic mean squared error (MSE). We conduct extensive simulation studies to examine property of the proposed estimator under different correlation structures and sensitivity of the variance estimation to the choice of smoothing parameter. Finally, we apply the methods to the SAH study to evaluate a recent debate on discontinuing the use of Nimodipine in the clinical community.
Penalized spline; GEE; Semiparametric models; Longitudinal data; Functional data
Analysis of high dimensional data often seeks to identify a subset of important features and assess their effects on the outcome. Traditional statistical inference procedures based on standard regression methods often fail in the presence of high-dimensional features. In recent years, regularization methods have emerged as promising tools for analyzing high dimensional data. These methods simultaneously select important features and provide stable estimation of their effects. Adaptive LASSO and SCAD for instance, give consistent and asymptotically normal estimates with oracle properties. However, in finite samples, it remains difficult to obtain interval estimators for the regression parameters. In this paper, we propose perturbation resampling based procedures to approximate the distribution of a general class of penalized parameter estimates. Our proposal, justified by asymptotic theory, provides a simple way to estimate the covariance matrix and confidence regions. Through finite sample simulations, we verify the ability of this method to give accurate inference and compare it to other widely used standard deviation and confidence interval estimates. We also illustrate our proposals with a data set used to study the association of HIV drug resistance and a large number of genetic mutations.
High dimensional regression; Interval estimation; Oracle property; Regularized estimation; Resampling methods
The complexity of semiparametric models poses new challenges to statistical inference and model selection that frequently arise from real applications. In this work, we propose new estimation and variable selection procedures for the semiparametric varying-coefficient partially linear model. We first study quantile regression estimates for the nonparametric varying-coefficient functions and the parametric regression coefficients. To achieve nice efficiency properties, we further develop a semiparametric composite quantile regression procedure. We establish the asymptotic normality of proposed estimators for both the parametric and nonparametric parts and show that the estimators achieve the best convergence rate. Moreover, we show that the proposed method is much more efficient than the least-squares-based method for many non-normal errors and that it only loses a small amount of efficiency for normal errors. In addition, it is shown that the loss in efficiency is at most 11.1% for estimating varying coefficient functions and is no greater than 13.6% for estimating parametric components. To achieve sparsity with high-dimensional covariates, we propose adaptive penalization methods for variable selection in the semiparametric varying-coefficient partially linear model and prove that the methods possess the oracle property. Extensive Monte Carlo simulation studies are conducted to examine the finite-sample performance of the proposed procedures. Finally, we apply the new methods to analyze the plasma beta-carotene level data.
Asymptotic relative efficiency; composite quantile regression; semiparametric varying-coefficient partially linear model; oracle properties; variable selection
Fan & Li (2001) propose a family of variable selection methods via penalized likelihood using concave penalty functions. The nonconcave penalized likelihood estimators enjoy the oracle properties, but maximizing the penalized likelihood function is computationally challenging, because the objective function is nondifferentiable and nonconcave. In this article we propose a new unified algorithm based on the local linear approximation (LLA) for maximizing the penalized likelihood for a broad class of concave penalty functions. Convergence and other theoretical properties of the LLA algorithm are established. A distinguished feature of the LLA algorithm is that at each LLA step, the LLA estimator can naturally adopt a sparse representation. Thus we suggest using the one-step LLA estimator from the LLA algorithm as the final estimates. Statistically, we show that if the regularization parameter is appropriately chosen, the one-step LLA estimates enjoy the oracle properties with good initial estimators. Computationally, the one-step LLA estimation methods dramatically reduce the computational cost in maximizing the nonconcave penalized likelihood. We conduct some Monte Carlo simulation to assess the finite sample performance of the one-step sparse estimation methods. The results are very encouraging.
AIC; BIC; Lasso; One-step estimator; Oracle Properties; SCAD
We develop continuous-time models for the analysis of environmental or ecological monitoring data such that subjects are observed at multiple monitoring time points across space. Of particular interest are additive hazards regression models where the baseline hazard function can take on flexible forms. We consider time-varying covariates and take into account spatial dependence via autoregression in space and time. We develop statistical inference for the regression coefficients via partial likelihood. Asymptotic properties, including consistency and asymptotic normality, are established for parameter estimates under suitable regularity conditions. Feasible algorithms utilizing existing statistical software packages are developed for computation. We also consider a simpler additive hazards model with homogeneous baseline hazard and develop hypothesis testing for homogeneity. A simulation study demonstrates that the statistical inference using partial likelihood has sound finite-sample properties and offers a viable alternative to maximum likelihood estimation. For illustration, we analyze data from an ecological study that monitors bark beetle colonization of red pines in a plantation of Wisconsin.
Current status data; Grouped survival data; Maximum likelihood; Multiple monitoring times; Spatial autoregression; Spatial lattice
Response-adaptive randomizations are able to assign more patients in a comparative clinical trial to the tentatively better treatment. However, due to the adaptation in patient allocation, the samples to be compared are no longer independent. At large sample sizes, many asymptotic properties of test statistics derived for independent sample comparison are still applicable in adaptive randomization provided that the patient allocation ratio converges to an appropriate target asymptotically. However, the small sample properties of commonly used test statistics in response-adaptive randomization are not fully studied.
Simulations are systematically conducted to characterize the statistical properties of eight test statistics in six response-adaptive randomization methods at six allocation targets with sample sizes ranging from 20 to 200. Since adaptive randomization is usually not recommended for sample size less than 30, the present paper focuses on the case with a sample of 30 to give general recommendations with regard to test statistics for contingency tables in response-adaptive randomization at small sample sizes.
Among all asymptotic test statistics, the Cook's correction to chi-square test (TMC) is the best in attaining the nominal size of hypothesis test. The William's correction to log-likelihood ratio test (TML) gives slightly inflated type I error and higher power as compared with TMC, but it is more robust against the unbalance in patient allocation. TMC and TML are usually the two test statistics with the highest power in different simulation scenarios. When focusing on TMC and TML, the generalized drop-the-loser urn (GDL) and sequential estimation-adjusted urn (SEU) have the best ability to attain the correct size of hypothesis test respectively. Among all sequential methods that can target different allocation ratios, GDL has the lowest variation and the highest overall power at all allocation ratios. The performance of different adaptive randomization methods and test statistics also depends on allocation targets. At the limiting allocation ratio of drop-the-loser (DL) and randomized play-the-winner (RPW) urn, DL outperforms all other methods including GDL. When comparing the power of test statistics in the same randomization method but at different allocation targets, the powers of log-likelihood-ratio, log-relative-risk, log-odds-ratio, Wald-type Z, and chi-square test statistics are maximized at their corresponding optimal allocation ratios for power. Except for the optimal allocation target for log-relative-risk, the other four optimal targets could assign more patients to the worse arm in some simulation scenarios. Another optimal allocation target, RRSIHR, proposed by Rosenberger and Sriram (Journal of Statistical Planning and Inference, 1997) is aimed at minimizing the number of failures at fixed power using Wald-type Z test statistics. Among allocation ratios that always assign more patients to the better treatment, RRSIHR usually has less variation in patient allocation, and the values of variation are consistent across all simulation scenarios. Additionally, the patient allocation at RRSIHR is not too extreme. Therefore, RRSIHR provides a good balance between assigning more patients to the better treatment and maintaining the overall power.
The Cook's correction to chi-square test and Williams' correction to log-likelihood-ratio test are generally recommended for hypothesis test in response-adaptive randomization, especially when sample sizes are small. The generalized drop-the-loser urn design is the recommended method for its good overall properties. Also recommended is the use of the RRSIHR allocation target.
This article focuses on variable selection for partially linear models when the covariates are measured with additive errors. We propose two classes of variable selection procedures, penalized least squares and penalized quantile regression, using the nonconvex penalized principle. The first procedure corrects the bias in the loss function caused by the measurement error by applying the so-called correction-for-attenuation approach, whereas the second procedure corrects the bias by using orthogonal regression. The sampling properties for the two procedures are investigated. The rate of convergence and the asymptotic normality of the resulting estimates are established. We further demonstrate that, with proper choices of the penalty functions and the regularization parameter, the resulting estimates perform asymptotically as well as an oracle procedure (Fan and Li 2001). Choice of smoothing parameters is also discussed. Finite sample performance of the proposed variable selection procedures is assessed by Monte Carlo simulation studies. We further illustrate the proposed procedures by an application.
Errors-in-variable; Error-free; Error-prone; Local linear regression; Quantile regression; SCAD
The receiver operating characteristic (ROC) curve is often used to evaluate the performance of a biomarker measured on continuous scale to predict the disease status or a clinical condition. Motivated by the need for novel study designs with better estimation efficiency and reduced study cost, we consider a biased sampling scheme that consists of a SRC and a supplemental TDC. Using this approach, investigators can oversample or undersample subjects falling into certain regions of the biomarker measure, yielding improved precision for the estimation of the ROC curve with a fixed sample size. Test-result-dependent sampling will introduce bias in estimating the predictive accuracy of the biomarker if standard ROC estimation methods are used. In this article, we discuss three approaches for analyzing data of a test-result-dependent structure with a special focus on the empirical likelihood method. We establish asymptotic properties of the empirical likelihood estimators for covariate-specific ROC curves and covariate-independent ROC curves and give their corresponding variance estimators. Simulation studies show that the empirical likelihood method yields good properties and is more efficient than alternative methods. Recommendations on number of regions, cutoff points, and subject allocation is made based on the simulation results. The proposed methods are illustrated with a data example based on an ongoing lung cancer clinical trial.
Binormal model; Covariate-independent ROC curve; Covariate-specific ROC curve; Empirical likelihood method; Test-result-dependent sampling
Motivated by an analysis of a real data set in ecology, we consider a class of partially nonlinear models where both of a nonparametric component and a parametric component present. We develop two new estimation procedures to estimate the parameters in the parametric component. Consistency and asymptotic normality of the resulting estimators are established. We further propose an estimation procedure and a generalized F test procedure for the nonparametric component in the partially nonlinear models. Asymptotic properties of the newly proposed estimation procedure and the test statistic are derived. Finite sample performance of the proposed inference procedures are assessed by Monte Carlo simulation studies. An application in ecology is used to illustrate the proposed methods.
Local linear regression; partial linear models; profile least squares; semiparametric models
We consider selecting both fixed and random effects in a general class of mixed effects models using maximum penalized likelihood (MPL) estimation along with the smoothly clipped absolute deviation (SCAD) and adaptive LASSO (ALASSO) penalty functions. The maximum penalized likelihood estimates are shown to posses consistency and sparsity properties and asymptotic normality. A model selection criterion, called the ICQ statistic, is proposed for selecting the penalty parameters (Ibrahim, Zhu and Tang, 2008). The variable selection procedure based on ICQ is shown to consistently select important fixed and random effects. The methodology is very general and can be applied to numerous situations involving random effects, including generalized linear mixed models. Simulation studies and a real data set from an Yale infant growth study are used to illustrate the proposed methodology.
ALASSO; Cholesky decomposition; EM algorithm; ICQ criterion; Mixed Effects selection; Penalized likelihood; SCAD
This paper is concerned with the selection and estimation of fixed and random effects in linear mixed effects models. We propose a class of nonconcave penalized profile likelihood methods for selecting and estimating important fixed effects. To overcome the difficulty of unknown covariance matrix of random effects, we propose to use a proxy matrix in the penalized profile likelihood. We establish conditions on the choice of the proxy matrix and show that the proposed procedure enjoys the model selection consistency where the number of fixed effects is allowed to grow exponentially with the sample size. We further propose a group variable selection strategy to simultaneously select and estimate important random effects, where the unknown covariance matrix of random effects is replaced with a proxy matrix. We prove that, with the proxy matrix appropriately chosen, the proposed procedure can identify all true random effects with asymptotic probability one, where the dimension of random effects vector is allowed to increase exponentially with the sample size. Monte Carlo simulation studies are conducted to examine the finite-sample performance of the proposed procedures. We further illustrate the proposed procedures via a real data example.
Adaptive Lasso; linear mixed effects models; group variable selection; oracle property; SCAD
The functional coefficient regression models assume that the regression coefficients vary with some “threshold” variable, providing appreciable flexibility in capturing the underlying dynamics in data and avoiding the so-called “curse of dimensionality” in multivariate nonparametric estimation. We first investigate the estimation, inference, and forecasting for the functional coefficient regression models with dependent observations via penalized splines. The P-spline approach, as a direct ridge regression shrinkage type global smoothing method, is computationally efficient and stable. With established fixed-knot asymptotics, inference is readily available. Exact inference can be obtained for fixed smoothing parameter λ, which is most appealing for finite samples. Our penalized spline approach gives an explicit model expression, which also enables multi-step-ahead forecasting via simulations. Furthermore, we examine different methods of choosing the important smoothing parameter λ: modified multi-fold cross-validation (MCV), generalized cross-validation (GCV), and an extension of empirical bias bandwidth selection (EBBS) to P-splines. In addition, we implement smoothing parameter selection using mixed model framework through restricted maximum likelihood (REML) for P-spline functional coefficient regression models with independent observations. The P-spline approach also easily allows different smoothness for different functional coefficients, which is enabled by assigning different penalty λ accordingly. We demonstrate the proposed approach by both simulation examples and a real data application.
In cancer research, it is important to evaluate the performance of a biomarker (e.g. molecular, genetic, or imaging) that correlates patients’ prognosis or predicts patients’ response to a treatment in large prospective study. Due to overall budget constraint and high cost associated with bioassays, investigators often have to select a subset from all registered patients for biomarker assessment. To detect a potentially moderate association between the biomarker and the outcome, investigators need to decide how to select the subset of a fixed size such that the study efficiency can be enhanced. We show that, instead of drawing a simple random sample from the study cohort, greater efficiency can be achieved by allowing the selection probability to depend on the outcome and an auxiliary variable; we refer to such a sampling scheme as outcome and auxiliary-dependent subsampling (OADS). This paper is motivated by the need to analyze data from a lung cancer biomarker study that adopts the OADS design to assess EGFR mutations as a predictive biomarker for whether a subject responds to a greater extent to EGFR inhibitor drugs. We propose an estimated maximum likelihood method that accommodates the OADS design and utilizes all observed information, especially those contained in the likelihood score of EGFR mutations (an auxiliary variable of EGFR mutations) that is available to all patients. We derive the asymptotic properties of the proposed estimator and evaluate its finite sample properties via simulation. We illustrate the proposed method with a data example.
Auxiliary Variable; Biomarker; Estimated Likelihood Method; Kernel Smoother; Outcome and Auxiliary-Dependent Subsampling
Summary. It is widely believed that risks of many complex diseases are determined by genetic susceptibilities, environmental exposures, and their interaction. Chatterjee and Carroll (2005, Biometrika 92, 399–418) developed an efficient retrospective maximum-likelihood method for analysis of case–control studies that exploits an assumption of gene–environment independence and leaves the distribution of the environmental covariates to be completely nonparametric. Spinka, Carroll, and Chatterjee (2005, Genetic Epidemiology 29, 108–127) extended this approach to studies where certain types of genetic information, such as haplotype phases, may be missing on some subjects. We further extend this approach to situations when some of the environmental exposures are measured with error. Using a polychotomous logistic regression model, we allow disease status to have K + 1 levels. We propose use of a pseudolikelihood and a related EM algorithm for parameter estimation. We prove consistency and derive the resulting asymptotic covariance matrix of parameter estimates when the variance of the measurement error is known and when it is estimated using replications. Inferences with measurement error corrections are complicated by the fact that the Wald test often behaves poorly in the presence of large amounts of measurement error. The likelihood-ratio (LR) techniques are known to be a good alternative. However, the LR tests are not technically correct in this setting because the likelihood function is based on an incorrect model, i.e., a prospective model in a retrospective sampling scheme. We corrected standard asymptotic results to account for the fact that the LR test is based on a likelihood-type function. The performance of the proposed method is illustrated using simulation studies emphasizing the case when genetic information is in the form of haplotypes and missing data arises from haplotype-phase ambiguity. An application of our method is illustrated using a population-based case–control study of the association between calcium intake and the risk of colorectal adenoma.
EM algorithm; Errors in variables; Gene-environment independence; Gene-environment interactions; Likelihood-ratio tests in misspecified models; Inferences in measurement error models; Profile likelihood; Semiparametric methods
Hierarchical models are extensively studied and widely used in statistics and many other scientific areas. They provide an effective tool for combining information from similar resources and achieving partial pooling of inference. Since the seminal work by James and Stein (1961) and Stein (1962), shrinkage estimation has become one major focus for hierarchical models. For the homoscedastic normal model, it is well known that shrinkage estimators, especially the James-Stein estimator, have good risk properties. The heteroscedastic model, though more appropriate for practical applications, is less well studied, and it is unclear what types of shrinkage estimators are superior in terms of the risk. We propose in this paper a class of shrinkage estimators based on Stein’s unbiased estimate of risk (SURE). We study asymptotic properties of various common estimators as the number of means to be estimated grows (p → ∞). We establish the asymptotic optimality property for the SURE estimators. We then extend our construction to create a class of semi-parametric shrinkage estimators and establish corresponding asymptotic optimality results. We emphasize that though the form of our SURE estimators is partially obtained through a normal model at the sampling level, their optimality properties do not heavily depend on such distributional assumptions. We apply the methods to two real data sets and obtain encouraging results.
Heteroscedasticity; hierarchical model; shrinkage estimator; Stein’s unbiased risk estimate (SURE); asymptotic optimality
Spatial data with covariate measurement errors have been commonly observed in public health studies. Existing work mainly concentrates on parameter estimation using Gibbs sampling, and no work has been conducted to understand and quantify the theoretical impact of ignoring measurement error on spatial data analysis in the form of the asymptotic biases in regression coefficients and variance components when measurement error is ignored. Plausible implementations, from frequentist perspectives, of maximum likelihood estimation in spatial covariate measurement error models are also elusive. In this paper, we propose a new class of linear mixed models for spatial data in the presence of covariate measurement errors. We show that the naive estimators of the regression coefficients are attenuated while the naive estimators of the variance components are inflated, if measurement error is ignored. We further develop a structural modeling approach to obtaining the maximum likelihood estimator by accounting for the measurement error. We study the large sample properties of the proposed maximum likelihood estimator, and propose an EM algorithm to draw inference. All the asymptotic properties are shown under the increasing-domain asymptotic framework. We illustrate the method by analyzing the Scottish lip cancer data, and evaluate its performance through a simulation study, all of which elucidate the importance of adjusting for covariate measurement errors.
Measurement error; Spatial data; Structural modeling; Variance components; Asymptotic bias; Consistency and asymptotic normality; Increasing domain asymptotics; EM algorithm
Nested case-control (NCC) design is a popular sampling method in large epidemiologic studies for its cost effectiveness to investigate the temporal relationship of diseases with environmental exposures or biological precursors. Thomas’ maximum partial likelihood estimator is commonly used to estimate the regression parameters in Cox’s model for NCC data. In this paper, we consider a situation that failure/censoring information and some crude covariates are available for the entire cohort in addition to NCC data and propose an improved estimator that is asymptotically more efficient than Thomas’ estimator. We adopt a projection approach that, heretofore, has only been employed in situations of random validation sampling and show that it can be well adapted to NCC designs where the sampling scheme is a dynamic process and is not independent for controls. Under certain conditions, consistency and asymptotic normality of the proposed estimator are established and a consistent variance estimator is also developed. Furthermore, a simplified approximate estimator is proposed when the disease is rare. Extensive simulations are conducted to evaluate the finite sample performance of our proposed estimators and to compare the efficiency with Thomas’ estimator and other competing estimators. Moreover, sensitivity analyses are conducted to demonstrate the behavior of the proposed estimator when model assumptions are violated, and we find that the biases are reasonably small in realistic situations. We further demonstrate the proposed method with data from studies on Wilms’ tumor.
Counting process; Cox proportional hazards model; Martingale; Risk set sampling; Survival analysis
We consider statistical inference on a regression model in which some covariables are measured with errors together with an auxiliary variable. The proposed estimation for the regression coefficients is based on some estimating equations. This new method alleates some drawbacks of previously proposed estimations. This includes the requirment of undersmoothing the regressor functions over the auxiliary variable, the restriction on other covariables which can be observed exactly, among others. The large sample properties of the proposed estimator are established. We further propose a jackknife estimation, which consists of deleting one estimating equation (instead of one obervation) at a time. We show that the jackknife estimator of the regression coefficients and the estimating equations based estimator are asymptotically equivalent. Simulations show that the jackknife estimator has smaller biases when sample size is small or moderate. In addition, the jackknife estimation can also provide a consistent estimator of the asymptotic covariance matrix, which is robust to the heteroscedasticity. We illustrate these methods by applying them to a real data set from marketing science.
Linear regression model; noised variable; measurement error; auxiliary variable; estimating equation; jackknife estimation; asymptotic normality