Litter decomposition rate (k) is typically estimated from proportional litter mass loss data using models that assume constant, normally distributed errors. However, such data often show non-normal errors with reduced variance near bounds (0 or 1), potentially leading to biased k estimates. We compared the performance of nonlinear regression using the beta distribution, which is well-suited to bounded data and this type of heteroscedasticity, to standard nonlinear regression (normal errors) on simulated and real litter decomposition data. Although the beta model often provided better fits to the simulated data (based on the corrected Akaike Information Criterion, AICc), standard nonlinear regression was robust to violation of homoscedasticity and gave k estimates as accurate as, or more accurate than, nonlinear beta regression. Our simulation results also suggest that k estimates will be most accurate when study length captures mid to late stage decomposition (50–80% mass loss) and the number of measurements through time is ≥5. Regression method and data transformation choices had the smallest impact on k estimates during mid and late stage decomposition. Estimates of k were more variable among methods and generally less accurate during early and end stage decomposition. With real data, neither model was predominantly best; in most cases the models were indistinguishable based on AICc, and gave similar k estimates. However, when decomposition rates were high, normal and beta model k estimates often diverged substantially. Therefore, we recommend a pragmatic approach where both models are compared and the best is selected for a given data set. Alternatively, both models may be used via model averaging to develop weighted parameter estimates. We provide code to perform nonlinear beta regression with freely available software.
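The k-estimation approach this abstract compares can be sketched compactly: assuming the proportion of mass remaining is beta-distributed with mean exp(-k t) and an unknown precision parameter, k is obtained by maximizing the beta log-likelihood. The sketch below uses invented data and parameter values and is not the authors' provided code:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

# Simulated mass-remaining data (all values invented for illustration)
rng = np.random.default_rng(42)
t = np.linspace(0.5, 5.0, 8)            # sampling times (years)
k_true, phi_true = 0.7, 50.0            # decay rate and beta precision
mu = np.exp(-k_true * t)                # mean proportion of mass remaining
y = rng.beta(mu * phi_true, (1.0 - mu) * phi_true)

def negloglik(params):
    """Negative beta log-likelihood with mean exp(-k*t) and precision phi."""
    k, log_phi = params
    phi = np.exp(log_phi)               # keep the precision positive
    m = np.exp(-k * t)
    a, b = m * phi, (1.0 - m) * phi
    return -np.sum(gammaln(a + b) - gammaln(a) - gammaln(b)
                   + (a - 1.0) * np.log(y) + (b - 1.0) * np.log(1.0 - y))

fit = minimize(negloglik, x0=[0.3, np.log(10.0)], method="Nelder-Mead")
k_hat = fit.x[0]
```

Comparing this fit's AICc against a normal-errors fit of the same mean function reproduces the kind of model comparison described above.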
Toxicologists and pharmacologists often describe the toxicity of a chemical using parameters of a nonlinear regression model. Thus estimation of the parameters of a nonlinear regression model is an important problem. The estimates of the parameters and their uncertainty estimates depend upon the underlying error variance structure in the model. Typically, the researcher would not know a priori whether the error variances are homoscedastic (i.e., constant across dose) or heteroscedastic (i.e., the variance is a function of dose). Motivated by this concern, in this article we introduce an estimation procedure based on a preliminary test which selects an appropriate estimation procedure accounting for the underlying error variance structure. Since outliers and influential observations are common in toxicological data, the proposed methodology uses M-estimators. The asymptotic properties of the preliminary test estimator are investigated; in particular, its asymptotic covariance matrix is derived. The performance of the proposed estimator is compared with several standard estimators using simulation studies. The proposed methodology is also illustrated using a data set obtained from the National Toxicology Program.
Asymptotic normality; Dose-response study; Heteroscedasticity; Hill model; M-estimation procedure; Preliminary test estimation; Toxicology
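A minimal sketch of the pretest idea follows, simplified to a linear mean function with ordinary/weighted least squares standing in for the paper's M-estimation and nonlinear (Hill) model; all data are invented:

```python
import numpy as np
from scipy import stats

# Invented dose-response data with error variance growing with dose
rng = np.random.default_rng(9)
n = 300
dose = np.repeat(np.linspace(0.5, 5.0, 6), n // 6)
y = 1.0 + 2.0 * dose + rng.normal(0.0, 0.3 * dose)   # heteroscedastic errors

# Stage 1: ordinary least squares, ignoring the error variance structure
X = np.column_stack([np.ones(n), dose])
b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b_ols

# Stage 2: Breusch-Pagan-type preliminary test (LM = n * R^2 of the
# auxiliary regression of squared residuals on dose)
aux, *_ = np.linalg.lstsq(X, resid**2, rcond=None)
fitted_var = X @ aux
ss_res = np.sum((resid**2 - fitted_var) ** 2)
ss_tot = np.sum((resid**2 - np.mean(resid**2)) ** 2)
lm = n * (1.0 - ss_res / ss_tot)
use_wls = lm > stats.chi2.ppf(0.95, df=1)

# Stage 3: refit by weighted least squares only if the pretest rejects
if use_wls:
    w = 1.0 / np.maximum(fitted_var, 1e-2)   # guard against non-positive fits
    WX = X * w[:, None]
    b_final = np.linalg.solve(X.T @ WX, WX.T @ y)
else:
    b_final = b_ols
```

The pretest is what makes the estimator adaptive: the final estimator is OLS on homoscedastic data and WLS when heteroscedasticity is detected.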
We investigate methods for regression analysis when covariates are measured with errors. In a subset of the whole cohort, a surrogate variable is available for the true unobserved exposure variable. The surrogate variable satisfies the classical measurement error model, but it may not have repeated measurements. In addition to the surrogate variables that are available among the subjects in the calibration sample, we assume that there is an instrumental variable (IV) that is available for all study subjects. An IV is correlated with the unobserved true exposure variable and hence can be useful in the estimation of the regression coefficients. We propose a robust best linear estimator that uses all the available data, which is the most efficient among a class of consistent estimators. The proposed estimator is shown to be consistent and asymptotically normal under very weak distributional assumptions. For Poisson or linear regression, the proposed estimator is consistent even if the measurement error from the surrogate or IV is heteroscedastic. Finite-sample performance of the proposed estimator is examined and compared with other estimators via intensive simulation studies. The proposed method and other methods are applied to a bladder cancer case–control study.
Calibration sample; Estimating equation; Heteroscedastic measurement error; Nonparametric correction
In a cocaine dependence treatment study, we use linear and nonlinear regression models to model posttreatment cocaine craving scores and first cocaine relapse time. A subset of the covariates are summary statistics derived from baseline daily cocaine use trajectories, such as baseline cocaine use frequency and average daily use amount. These summary statistics are subject to estimation error and can therefore cause biased estimators for the regression coefficients. Unlike classical measurement error problems, the error we encounter here is heteroscedastic with an unknown distribution, and there are no replicates for the error-prone variables or instrumental variables. We propose two robust methods to correct for the bias: a computationally efficient method-of-moments-based method for linear regression models and a subsampling extrapolation method that is generally applicable to both linear and nonlinear regression models. Simulations and an application to the cocaine dependence treatment data are used to illustrate the efficacy of the proposed methods. Asymptotic theory and variance estimation for the proposed subsampling extrapolation method and some additional simulation results are described in the online supplementary material.
Bias correction; Method-of-moments correction; Subsampling extrapolation
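For the linear model, a method-of-moments correction of the kind described amounts to subtracting the average measurement error variance from Var(W) before forming the slope. A toy sketch, with the per-subject error variances treated as known rather than estimated from baseline trajectories as in the study:

```python
import numpy as np

# Invented data: W observes X with heteroscedastic, known error variances
rng = np.random.default_rng(2)
n = 10_000
x = rng.normal(0.0, 1.0, n)                     # true covariate
sig_u = rng.uniform(0.2, 1.0, n)                # per-subject error SDs
w = x + sig_u * rng.normal(0.0, 1.0, n)         # error-prone version
y = 0.5 + 1.2 * x + rng.normal(0.0, 1.0, n)

# Naive slope is attenuated toward zero
b_naive = np.cov(w, y)[0, 1] / np.var(w, ddof=1)

# Method-of-moments correction: remove the average measurement error
# variance from Var(W) before forming the slope
b_mom = np.cov(w, y)[0, 1] / (np.var(w, ddof=1) - np.mean(sig_u**2))
```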
Spatial data with covariate measurement errors have been commonly observed in public health studies. Existing work mainly concentrates on parameter estimation using Gibbs sampling, and no work has been conducted to understand and quantify the theoretical impact of ignoring measurement error on spatial data analysis, in the form of the asymptotic biases induced in regression coefficients and variance components. Plausible implementations, from frequentist perspectives, of maximum likelihood estimation in spatial covariate measurement error models are also elusive. In this paper, we propose a new class of linear mixed models for spatial data in the presence of covariate measurement errors. We show that if measurement error is ignored, the naive estimators of the regression coefficients are attenuated while the naive estimators of the variance components are inflated. We further develop a structural modeling approach to obtaining the maximum likelihood estimator by accounting for the measurement error. We study the large sample properties of the proposed maximum likelihood estimator, and propose an EM algorithm to draw inference. All the asymptotic properties are shown under the increasing-domain asymptotic framework. We illustrate the method by analyzing the Scottish lip cancer data, and evaluate its performance through a simulation study, all of which elucidate the importance of adjusting for covariate measurement errors.
Measurement error; Spatial data; Structural modeling; Variance components; Asymptotic bias; Consistency and asymptotic normality; Increasing domain asymptotics; EM algorithm
We consider the problem of high-dimensional regression under non-constant error variances. Despite being a common phenomenon in biological applications, heteroscedasticity has, so far, been largely ignored in high-dimensional analysis of genomic data sets. We propose a new methodology that allows non-constant error variances for high-dimensional estimation and model selection. Our method incorporates heteroscedasticity by simultaneously modeling both the mean and variance components via a novel doubly regularized approach. Extensive Monte Carlo simulations indicate that our proposed procedure can result in better estimation and variable selection than existing methods when heteroscedasticity arises from the presence of predictors explaining error variances and outliers. Further, we demonstrate the presence of heteroscedasticity in and apply our method to an expression quantitative trait loci (eQTLs) study of 112 yeast segregants. The new procedure can automatically account for heteroscedasticity in identifying the eQTLs that are associated with gene expression variations and lead to smaller prediction errors. These results demonstrate the importance of considering heteroscedasticity in eQTL data analysis.
Generalized least squares; Heteroscedasticity; Large p small n; Model selection; Sparse regression; Variance estimation
Occupational, environmental, and nutritional epidemiologists are often interested in estimating the prospective effect of time-varying exposure variables such as cumulative exposure or cumulative updated average exposure, in relation to chronic disease endpoints such as cancer incidence and mortality. From exposure validation studies, it is apparent that many of the variables of interest are measured with moderate to substantial error. Although the ordinary regression calibration approach is approximately valid and efficient for measurement error correction of relative risk estimates from the Cox model with time-independent point exposures when the disease is rare, it is not adaptable for use with time-varying exposures. By re-calibrating the measurement error model within each risk set, a risk set regression calibration (RRC) method is proposed for this setting. An algorithm for a bias-corrected point estimate of the relative risk using the RRC approach is presented, followed by the derivation of an estimate of its variance, resulting in a sandwich estimator. Emphasis is on methods applicable to the main study/external validation study design, which arises in important applications. Simulation studies under several assumptions about the error model were carried out, which demonstrated the validity and efficiency of the method in finite samples. The method was applied to a study of diet and cancer from Harvard’s Health Professionals Follow-up Study (HPFS).
Cox proportional hazards model; Measurement error; Risk set regression calibration; Time-varying covariates
Data from many scientific areas often come with measurement error. Density or distribution function estimation from contaminated data and nonparametric regression with errors-in-variables are two important topics in measurement error models. In this paper, we present a new software package decon for R, which contains a collection of functions that use the deconvolution kernel methods to deal with measurement error problems. The functions allow the errors to be either homoscedastic or heteroscedastic. To make the deconvolution estimators computationally more efficient in R, we adapt the fast Fourier transform algorithm for density estimation with error-free data to the deconvolution kernel estimation. We discuss the practical selection of the smoothing parameter in deconvolution methods and illustrate the use of the package through both simulated and real examples.
measurement error models; deconvolution; errors-in-variables problems; smoothing; kernel; fast Fourier transform; heteroscedastic errors; bandwidth selection
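Deconvolution kernel estimators can be written down directly when the error distribution admits a closed-form deconvoluting kernel. The sketch below (in Python rather than R, without the package's FFT speed-up or bandwidth selectors) uses the well-known closed form for Laplace errors with a Gaussian kernel:

```python
import numpy as np

# Invented example: X ~ N(0,1) observed with Laplace(0, b) measurement error
rng = np.random.default_rng(0)
n, b = 2000, 0.4
x_true = rng.normal(0.0, 1.0, n)
w = x_true + rng.laplace(0.0, b, n)          # contaminated observations

def decon_kde(grid, w, h, b):
    """Deconvoluting KDE for Laplace(0, b) errors with a Gaussian kernel K.
    Fourier inversion gives the closed form
    K_U(u) = K(u) - (b/h)^2 * K''(u), with K''(u) = (u^2 - 1) K(u)."""
    u = (grid[:, None] - w[None, :]) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    K_U = K - (b / h) ** 2 * (u**2 - 1.0) * K
    return K_U.mean(axis=1) / h

grid = np.linspace(-4.0, 4.0, 401)
f_hat = decon_kde(grid, w, h=0.45, b=b)      # bandwidth chosen by eye here
```

The correction term widens the kernel's negative lobes to undo the smoothing induced by the error; unlike a naive KDE of W, the estimator targets the density of the unobserved X.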
Multivariate local polynomial fitting is applied to the multivariate linear heteroscedastic regression model. Firstly, the local polynomial fitting is applied to estimate heteroscedastic function, then the coefficients of regression model are obtained by using generalized least squares method. One noteworthy feature of our approach is that we avoid the testing for heteroscedasticity by improving the traditional two-stage method. Due to non-parametric technique of local polynomial estimation, it is unnecessary to know the form of heteroscedastic function. Therefore, we can improve the estimation precision, when the heteroscedastic function is unknown. Furthermore, we verify that the regression coefficients is asymptotic normal based on numerical simulations and normal Q-Q plots of residuals. Finally, the simulation results and the local polynomial estimation of real data indicate that our approach is surely effective in finite-sample situations.
In-vivo measurement of bone lead by means of K-X ray fluorescence (KXRF) is the preferred biological marker of chronic exposure to lead. Unfortunately, considerable measurement error associated with KXRF estimations can introduce bias in estimates of the effect of bone lead when this variable is included as the exposure in a regression model. Estimates of uncertainty reported by the KXRF instrument reflect the variance of the measurement error and, although they can be used to correct the measurement error bias, they are seldom used in epidemiological statistical analyses. Errors-in-variables regression (EIV) allows for correction of bias caused by measurement error in predictor variables, based on the knowledge of the reliability of such variables. The authors propose a way to obtain reliability coefficients for bone lead measurements from uncertainty data reported by the KXRF instrument and compare, by use of Monte Carlo simulations, results obtained using EIV regression models versus those obtained by the standard procedures. Results of the simulations show that Ordinary Least Squares (OLS) regression models provide severely biased estimates of effect, and that EIV provides nearly unbiased estimates. Although EIV effect estimates are more imprecise, their mean squared error is much smaller than that of OLS estimates. In conclusion, EIV is a better alternative than OLS to estimate the effect of bone lead when measured by KXRF.
Lead; KXRF; measurement error; errors-in-variables model; simulations
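In the single-predictor case, the correction described reduces to dividing the OLS slope by the reliability coefficient λ = Var(X)/Var(W). A toy simulation (all numbers invented, with the error variance playing the role of the instrument-reported uncertainty) showing the attenuation and its removal:

```python
import numpy as np

# Invented numbers: true bone lead X, KXRF-style measurement W = X + U
rng = np.random.default_rng(1)
n, beta_true, sigma_u = 5000, 2.0, 1.5
x = rng.normal(10.0, 2.0, n)                    # true exposure
w = x + rng.normal(0.0, sigma_u, n)             # measured exposure
y = 1.0 + beta_true * x + rng.normal(0.0, 1.0, n)

# OLS on the mismeasured exposure: attenuated slope
b_ols = np.cov(w, y)[0, 1] / np.var(w, ddof=1)

# Reliability coefficient lambda = Var(X)/Var(W), estimated from the
# instrument-reported error variance, then used to de-attenuate
lam = (np.var(w, ddof=1) - sigma_u**2) / np.var(w, ddof=1)
b_eiv = b_ols / lam
```

As the abstract notes, b_eiv trades a larger variance for a much smaller bias than b_ols.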
We present a semi-parametric deconvolution estimator for the density function of a random variable X that is measured with error, a common challenge in many epidemiological studies. Traditional deconvolution estimators rely only on assumptions about the distribution of X and the error in its measurement, and ignore information available in auxiliary variables. Our method assumes the availability of a covariate vector statistically related to X by a mean–variance function regression model, where regression errors are normally distributed and independent of the measurement errors. Simulations suggest that the estimator achieves a much lower integrated squared error than the observed-data kernel density estimator when models are correctly specified and the assumption of normal regression errors is met. We illustrate the method using anthropometric measurements of newborns to estimate the density function of newborn length.
density estimation; measurement error; mean–variance function model
Epidemiologic research focuses on estimating exposure-disease associations. In some applications the exposure may be dichotomized, for instance when threshold levels of the exposure are of primary public health interest (e.g., consuming 5 or more fruits and vegetables per day may reduce cancer risk). Errors in exposure variables are known to yield biased regression coefficients in exposure-disease models. Methods for bias-correction with continuous mismeasured exposures have been extensively discussed, and are often based on validation substudies, where the “true” and imprecise exposures are observed on a small subsample. In this paper, we focus on biases associated with dichotomization of a mismeasured continuous exposure. The amount of bias, in relation to measurement error in the imprecise continuous predictor, and choice of dichotomization cut point are discussed. Measurement error correction via regression calibration is developed for this scenario, and compared to naïvely using the dichotomized mismeasured predictor in linear exposure-disease models. Properties of the measurement error correction method (i.e., bias, mean-squared error) are assessed via simulations.
measurement error correction; dichotomizing covariates; regression calibration
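Under joint normality, the calibrated version of the dichotomized exposure has a closed form, E[1(X > c) | W] = P(X > c | W). A sketch with invented numbers and the error variance treated as known (in practice it would come from the validation substudy):

```python
import numpy as np
from scipy.stats import norm

# Invented example: Y depends on the dichotomized true exposure 1(X > c)
rng = np.random.default_rng(7)
n, cut = 20_000, 0.0
b0, b1 = 1.0, 2.0
sig_u = 0.8                                       # error SD, assumed known here
x = rng.normal(0.0, 1.0, n)
w = x + rng.normal(0.0, sig_u, n)                 # mismeasured exposure
y = b0 + b1 * (x > cut) + rng.normal(0.0, 1.0, n)

# Naive approach: dichotomize the error-prone W directly
z = (w > cut).astype(float)
b1_naive = np.cov(z, y)[0, 1] / np.var(z, ddof=1)

# Regression calibration: replace 1(X > c) by P(X > c | W),
# which has a closed form under joint normality
lam = 1.0 / (1.0 + sig_u**2)                      # Var(X)/Var(W) with Var(X)=1
mu_x_given_w = lam * w                            # E[X | W] (zero means)
sd_x_given_w = np.sqrt(lam * sig_u**2)            # SD of X given W
p = 1.0 - norm.cdf((cut - mu_x_given_w) / sd_x_given_w)
b1_rc = np.cov(p, y)[0, 1] / np.var(p, ddof=1)
```

Regressing Y on the calibrated probability p is consistent for b1 because E[Y | W] = b0 + b1·p(W), while the naively dichotomized W attenuates the coefficient.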
Regression quantiles can be substantially biased when the covariates are measured with error. In this paper we propose a new method that produces consistent linear quantile estimation in the presence of covariate measurement error. The method corrects the measurement error induced bias by constructing joint estimating equations that simultaneously hold for all the quantile levels. An iterative EM-type estimation algorithm to obtain the solutions to such joint estimation equations is provided. The finite sample performance of the proposed method is investigated in a simulation study, and compared to the standard regression calibration approach. Finally, we apply our methodology to part of the National Collaborative Perinatal Project growth data, a longitudinal study with an unusual measurement error structure.
Correction for attenuation; Growth curves; Longitudinal data; Measurement error; Quantile regression; Regression calibration; Regression quantiles
This paper extends the line-segment parametrization of the structural measurement error model to situations in which the error variance on both variables is not constant over all observations. Under these conditions, we develop a method-of-moments estimate of the slope, and derive its asymptotic variance. We further derive an accurate estimator of the variability of the slope estimate based on sample data in a rather general setting. We perform simulations which validate our results and demonstrate that our estimates are more precise than estimates under a different model when the measurement error variance is not small. Lastly, we illustrate our estimation approach using real data involving heteroscedastic measurement error, and compare its performance to that of earlier models.
Delta method; heteroscedasticity; measurement error; method of moments; slope estimation
Motivation: Immunoassays are primary diagnostic and research tools throughout the medical and life sciences. The common approach to the processing of immunoassay data involves estimation of the calibration curve followed by inversion of the calibration function to read off the concentration estimates. This approach, however, does not lend itself easily to acceptable estimation of confidence limits on the estimated concentrations. Such estimates must account for uncertainty in the calibration curve as well as uncertainty in the target measurement. Even point estimates can be problematic: because of the non-linearity of calibration curves and error heteroscedasticity, the neglect of components of measurement error can produce significant bias.
Methods: We have developed a Bayesian approach for the estimation of concentrations from immunoassay data that treats the propagation of measurement error appropriately. The method uses Markov Chain Monte Carlo (MCMC) to approximate the posterior distribution of the target concentrations and numerically compute the relevant summary statistics. Software implementing the method is freely available for public use.
Results: The new method was tested on both simulated and experimental datasets with different measurement error models. The method outperformed the common inverse method on samples with large measurement errors. Even in cases with extreme measurements where the common inverse method failed, our approach always generated reasonable estimates for the target concentrations.
Availability: Project name: Baecs; Project home page: www.computationalimmunology.org/utilities/; Operating systems: Linux, MacOS X and Windows; Programming language: C++; License: Free for Academic Use.
Supplementary data are available at Bioinformatics online.
In many environmental epidemiology studies, the locations and/or times of exposure measurements and health assessments do not match. In such settings, health effects analyses often use the predictions from an exposure model as a covariate in a regression model. Such exposure predictions contain some measurement error as the predicted values do not equal the true exposures. We provide a framework for spatial measurement error modeling, showing that smoothing induces a Berkson-type measurement error with nondiagonal error structure. From this viewpoint, we review the existing approaches to estimation in a linear regression health model, including direct use of the spatial predictions and exposure simulation, and explore some modified approaches, including Bayesian models and out-of-sample regression calibration, motivated by measurement error principles. We then extend this work to the generalized linear model framework for health outcomes. Based on analytical considerations and simulation results, we compare the performance of all these approaches under several spatial models for exposure. Our comparisons underscore several important points. First, exposure simulation can perform very poorly under certain realistic scenarios. Second, the relative performance of the different methods depends on the nature of the underlying exposure surface. Third, traditional measurement error concepts can help to explain the relative practical performance of the different methods. We apply the methods to data on the association between levels of particulate matter and birth weight in the greater Boston area.
Air pollution; Measurement error; Predictions; Spatial misalignment
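The practical distinction between the Berkson-type error the authors describe and classical error is easy to verify in a simple linear model: classical error attenuates the naive slope, while pure Berkson error leaves it unbiased (only noisier). A quick check with invented numbers:

```python
import numpy as np

rng = np.random.default_rng(3)
n, beta, sigma_u = 50_000, 1.5, 0.7

# Classical error: W = X + U; regressing Y on W attenuates the slope
x = rng.normal(0.0, 1.0, n)
w_classical = x + rng.normal(0.0, sigma_u, n)
y = beta * x + rng.normal(0.0, 1.0, n)
b_classical = np.cov(w_classical, y)[0, 1] / np.var(w_classical, ddof=1)

# Berkson error: X = W + U (truth scatters around the smoothed prediction W);
# regressing Y on W leaves the slope unbiased
w = rng.normal(0.0, 1.0, n)
x_berkson = w + rng.normal(0.0, sigma_u, n)
y_berkson = beta * x_berkson + rng.normal(0.0, 1.0, n)
b_berkson = np.cov(w, y_berkson)[0, 1] / np.var(w, ddof=1)
```

The nondiagonal error structure induced by spatial smoothing complicates this picture, which is why the paper's comparisons matter; the sketch only shows the two pure cases.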
Many regression analyses involve explanatory variables that are measured with error, and failing to account for this error is well known to lead to biased point and interval estimates of the regression coefficients. We present here a new general method for adjusting for covariate error. Our method consists of an approximate version of the Stefanski-Nakamura corrected score approach, using the method of regularization to obtain an approximate solution of the relevant integral equation. We develop the theory in the setting of classical likelihood models; this setting covers, for example, linear regression, nonlinear regression, logistic regression, and Poisson regression. The method is extremely general in terms of the types of measurement error models covered, and is a functional method in the sense of not involving assumptions on the distribution of the true covariate. We discuss the theoretical properties of the method and present simulation results in the logistic regression setting (univariate and multivariate). For illustration, we apply the method to data from the Harvard Nurses’ Health Study concerning the relationship between physical activity and breast cancer mortality in the period following a diagnosis of breast cancer.
Errors in variables; nonlinear models; logistic regression; integral equations
Li and Tiwari (2008) recently developed a corrected Z-test statistic for comparing the trends in cancer age-adjusted mortality and incidence rates across overlapping geographic regions, by properly adjusting for the correlation between the slopes of the fitted simple linear regression equations. One of their key assumptions is that the errors have an unknown but common variance. However, since the age-adjusted rates are linear combinations of mortality or incidence counts, arising naturally from an underlying Poisson process, this constant variance assumption may be violated. This paper develops a weighted-least-squares based test that incorporates heteroscedastic error variances, and thus significantly extends the work of Li and Tiwari. The proposed test generally outperforms the aforementioned test through simulations and through application to the age-adjusted mortality data from the Surveillance, Epidemiology, and End Results (SEER) Program of the National Cancer Institute.
Age-adjusted cancer rates; annual percent change (APC); cancer surveillance; trends; weighted least squares estimation; hypothesis testing
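Because each rate is a scaled Poisson count, its variance can be estimated by plug-in and inverted into a weight. A stripped-down sketch for a single (unadjusted) rate series with invented populations and rates:

```python
import numpy as np

# Invented mortality data: rates per 100,000 from underlying Poisson counts
rng = np.random.default_rng(5)
years = np.arange(2000, 2015, dtype=float)
pop = np.linspace(2e6, 8e6, years.size)          # population at risk
rate_true = 20.0 + 0.5 * (years - 2000.0)        # true rate per 100,000
counts = rng.poisson(rate_true * pop / 1e5)
rates = counts / pop * 1e5
var_rates = counts / pop**2 * 1e10               # plug-in Poisson variance

# Weighted least squares with weights 1 / Var(rate_i)
w = 1.0 / var_rates
X = np.column_stack([np.ones_like(years), years - 2000.0])
WX = X * w[:, None]
beta = np.linalg.solve(X.T @ WX, WX.T @ rates)   # [intercept, slope]
```

An age-adjusted rate would be a weighted sum of such stratum-specific rates, with the variance formula adjusted accordingly.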
Often important confounders are not available in studies. Sensitivity analyses based on the relation of single, but not multiple, unmeasured confounders with an exposure of interest in a separate validation study have been proposed. The authors controlled for measured confounding in the main cohort using propensity scores (PS) and addressed unmeasured confounding by estimating two additional PS in a validation study. The ‘error-prone’ PS exclusively used information available in the main cohort. The ‘gold-standard’ PS additionally included covariates available only in the validation study. Based on these two PS in the validation study, regression calibration was applied to adjust regression coefficients. This propensity score calibration (PSC) adjusts for unmeasured confounding in cohort studies with validation data under certain, usually untestable, assumptions. PSC was used to assess nonsteroidal antiinflammatory drugs (NSAID) and 1-year mortality in a large cohort of elderly patients. ‘Traditional’ adjustment resulted in a relative risk (RR) in NSAID users of 0.80 (95% confidence interval: 0.77–0.83) compared to an unadjusted RR of 0.68 (0.66–0.71). Application of PSC resulted in a more plausible RR of 1.06 (1.00–1.12). Until validity and limitations of PSC have been assessed in different settings, the method should be seen as a sensitivity analysis.
epidemiologic methods; research design; confounding factors (epidemiology); bias (epidemiology); cohort studies; propensity score calibration; AUC, area under the receiver operating characteristic curve; CI, confidence interval; NSAID, nonsteroidal antiinflammatory drug; OR, odds ratio; PS, propensity score; PSC, propensity score calibration; RR, relative risk
The existing theory of the wild bootstrap has focused on linear estimators. In this note, we broaden its validity by providing a class of weight distributions that is asymptotically valid for quantile regression estimators. As most weight distributions in the literature lead to biased variance estimates for nonlinear estimators of linear regression, we propose a modification of the wild bootstrap that admits a broader class of weight distributions for quantile regression. A simulation study on median regression is carried out to compare various bootstrap methods. With a simple finite-sample correction, the wild bootstrap is shown to account for general forms of heteroscedasticity in a regression model with fixed design points.
Bahadur representation; Heteroscedastic error; Quantile regression; Wild bootstrap
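The wild bootstrap itself is short to implement. The sketch below uses the OLS slope (a linear estimator) rather than a quantile estimator, since median regression would require an LP solver; Rademacher weights are the simplest two-point weight distribution:

```python
import numpy as np

# Invented heteroscedastic regression: error SD proportional to x
rng = np.random.default_rng(11)
n, B = 400, 1000
x = rng.uniform(0.0, 2.0, n)
y = 1.0 + 2.0 * x + x * rng.normal(0.0, 1.0, n)

X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

# Wild bootstrap: multiply each residual by an independent Rademacher weight,
# preserving its magnitude and hence the heteroscedasticity pattern
slopes = np.empty(B)
for i in range(B):
    v = rng.choice([-1.0, 1.0], size=n)
    y_star = X @ beta_hat + v * resid
    slopes[i] = np.linalg.lstsq(X, y_star, rcond=None)[0][1]
se_wild = slopes.std(ddof=1)
```

Because each bootstrap residual keeps the magnitude of the original one, the resampled data reproduce the variance pattern of the design points, which is what makes the wild bootstrap valid under heteroscedasticity.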
This paper considers the problem of estimation in a general semiparametric regression model when error-prone covariates are modeled parametrically while covariates measured without error are modeled nonparametrically. To account for the effects of measurement error, we apply a correction to a criterion function. The specific form of the correction proposed allows Monte Carlo simulations in problems for which the direct calculation of a corrected criterion is difficult. Therefore, in contrast to methods that require solving integral equations of possibly multiple dimensions, as in the case of multiple error-prone covariates, we propose methodology which offers a simple implementation. The resulting methods are functional, in the sense that they make no assumptions about the distribution of the mismeasured covariates. We utilize profile kernel and backfitting estimation methods and derive the asymptotic distribution of the resulting estimators. Through numerical studies we demonstrate the applicability of proposed methods to Poisson, logistic and multivariate Gaussian partially linear models. We show that the performance of our methods is similar to a computationally demanding alternative. Finally, we demonstrate the practical value of our methods when applied to Nevada Test Site (NTS) Thyroid Disease Study data.
Generalized estimating equations; generalized linear mixed models; kernel method; measurement error; Monte Carlo Corrected Score; semiparametric regression
The use of the cumulative average model to investigate the association between disease incidence and repeated measurements of exposures in medical follow-up studies dates back to the 1960s (Kahn and Dawber, J Chron Dis 19:611–620, 1966). This model takes advantage of all prior data and thus should provide a statistically more powerful test of disease-exposure associations. Measurement error in covariates is common for medical follow-up studies. Many methods have been proposed to correct for measurement error. To the best of our knowledge, no methods have been proposed yet to correct for measurement error in the cumulative average model. In this article, we propose a regression calibration approach to correct relative risk estimates for measurement error. The approach is illustrated with data from the Nurses’ Health Study relating incident breast cancer between 1980 and 2002 to time-dependent measures of calorie-adjusted saturated fat intake, controlling for total caloric intake, alcohol intake, and baseline age.
Measurement error; Regression calibration; Nutritional data
We develop a novel empirical Bayesian framework for the semiparametric additive hazards regression model. The integrated likelihood, obtained by integration over the unknown prior of the nonparametric baseline cumulative hazard, can be maximized using standard statistical software. Unlike the corresponding full Bayes method, our empirical Bayes estimators of regression parameters, survival curves and their corresponding standard errors have easily computed closed-form expressions and require no elicitation of hyperparameters of the prior. The method guarantees a monotone estimator of the survival function and accommodates time-varying regression coefficients and covariates. To facilitate frequentist-type inference based on large-sample approximation, we present the asymptotic properties of the semiparametric empirical Bayes estimates. We illustrate the implementation and advantages of our methodology with a reanalysis of a survival dataset and a simulation study.
Gamma process; Integrated likelihood; Posterior process
Regression calibration (RC) is a popular method for estimating regression coefficients when one or more continuous explanatory variables, X, are measured with error. In this method, the mismeasured covariate, W, is substituted by the expectation E(X|W), based on the assumption that the error in the measurement of X is non-differential. Using simulations, we compare three versions of RC with two other ‘substitution’ methods, moment reconstruction (MR) and imputation (IM), neither of which rely on the non-differential error assumption. We investigate studies that have an internal calibration sub-study. For RC, we consider (i) the usual version of RC, (ii) RC applied only to the ‘marker’ information in the calibration study, and (iii) an ‘efficient’ version (ERC) in which the estimators (i) and (ii) are combined. Our results show that ERC is preferable when there is non-differential measurement error. Under this condition, there are cases where ERC is less efficient than MR or IM, but such cases rarely occur in epidemiology. We show that the efficiency gain of usual RC and ERC over the other methods can sometimes be dramatic. The usual version of RC carries similar efficiency gains to ERC over MR and IM, but becomes unstable as measurement error becomes large, leading to bias and poor precision. When differential measurement error does pertain, then MR and IM have considerably less bias than RC, but can have much larger variance. We demonstrate our findings with an analysis of dietary fat intake and mortality in a large cohort study.
differential measurement error; moment reconstruction; multiple imputation; non-differential measurement error; regression calibration