We consider the problem of high-dimensional regression under non-constant error variances. Despite being a common phenomenon in biological applications, heteroscedasticity has, so far, been largely ignored in high-dimensional analysis of genomic data sets. We propose a new methodology that allows non-constant error variances for high-dimensional estimation and model selection. Our method incorporates heteroscedasticity by simultaneously modeling both the mean and variance components via a novel doubly regularized approach. Extensive Monte Carlo simulations indicate that our proposed procedure can result in better estimation and variable selection than existing methods when heteroscedasticity arises from the presence of predictors explaining error variances and outliers. Further, we demonstrate the presence of heteroscedasticity in and apply our method to an expression quantitative trait loci (eQTLs) study of 112 yeast segregants. The new procedure can automatically account for heteroscedasticity in identifying the eQTLs that are associated with gene expression variations and lead to smaller prediction errors. These results demonstrate the importance of considering heteroscedasticity in eQTL data analysis.
Generalized least squares; Heteroscedasticity; Large p small n; Model selection; Sparse regression; Variance estimation
Litter decomposition rate (k) is typically estimated from proportional litter mass loss data using models that assume constant, normally distributed errors. However, such data often show non-normal errors with reduced variance near bounds (0 or 1), potentially leading to biased k estimates. We compared the performance of nonlinear regression using the beta distribution, which is well-suited to bounded data and this type of heteroscedasticity, to standard nonlinear regression (normal errors) on simulated and real litter decomposition data. Although the beta model often provided better fits to the simulated data (based on the corrected Akaike Information Criterion, AICc), standard nonlinear regression was robust to violation of homoscedasticity and gave equally or more accurate k estimates as nonlinear beta regression. Our simulation results also suggest that k estimates will be most accurate when study length captures mid to late stage decomposition (50–80% mass loss) and the number of measurements through time is ≥5. Regression method and data transformation choices had the smallest impact on k estimates during mid and late stage decomposition. Estimates of k were more variable among methods and generally less accurate during early and end stage decomposition. With real data, neither model was predominately best; in most cases the models were indistinguishable based on AICc, and gave similar k estimates. However, when decomposition rates were high, normal and beta model k estimates often diverged substantially. Therefore, we recommend a pragmatic approach where both models are compared and the best is selected for a given data set. Alternatively, both models may be used via model averaging to develop weighted parameter estimates. We provide code to perform nonlinear beta regression with freely available software.
Regression calibration as a method for handling measurement error is becoming increasingly well-known and used in epidemiologic research. However, the standard version of the method is not appropriate for exposure analyzed on a categorical (e.g. quintile) scale, an approach commonly used in epidemiologic studies. A tempting solution could then be to use the predicted continuous exposure obtained through the regression calibration method and treat it as an approximation to the true exposure, that is, include the categorized calibrated exposure in the main regression analysis.
We use semi-analytical calculations and simulations to evaluate the performance of the proposed approach compared to the naive approach of not correcting for measurement error, in situations where analyses are performed on quintile scale and when incorporating the original scale into the categorical variables, respectively. We also present analyses of real data, containing measures of folate intake and depression, from the Norwegian Women and Cancer study (NOWAC).
In cases where extra information is available through replicated measurements and not validation data, regression calibration does not maintain important qualities of the true exposure distribution, thus estimates of variance and percentiles can be severely biased. We show that the outlined approach maintains much, in some cases all, of the misclassification found in the observed exposure. For that reason, regression analysis with the corrected variable included on a categorical scale is still biased. In some cases the corrected estimates are analytically equal to those obtained by the naive approach. Regression calibration is however vastly superior to the naive method when applying the medians of each category in the analysis.
Regression calibration in its most well-known form is not appropriate for measurement error correction when the exposure is analyzed on a percentile scale. Relating back to the original scale of the exposure solves the problem. The conclusion regards all regression models.
We investigate methods for regression analysis when covariates are measured with errors. In a subset of the whole cohort, a surrogate variable is available for the true unobserved exposure variable. The surrogate variable satisfies the classical measurement error model, but it may not have repeated measurements. In addition to the surrogate variables that are available among the subjects in the calibration sample, we assume that there is an instrumental variable (IV) that is available for all study subjects. An IV is correlated with the unobserved true exposure variable and hence can be useful in the estimation of the regression coefficients. We propose a robust best linear estimator that uses all the available data, which is the most efficient among a class of consistent estimators. The proposed estimator is shown to be consistent and asymptotically normal under very weak distributional assumptions. For Poisson or linear regression, the proposed estimator is consistent even if the measurement error from the surrogate or IV is heteroscedastic. Finite-sample performance of the proposed estimator is examined and compared with other estimators via intensive simulation studies. The proposed method and other methods are applied to a bladder cancer case–control study.
Calibration sample; Estimating equation; Heteroscedastic measurement error; Nonparametric correction
Toxicologists and pharmacologists often describe toxicity of a chemical using parameters of a nonlinear regression model. Thus estimation of parameters of a nonlinear regression model is an important problem. The estimates of the parameters and their uncertainty estimates depend upon the underlying error variance structure in the model. Typically, a priori the researcher would know if the error variances are homoscedastic (i.e., constant across dose) or if they are heteroscedastic (i.e., the variance is a function of dose). Motivated by this concern, in this article we introduce an estimation procedure based on preliminary test which selects an appropriate estimation procedure accounting for the underlying error variance structure. Since outliers and influential observations are common in toxicological data, the proposed methodology uses M-estimators. The asymptotic properties of the preliminary test estimator are investigated; in particular its asymptotic covariance matrix is derived. The performance of the proposed estimator is compared with several standard estimators using simulation studies. The proposed methodology is also illustrated using a data set obtained from the National Toxicology Program.
Asymptotic normality; Dose-response study; Heteroscedasticity; Hill model; M-estimation procedure; Preliminary test estimation; Toxicology
In a cocaine dependence treatment study, we use linear and nonlinear regression models to model posttreatment cocaine craving scores and first cocaine relapse time. A subset of the covariates are summary statistics derived from baseline daily cocaine use trajectories, such as baseline cocaine use frequency and average daily use amount. These summary statistics are subject to estimation error and can therefore cause biased estimators for the regression coefficients. Unlike classical measurement error problems, the error we encounter here is heteroscedastic with an unknown distribution, and there are no replicates for the error-prone variables or instrumental variables. We propose two robust methods to correct for the bias: a computationally efficient method-of-moments-based method for linear regression models and a subsampling extrapolation method that is generally applicable to both linear and nonlinear regression models. Simulations and an application to the cocaine dependence treatment data are used to illustrate the efficacy of the proposed methods. Asymptotic theory and variance estimation for the proposed subsampling extrapolation method and some additional simulation results are described in the online supplementary material.
Bias correction; Method-of-moments correction; Subsampling extrapolation
Preterm birth, defined as delivery before 37 completed weeks’ gestation, is a leading cause of infant morbidity and mortality. Identifying factors related to preterm delivery is an important goal of public health professionals who wish to identify etiologic pathways to target for prevention. Validation studies are often conducted in nutritional epidemiology in order to study measurement error in instruments that are generally less invasive or less expensive than ”gold standard” instruments. Data from such studies are then used in adjusting estimates based on the full study sample. However, measurement error in nutritional epidemiology has recently been shown to be complicated by correlated error structures in the study-wide and validation instruments. Investigators of a study of preterm birth and dietary intake designed a validation study to assess measurement error in a food frequency questionnaire (FFQ) administered during pregnancy and with the secondary goal of assessing whether a single administration of the FFQ could be used to describe intake over the relatively short pregnancy period, in which energy intake typically increases. Here, we describe a likelihood-based method via Markov Chain Monte Carlo to estimate the regression coefficients in a generalized linear model relating preterm birth to covariates, where one of the covariates is measured with error and the multivariate measurement error model has correlated errors among contemporaneous instruments (i.e. FFQs, 24-hour recalls, and/or biomarkers). Because of constraints on the covariance parameters in our likelihood, identifiability for all the variance and covariance parameters is not guaranteed and, therefore, we derive the necessary and suficient conditions to identify the variance and covariance parameters under our measurement error model and assumptions. We investigate the sensitivity of our likelihood-based model to distributional assumptions placed on the true folate intake by employing semi-parametric Bayesian methods through the mixture of Dirichlet process priors framework. We exemplify our methods in a recent prospective cohort study of risk factors for preterm birth. We use long-term folate as our error-prone predictor of interest, the food-frequency questionnaire (FFQ) and 24-hour recall as two biased instruments, and serum folate biomarker as the unbiased instrument. We found that folate intake, as measured by the FFQ, led to a conservative estimate of the estimated odds ratio of preterm birth (0.76) when compared to the odds ratio estimate from our likelihood-based approach, which adjusts for the measurement error (0.63). We found that our parametric model led to similar conclusions to the semi-parametric Bayesian model.
Adaptive-Rejection Sampling; Dirichlet process prior; MCMC; Semiparametric Bayes
Numerous regression approaches to isotherm parameters estimation appear in the literature. The real insight into the proper modeling pattern can be achieved only by testing methods on a very big number of cases. Experimentally, it cannot be done in a reasonable time, so the Monte Carlo simulation method was applied. The objective of this paper is to introduce and compare numerical approaches that involve different levels of knowledge about the noise structure of the analytical method used for initial and equilibrium concentration determination. Six levels of homoscedastic noise and five types of heteroscedastic noise precision models were considered. Performance of the methods was statistically evaluated based on median percentage error and mean absolute relative error in parameter estimates. The present study showed a clear distinction between two cases. When equilibrium experiments are performed only once, for the homoscedastic case, the winning error function is ordinary least squares, while for the case of heteroscedastic noise the use of orthogonal distance regression or Margart's percent standard deviation is suggested. It was found that in case when experiments are repeated three times the simple method of weighted least squares performed as well as more complicated orthogonal distance regression method.
Regression calibration (RC) is a popular method for estimating regression coefficients when one or more continuous explanatory variables, X, are measured with an error. In this method, the mismeasured covariate, W, is substituted by the expectation E(X|W), based on the assumption that the error in the measurement of X is non-differential. Using simulations, we compare three versions of RC with two other ‘substitution’ methods, moment reconstruction (MR) and imputation (IM), neither of which rely on the non-differential error assumption. We investigate studies that have an internal calibration sub-study. For RC, we consider (i) the usual version of RC, (ii) RC applied only to the ‘marker’ information in the calibration study, and (iii) an ‘efficient’ version (ERC) in which the estimators (i) and (ii) are combined. Our results show that ERC is preferable when there is non-differential measurement error. Under this condition, there are cases where ERC is less efficient than MR or IM, but they rarely occur in epidemiology. We show that the efficiency gain of usual RC and ERC over the other methods can sometimes be dramatic. The usual version of RC carries similar efficiency gains to ERC over MR and IM, but becomes unstable as measurement error becomes large, leading to bias and poor precision. When differential measurement error does pertain, then MR and IM have considerably less bias than RC, but can have much larger variance. We demonstrate our findings with an analysis of dietary fat intake and mortality in a large cohort study.
differential measurement error; moment reconstruction; multiple imputation; non-differential measurement error; regression calibration
Spatial data with covariate measurement errors have been commonly observed in public health studies. Existing work mainly concentrates on parameter estimation using Gibbs sampling, and no work has been conducted to understand and quantify the theoretical impact of ignoring measurement error on spatial data analysis in the form of the asymptotic biases in regression coefficients and variance components when measurement error is ignored. Plausible implementations, from frequentist perspectives, of maximum likelihood estimation in spatial covariate measurement error models are also elusive. In this paper, we propose a new class of linear mixed models for spatial data in the presence of covariate measurement errors. We show that the naive estimators of the regression coefficients are attenuated while the naive estimators of the variance components are inflated, if measurement error is ignored. We further develop a structural modeling approach to obtaining the maximum likelihood estimator by accounting for the measurement error. We study the large sample properties of the proposed maximum likelihood estimator, and propose an EM algorithm to draw inference. All the asymptotic properties are shown under the increasing-domain asymptotic framework. We illustrate the method by analyzing the Scottish lip cancer data, and evaluate its performance through a simulation study, all of which elucidate the importance of adjusting for covariate measurement errors.
Measurement error; Spatial data; Structural modeling; Variance components; Asymptotic bias; Consistency and asymptotic normality; Increasing domain asymptotics; EM algorithm
Online hearing tests conducted in home settings on a personal computer (PC) require prior calibration. Biological calibration consists of approximating the reference sound level via the hearing threshold of a person with normal hearing.
The objective of this study was to identify the error of the proposed methods of biological calibration, their duration, and the subjective difficulty in conducting these tests via PC.
Seven methods have been proposed for measuring the calibration coefficients. All measurements were performed in reference to the hearing threshold of a normal-hearing person. Three methods were proposed for determining the reference sound level on the basis of these calibration coefficients. Methods were compared for the estimated error, duration, and difficulty of the calibration. Web-based self-assessed measurements of the calibration coefficients were carried out in 3 series: (1) at a otolaryngology clinic, (2) at the participant’s home, and (3) again at the clinic. Additionally, in series 1 and 3, pure-tone audiometry was conducted and series 3 was followed by an offline questionnaire concerning the difficulty of the calibration. Participants were recruited offline from coworkers of the Department and Clinic of Otolaryngology, Wroclaw Medical University, Poland.
All 25 participants, aged 22-35 years (median 27) completed all tests and filled in the questionnaire. The smallest standard deviation of the calibration coefficient in the test-retest measurement was obtained at the level of 3.87 dB (95% CI 3.52-4.29) for the modulated signal presented in accordance with the rules of Bekesy’s audiometry. The method is characterized by moderate duration time and a relatively simple procedure. The simplest and shortest method was the method of self-adjustment of the sound volume to the barely audible level. In the test-retest measurement, the deviation of this method equaled 4.97 dB (95% CI 4.53-5.51). Among methods determining the reference sound level, the levels determined independently for each frequency revealed the smallest error. The estimated standard deviations of the difference in the hearing threshold between the examination conducted on a biologically calibrated PC and pure-tone audiometry varied from 7.27 dB (95% CI 6.71-7.93) to 10.38 dB (95% CI 9.11-12.03), depending on the calibration method.
In this study, an analysis of biological calibration was performed and the presented results included calibration error, calibration time, and calibration difficulty. These values determine potential applications of Web-based hearing tests conducted in home settings and are decisive factors when selecting the calibration method. If there are no substantial time limitations, it is advisable to use Bekesy method and determine the reference sound level independently at each frequency because this approach is characterized by the lowest error.
pure-tone audiometry; computer-assisted instruction; self-examination
Motivation: Immunoassays are primary diagnostic and research tools throughout the medical and life sciences. The common approach to the processing of immunoassay data involves estimation of the calibration curve followed by inversion of the calibration function to read off the concentration estimates. This approach, however, does not lend itself easily to acceptable estimation of confidence limits on the estimated concentrations. Such estimates must account for uncertainty in the calibration curve as well as uncertainty in the target measurement. Even point estimates can be problematic: because of the non-linearity of calibration curves and error heteroscedasticity, the neglect of components of measurement error can produce significant bias.
Methods: We have developed a Bayesian approach for the estimation of concentrations from immunoassay data that treats the propagation of measurement error appropriately. The method uses Markov Chain Monte Carlo (MCMC) to approximate the posterior distribution of the target concentrations and numerically compute the relevant summary statistics. Software implementing the method is freely available for public use.
Results: The new method was tested on both simulated and experimental datasets with different measurement error models. The method outperformed the common inverse method on samples with large measurement errors. Even in cases with extreme measurements where the common inverse method failed, our approach always generated reasonable estimates for the target concentrations.
Availability: Project name: Baecs; Project home page: www.computationalimmunology.org/utilities/; Operating systems: Linux, MacOS X and Windows; Programming language: C++; License: Free for Academic Use.
Supplementary data are available at Bioinformatics online.
Measurement error is common in epidemiological and biomedical studies. When biomarkers are measured in batches or groups, measurement error is potentially correlated within each batch or group. In regression analysis, most existing methods are not applicable in the presence of batch-specific measurement error in predictors. We propose a robust conditional likelihood approach to account for batch-specific error in predictors when batch effect is additive and the predominant source of error, which requires no assumptions on the distribution of measurement error. While a regression model with batch as a categorical covariable yields the same parameter estimates as the proposed conditional likelihood approach for linear regression, this result does not hold in general for all generalized linear models, in particular, logistic regression. Our simulation studies show that the conditional likelihood approach achieves better finite sample performance than the regression calibration approach or a naive approach without adjustment for measurement error. In the case of logistic regression, our proposed approach is shown to also outperform the regression approach with batch as a categorical covariate. In addition, we also examine a “hybrid” approach combining the conditional likelihood method and the regression calibration method, which is shown in simulations to achieve good performance in the presence of both batch-specific and measurement-specific error. We illustrate our method using data from a colorectal adenoma study.
Batch-specific error; Biomarker; Conditional likelihood; Exponential family; Generalized linear models; Robust method
Occupational, environmental, and nutritional epidemiologists are often interested in estimating the prospective effect of time-varying exposure variables such as cumulative exposure or cumulative updated average exposure, in relation to chronic disease endpoints such as cancer incidence and mortality. From exposure validation studies, it is apparent that many of the variables of interest are measured with moderate to substantial error. Although the ordinary regression calibration approach is approximately valid and efficient for measurement error correction of relative risk estimates from the Cox model with time-independent point exposures when the disease is rare, it is not adaptable for use with time-varying exposures. By re-calibrating the measurement error model within each risk set, a risk set regression calibration method is proposed for this setting. An algorithm for a bias-corrected point estimate of the relative risk using an RRC approach is presented, followed by the derivation of an estimate of its variance, resulting in a sandwich estimator. Emphasis is on methods applicable to the main study/external validation study design, which arises in important applications. Simulation studies under several assumptions about the error model were carried out, which demonstrated the validity and efficiency of the method in finite samples. The method was applied to a study of diet and cancer from Harvard’s Health Professionals Follow-up Study (HPFS).
Cox proportional hazards model; Measurement error; Risk set regression calibration; Time-varying covariates
Audits are often performed to assess the quality of clinical trial data, but beyond detecting fraud or sloppiness, the audit data is generally ignored. In earlier work using data from a non-randomized study, Shepherd and Yu (2011) developed statistical methods to incorporate audit results into study estimates, and demonstrated that audit data could be used to eliminate bias.
In this manuscript we examine the usefulness of audit-based error-correction methods in clinical trial settings where a continuous outcome is of primary interest.
We demonstrate the bias of multiple linear regression estimates in general settings with an outcome that may have errors and a set of covariates for which some may have errors and others, including treatment assignment, are recorded correctly for all subjects. We study this bias under different assumptions including independence between treatment assignment, covariates, and data errors (conceivable in a double-blinded randomized trial) and independence between treatment assignment and covariates but not data errors (possible in an unblinded randomized trial). We review moment-based estimators to incorporate the audit data and propose new multiple imputation estimators. The performance of estimators is studied in simulations.
When treatment is randomized and unrelated to data errors, estimates of the treatment effect using the original error-prone data (i.e., ignoring the audit results) are unbiased. In this setting, both moment and multiple imputation estimators incorporating audit data are more variable than standard analyses using the original data. In contrast, in settings where treatment is randomized but correlated with data errors and in settings where treatment is not randomized, standard treatment effect estimates will be biased. And in all settings, parameter estimates for the original, error-prone covariates will be biased. Treatment and covariate effect estimates can be corrected by incorporating audit data using either the multiple imputation or moment-based approaches. Bias, precision, and coverage of confidence intervals improve as the audit size increases.
The extent of bias and the performance of methods depend on the extent and nature of the error as well as the size of the audit. This work only considers methods for the linear model. Settings much different than those considered here need further study.
In randomized trials with continuous outcomes and treatment assignment independent of data errors, standard analyses of treatment effects will be unbiased and are recommended. However, if treatment assignment is correlated with data errors or other covariates, naive analyses may be biased. In these settings, and when covariate effects are of interest, approaches for incorporating audit results should be considered.
audit; bias; clinical trials; measurement error; multiple imputation
The basis for this study is the fact that instrument error increases the variance of the distribution of body mass index (BMI). Combined with a defined cut-off value this may impact upon the estimated proportion of overweight and obesity. It is important to ensure high quality surveillance data in order to follow trends of estimated prevalence of overweight and obesity. The purpose of the study was to assess the impact of instrument error, due to uncalibrated scales and stadiometers, on prevalence estimates of overweight and obesity.
Anthropometric measurements from a nationally representative sample were used; the Norwegian Child Growth study (NCG) of 3474 children. Each of the 127 participating schools received a reference weight and a reference length to determine the correction value. Correction value corresponds to instrument error and is the difference between the true value and the measured, uncorrected weight and height at local scales and stadiometers. Simulations were used to determine the expected implications of instrument errors. To systematically investigate this, the coefficient of variation (CV) of instrument error was used in the simulations and was increased successively.
Simulations showed that the estimated prevalence of overweight and obesity increased systematically with the size of instrument error when the mean instrument error was zero. The estimated prevalence was 16.4% with no instrument error and was, on average, overestimated by 0.5 percentage points based on observed variance of instrument error from the NCG-study. Further, the estimated prevalence was 16.7% with 1% CV of instrument error, and increased to 17.8%, 19.5% and 21.6% with 2%, 3% and 4% CV of instrument error, respectively.
Failure to calibrate measuring instruments is likely to lead to overestimation of the prevalence of overweight and obesity in population-based surveys.
Instrument error; Calibration; Anthropometry; Weights and measures; Obesity; Overweight; Epidemiology
Summary. It is widely believed that risks of many complex diseases are determined by genetic susceptibilities, environmental exposures, and their interaction. Chatterjee and Carroll (2005, Biometrika 92, 399–418) developed an efficient retrospective maximum-likelihood method for analysis of case–control studies that exploits an assumption of gene–environment independence and leaves the distribution of the environmental covariates to be completely nonparametric. Spinka, Carroll, and Chatterjee (2005, Genetic Epidemiology 29, 108–127) extended this approach to studies where certain types of genetic information, such as haplotype phases, may be missing on some subjects. We further extend this approach to situations when some of the environmental exposures are measured with error. Using a polychotomous logistic regression model, we allow disease status to have K + 1 levels. We propose use of a pseudolikelihood and a related EM algorithm for parameter estimation. We prove consistency and derive the resulting asymptotic covariance matrix of parameter estimates when the variance of the measurement error is known and when it is estimated using replications. Inferences with measurement error corrections are complicated by the fact that the Wald test often behaves poorly in the presence of large amounts of measurement error. The likelihood-ratio (LR) techniques are known to be a good alternative. However, the LR tests are not technically correct in this setting because the likelihood function is based on an incorrect model, i.e., a prospective model in a retrospective sampling scheme. We corrected standard asymptotic results to account for the fact that the LR test is based on a likelihood-type function. The performance of the proposed method is illustrated using simulation studies emphasizing the case when genetic information is in the form of haplotypes and missing data arises from haplotype-phase ambiguity. An application of our method is illustrated using a population-based case–control study of the association between calcium intake and the risk of colorectal adenoma.
EM algorithm; Errors in variables; Gene-environment independence; Gene-environment interactions; Likelihood-ratio tests in misspecified models; Inferences in measurement error models; Profile likelihood; Semiparametric methods
AIMS: Errors in reporting International Normalised Ratios (INR) may be corrected by assignment of a System International Sensitivity Index (System ISI). This 57 centre study tests the validity of several procedures for INR correction. METHODS: Prothrombin times of eight lyophilised coumarin calibrants, a lyophilised normal pool calibrant, and eight frozen coumarin plasmas were determined at each centre. The calibrants were calibrated using international reference preparations. The eight frozen coumarin plasmas were calibrated in a four centre international exercise. The relations tested were: (a) the logarithm of local prothrombin time against the logarithm of reference prothrombin time; (b) reference INR against local prothrombin time; and (c) logarithm of reference INR against logarithm of local prothrombin time. These methods were analysed by both linear and orthogonal regression. RESULTS: All system groups required correction, the mean percentage deviation of the uncorrected data from the calibrated values was 19.0%. There was also considerable variation in INR, with the coefficient of variance (CV) ranging from 11.30 to 17.29%. Correction of INR was possible with all methods (CV reduced to < 7%). However, only when a plot of the logarithm of local prothrombin time against the logarithm of reference prothrombin time was fitted by orthogonal regression, or a plot of logarithm of reference INR against logarithm of local prothrombin time was fitted by either type of regression analysis, did the best fit line through the calibrant plasmas also pass close to the local mean normal prothrombin time. CONCLUSIONS: While INR correction may be achieved by all the above methods, that relating log reference INR to log local prothrombin time by linear regression analysis is the simplest to perform.
Multivariate local polynomial fitting is applied to the multivariate linear heteroscedastic regression model. Firstly, the local polynomial fitting is applied to estimate heteroscedastic function, then the coefficients of regression model are obtained by using generalized least squares method. One noteworthy feature of our approach is that we avoid the testing for heteroscedasticity by improving the traditional two-stage method. Due to non-parametric technique of local polynomial estimation, it is unnecessary to know the form of heteroscedastic function. Therefore, we can improve the estimation precision, when the heteroscedastic function is unknown. Furthermore, we verify that the regression coefficients is asymptotic normal based on numerical simulations and normal Q-Q plots of residuals. Finally, the simulation results and the local polynomial estimation of real data indicate that our approach is surely effective in finite-sample situations.
New technology introduced over time results in changes in densitometers during longitudinal studies of bone mineral density (BMD). This requires that a cross-calibration process be completed to translate measurements from the old densitometer to the new one. Previously described cross-calibration methods for research settings have collected single measures on each densitometer and used linear regression to estimate cross-calibration corrections. Thus, these methods may produce corrections that have limited precision and underestimate the variability in converted BMD values. Furthermore, most prior studies have included small samples recruited from specialized populations. Increasing the sample size, obtaining multiple measures on each machine, and utilizing linear mixed models to account for between- and within-subject variability may improve cross-calibration estimates. The purpose of this study was to conduct an in vivo cross-calibration of a Lunar DPX-L with a Lunar Prodigy densitometer using a sample of 249 healthy volunteers who were scanned twice on each densitometer, without repositioning, at both the femur and spine. Scans were analyzed using both automated and manual placement of regions of interest. Wilcoxon rank-sum tests and Bland-Altman plots were used to examine possible differences between repeat scans within and across densitometers. We used linear mixed models to determine the cross-calibration equations for the femoral neck, trochanter, total hip and lumbar spine (L2-L4) regions. Results using automated and manual placement of the regions of interest did not differ significantly The DPX–L exhibited larger median absolute differences in repeat scans for femoral neck [0.016 vs. 0.012, p=0.1] and trochanter [0.011 vs. 0.009, p=0.06] BMD values compared to the Prodigy. The Bland-Altman plots revealed no statistically significant linear relation between the difference in paired measures between machines and mean BMD. In our large sample of healthy volunteers we did detect systematic differences between the DPX-L and Prodigy densitometers. Our proposed cross-calibration method, which includes acquiring multiple measures and using linear mixed models, provides researchers with a more realistic estimate of the variance of cross-calibrated BMD measures, potentially reducing the chance of making a type I error in longitudinal studies of changes in BMD.
cross-calibration; densitometer; bone mineral density; DXA; mixed models; Framingham Osteoporosis Study
We consider Cox proportional hazards regression when the covariate vector includes error-prone discrete covariates along with error-free covariates, which may be discrete or continuous. The misclassification in the discrete error-prone covariates is allowed to be of any specified form. Building on the work of Nakamura and his colleagues, we present a corrected score method for this setting. The method can handle all three major study designs (internal validation design, external validation design, and replicate measures design), both functional and structural error models, and time-dependent covariates satisfying a certain ‘localized error’ condition. We derive the asymptotic properties of the method and indicate how to adjust the covariance matrix of the regression coefficient estimates to account for estimation of the misclassification matrix. We present the results of a finite-sample simulation study under Weibull survival with a single binary covariate having known misclassification rates. The performance of the method described here was similar to that of related methods we have examined in previous works. Specifically, our new estimator performed as well as or, in a few cases, better than the full Weibull maximum likelihood estimator. We also present simulation results for our method for the case where the misclassification probabilities are estimated from an external replicate measures study. Our method generally performed well in these simulations. The new estimator has a broader range of applicability than many other estimators proposed in the literature, including those described in our own earlier work, in that it can handle time-dependent covariates with an arbitrary misclassification structure. We illustrate the method on data from a study of the relationship between dietary calcium intake and distal colon cancer.
errors in variables; nonlinear models; proportional hazards
The standard methods for regression analyses of clustered riverine larval habitat data of Simulium damnosum s.l. a major black-fly vector of Onchoceriasis, postulate models relating observational ecological-sampled parameter estimators to prolific habitats without accounting for residual intra-cluster error correlation effects. Generally, this correlation comes from two sources: (1) the design of the random effects and their assumed covariance from the multiple levels within the regression model; and, (2) the correlation structure of the residuals. Unfortunately, inconspicuous errors in residual intra-cluster correlation estimates can overstate precision in forecasted S.damnosum s.l. riverine larval habitat explanatory attributes regardless how they are treated (e.g., independent, autoregressive, Toeplitz, etc). In this research, the geographical locations for multiple riverine-based S. damnosum s.l. larval ecosystem habitats sampled from 2 pre-established epidemiological sites in Togo were identified and recorded from July 2009 to June 2010. Initially the data was aggregated into proc genmod. An agglomerative hierarchical residual cluster-based analysis was then performed. The sampled clustered study site data was then analyzed for statistical correlations using Monthly Biting Rates (MBR). Euclidean distance measurements and terrain-related geomorphological statistics were then generated in ArcGIS. A digital overlay was then performed also in ArcGIS using the georeferenced ground coordinates of high and low density clusters stratified by Annual Biting Rates (ABR). This data was overlain onto multitemporal sub-meter pixel resolution satellite data (i.e., QuickBird 0.61m wavbands ). Orthogonal spatial filter eigenvectors were then generated in SAS/GIS. Univariate and non-linear regression-based models (i.e., Logistic, Poisson and Negative Binomial) were also employed to determine probability distributions and to identify statistically significant parameter estimators from the sampled data. Thereafter, Durbin-Watson test statistics were used to test the null hypothesis that the regression residuals were not autocorrelated against the alternative that the residuals followed an autoregressive process in AUTOREG. Bayesian uncertainty matrices were also constructed employing normal priors for each of the sampled estimators in PROC MCMC. The residuals revealed both spatially structured and unstructured error effects in the high and low ABR-stratified clusters. The analyses also revealed that the estimators, levels of turbidity and presence of rocks were statistically significant for the high-ABR-stratified clusters, while the estimators distance between habitats and floating vegetation were important for the low-ABR-stratified cluster. Varying and constant coefficient regression models, ABR- stratified GIS-generated clusters, sub-meter resolution satellite imagery, a robust residual intra-cluster diagnostic test, MBR-based histograms, eigendecomposition spatial filter algorithms and Bayesian matrices can enable accurate autoregressive estimation of latent uncertainity affects and other residual error probabilities (i.e., heteroskedasticity) for testing correlations between georeferenced S. damnosum s.l. riverine larval habitat estimators. The asymptotic distribution of the resulting residual adjusted intra-cluster predictor error autocovariate coefficients can thereafter be established while estimates of the asymptotic variance can lead to the construction of approximate confidence intervals for accurately targeting productive S. damnosum s.l habitats based on spatiotemporal field-sampled count data.
Simulium damnosum s.l.; cluster covariates; QuickBird; onchoceriasis; annual biting rates; Bayesian; Togo
Data from many scientific areas often come with measurement error. Density or distribution function estimation from contaminated data and nonparametric regression with errors-in-variables are two important topics in measurement error models. In this paper, we present a new software package decon for
R, which contains a collection of functions that use the deconvolution kernel methods to deal with the measurement error problems. The functions allow the errors to be either homoscedastic or heteroscedastic. To make the deconvolution estimators computationally more efficient in
R, we adapt the fast Fourier transform algorithm for density estimation with error-free data to the deconvolution kernel estimation. We discuss the practical selection of the smoothing parameter in deconvolution methods and illustrate the use of the package through both simulated and real examples.
measurement error models; deconvolution; errors-in-variables problems; smoothing; kernel; faster Fourier transform; heteroscedastic errors; bandwidth selection
The statistical analysis of immunological data may be complicated because precise quantitative levels cannot always be determined. Values below a given detection limit may not be observed (nondetects), and data with nondetects are called left-censored. Since nondetects cannot be considered as missing at random, a statistician faced with data containing these nondetects must decide how to combine nondetects with detects. Till now, the common practice is to impute each nondetect with a single value such as a half of the detection limit, and to conduct ordinary regression analysis. The first aim of this paper is to give an overview of methods to analyze, and to provide new methods handling censored data other than an (ordinary) linear regression. The second aim is to compare these methods by simulation studies based on real data.
We compared six new and existing methods: deletion of nondetects, single substitution, extrapolation by regression on order statistics, multiple imputation using maximum likelihood estimation, tobit regression, and logistic regression. The deletion and extrapolation by regression on order statistics methods gave biased parameter estimates. The single substitution method underestimated variances, and logistic regression suffered loss of power. Based on simulation studies, we found that tobit regression performed well when the proportion of nondetects was less than 30%, and that taken together the multiple imputation method performed best.
Based on simulation studies, the newly developed multiple imputation method performed consistently well under different scenarios of various proportion of nondetects, sample sizes and even in the presence of heteroscedastic errors.
When study data are clustered, standard regression analysis is considered inappropriate and analytical techniques for clustered data need to be used. For prediction research in which the interest of predictor effects is on the patient level, random effect regression models are probably preferred over standard regression analysis. It is well known that the random effect parameter estimates and the standard logistic regression parameter estimates are different. Here, we compared random effect and standard logistic regression models for their ability to provide accurate predictions.
Using an empirical study on 1642 surgical patients at risk of postoperative nausea and vomiting, who were treated by one of 19 anesthesiologists (clusters), we developed prognostic models either with standard or random intercept logistic regression. External validity of these models was assessed in new patients from other anesthesiologists. We supported our results with simulation studies using intra-class correlation coefficients (ICC) of 5%, 15%, or 30%. Standard performance measures and measures adapted for the clustered data structure were estimated.
The model developed with random effect analysis showed better discrimination than the standard approach, if the cluster effects were used for risk prediction (standard c-index of 0.69 versus 0.66). In the external validation set, both models showed similar discrimination (standard c-index 0.68 versus 0.67). The simulation study confirmed these results. For datasets with a high ICC (≥15%), model calibration was only adequate in external subjects, if the used performance measure assumed the same data structure as the model development method: standard calibration measures showed good calibration for the standard developed model, calibration measures adapting the clustered data structure showed good calibration for the prediction model with random intercept.
The models with random intercept discriminate better than the standard model only if the cluster effect is used for predictions. The prediction model with random intercept had good calibration within clusters.
Logistic regression analysis; Prediction model with random intercept; Validation