In the United States the preferred method of obtaining dietary intake data is the 24-hour dietary recall, yet the measure of most interest is usual or long-term average daily intake, which is impossible to measure. Thus, usual dietary intake is assessed with considerable measurement error. We were interested in estimating the population distribution of the Healthy Eating Index-2005 (HEI-2005), a multi-component dietary quality index involving ratios of interrelated dietary components to energy, among children aged 2-8 in the United States, using a national survey and incorporating survey weights. We developed a highly nonlinear, multivariate zero-inflated data model with measurement error to address this question. Standard nonlinear mixed model software such as SAS NLMIXED cannot handle this problem. We found that taking a Bayesian approach, and using MCMC, resolved the computational issues and doing so enabled us to provide a realistic distribution estimate for the HEI-2005 total score. While our computation and thinking in solving this problem was Bayesian, we relied on the well-known close relationship between Bayesian posterior means and maximum likelihood, the latter not computationally feasible, and thus were able to develop standard errors using balanced repeated replication, a survey-sampling approach.
Bayesian methods; Dietary assessment; Latent variables; Measurement error; Mixed models; Nutritional epidemiology; Nutritional surveillance; Zero-Inflated Data
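The balanced repeated replication (BRR) step mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the textbook BRR variance formula (the mean squared deviation of the replicate estimates from the full-sample estimate). The function name `brr_standard_error` and the example numbers are hypothetical, and the abstract does not say which replicate weights or adjustments the authors used.

```python
import math


def brr_standard_error(full_estimate, replicate_estimates):
    """Textbook BRR standard error (an assumption, not the paper's exact recipe).

    full_estimate: the estimate computed with the full-sample survey weights.
    replicate_estimates: the same estimate recomputed under each set of
    replicate weights.
    """
    n_reps = len(replicate_estimates)
    if n_reps == 0:
        raise ValueError("at least one replicate estimate is required")
    # Mean squared deviation of the replicate estimates from the full estimate.
    variance = sum((t - full_estimate) ** 2 for t in replicate_estimates) / n_reps
    return math.sqrt(variance)


# Hypothetical mean score and four hypothetical replicate estimates.
se = brr_standard_error(50.0, [49.0, 51.0, 50.5, 49.5])
```

In this setting, each replicate estimate would be a posterior mean recomputed under one set of replicate weights, which is why the link between posterior means and maximum likelihood matters for the standard errors.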
Statistical tables are an essential component of scientific papers and reports in biomedical and agricultural sciences. Measurements in these tables are summarized as mean ± SEM for each treatment group. Results from pairwise-comparison tests are often included using letter displays, in which treatment means that are not significantly different are followed by a common letter. However, the traditional manual processes for computation and presentation of statistically significant outcomes in MS Word tables using a letter-based algorithm are tedious and prone to errors.
Using the R package ‘Shiny’, we present a web-based program freely available online, at https://houssein-assaad.shinyapps.io/TwoWayANOVA/. No download is required. The program is capable of rapidly generating publication-ready tables containing two-way analysis of variance (ANOVA) results. Additionally, the software can perform multiple comparisons of means using the Duncan, Student-Newman-Keuls, Tukey-Kramer, Westfall, and Fisher’s least significant difference (LSD) tests. If the LSD test is selected, multiple methods (e.g., Bonferroni and Holm) are available for adjusting p-values. Significance statements resulting from all pairwise comparisons are included in the table using the popular letter display algorithm. With the application of our software, the procedures of ANOVA can be completed within seconds using a web browser, preferably Mozilla Firefox or Google Chrome, and a few mouse clicks. To our knowledge, none of the currently available commercial (e.g., Stata, SPSS and SAS) or open-source software (e.g., R and Python) can perform such a rapid task without advanced knowledge of the corresponding programming language.
The new and user-friendly program described in this paper should help scientists perform statistical analysis and rapidly generate publication-ready MS-Word tables for two-way ANOVA. Our software is expected to facilitate research in agriculture, biomedicine, and other fields of life sciences.
Two-way ANOVA; Multiple comparisons; Online software; Statistical analysis; Biology; Agriculture; R; Shiny
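The Bonferroni and Holm adjustments named in the abstract above are simple enough to sketch directly. This is an illustrative stand-alone version of the two standard procedures, not the code of the Shiny program. The function names and the example p-values are hypothetical.

```python
def bonferroni(pvalues):
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def holm(pvalues):
    """Holm step-down adjustment, returned in the original input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is multiplied by m - k + 1.
        value = min(1.0, (m - rank) * pvalues[i])
        # Enforce monotonicity so adjusted p-values never decrease with rank.
        running_max = max(running_max, value)
        adjusted[i] = running_max
    return adjusted


# Hypothetical raw p-values from three pairwise LSD comparisons.
raw = [0.01, 0.04, 0.03]
bonf = bonferroni(raw)
step_down = holm(raw)
```

Holm's adjusted p-values are never larger than the Bonferroni ones, so it rejects at least as many comparisons at the same level.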
Functional principal component analysis (FPCA) has become the most widely used dimension reduction tool for functional data analysis. We consider functional data measured at random, subject-specific time points, contaminated with measurement error, allowing for both sparse and dense functional data, and propose novel information criteria to select the number of principal components in such data. We propose a Bayesian information criterion based on marginal modeling that can consistently select the number of principal components for both sparse and dense functional data. For dense functional data, we also develop an Akaike information criterion (AIC) based on the expected Kullback-Leibler information under a Gaussian assumption. In connecting with factor analysis in multivariate time series data, we also consider the information criteria by Bai & Ng (2002) and show that they are still consistent for dense functional data, if a prescribed undersmoothing scheme is undertaken in the FPCA algorithm. We perform intensive simulation studies and show that the proposed information criteria vastly outperform existing methods for this type of data. Surprisingly, our empirical evidence shows that our information criteria proposed for dense functional data also perform well for sparse functional data. An empirical example using colon carcinogenesis data is also provided to illustrate the results.
Akaike information criterion; Bayesian information criterion; Functional data analysis; Kernel smoothing; Principal components
Primary analysis of case–control studies focuses on the relationship between disease D and a set of covariates of interest (Y, X). A secondary application of the case–control study, which is often invoked in modern genetic epidemiologic association studies, is to investigate the interrelationship between the covariates themselves. The task is complicated owing to the case–control sampling, where the regression of Y on X is different from what it is in the population. Previous work has assumed a parametric distribution for Y given X and derived semiparametric efficient estimation and inference without any distributional assumptions about X. We take up the issue of estimation of a regression function when Y given X follows a homoscedastic regression model, but otherwise the distribution of Y is unspecified. The semiparametric efficient approaches can be used to construct semiparametric efficient estimates, but they suffer from a lack of robustness to the assumed model for Y given X. We take an entirely different approach. We show how to estimate the regression parameters consistently even if the assumed model for Y given X is incorrect, and thus the estimates are model robust. For this we make the assumption that the disease rate is known or well estimated. The assumption can be dropped when the disease is rare, which is typically so for most case–control studies, and the estimation algorithm simplifies. Simulations and empirical examples are used to illustrate the approach.
Biased samples; Homoscedastic regression; Secondary data; Secondary phenotypes; Semiparametric inference; Two-stage samples
Multiple Indicators, Multiple Causes Models (MIMIC) are often employed by researchers studying the effects of an unobservable latent variable on a set of outcomes, when causes of the latent variable are observed. There are times however when the causes of the latent variable are not observed because measurements of the causal variable are contaminated by measurement error. The objectives of this paper are: (1) to develop a novel model by extending the classical linear MIMIC model to allow both Berkson and classical measurement errors, defining the MIMIC measurement error (MIMIC ME) model, (2) to develop likelihood based estimation methods for the MIMIC ME model, (3) to apply the newly defined MIMIC ME model to atomic bomb survivor data to study the impact of dyslipidemia and radiation dose on the physical manifestations of dyslipidemia. As a by-product of our work, we also obtain a data-driven estimate of the variance of the classical measurement error associated with an estimate of the amount of radiation dose received by atomic bomb survivors at the time of their exposure.
Atomic bomb survivor data; Berkson error; Dyslipidemia; Instrumental Variables; Latent variables; Measurement error; MIMIC models
We consider the problem of estimating the density of a random variable when precise measurements on the variable are not available, but replicated proxies contaminated with measurement error are available for sufficiently many subjects. Under the assumption of additive measurement errors this reduces to a problem of deconvolution of densities. Deconvolution methods often make restrictive and unrealistic assumptions about the density of interest and the distribution of measurement errors, e.g., normality and homoscedasticity and thus independence from the variable of interest. This article relaxes these assumptions and introduces novel Bayesian semiparametric methodology based on Dirichlet process mixture models for robust deconvolution of densities in the presence of conditionally heteroscedastic measurement errors. In particular, the models can adapt to asymmetry, heavy tails and multimodality. In simulation experiments, we show that our methods vastly outperform a recent Bayesian approach based on estimating the densities via mixtures of splines. We apply our methods to data from nutritional epidemiology. Even in the special case when the measurement errors are homoscedastic, our methodology is novel and dominates other methods that have been proposed previously. Additional simulation results, instructions on getting access to the data set and R programs implementing our methods are included as part of online supplemental materials.
B-spline; Conditional heteroscedasticity; Density deconvolution; Dirichlet process mixture models; Measurement errors; Skew-normal distribution; Variance function
The data functions that are studied in the course of functional data
analysis are assembled from discrete data, and the level of smoothing that is
used is generally that which is appropriate for accurate approximation of the
conceptually smooth functions that were not actually observed. Existing
literature shows that this approach is effective, and even optimal, when using
functional data methods for prediction or hypothesis testing. However, in the
present paper we show that this approach is not effective in classification
problems. There a useful rule of thumb is that undersmoothing is often
desirable, but there are several surprising qualifications to that approach.
First, the effect of smoothing the training data can be more significant than
that of smoothing the new data set to be classified; second, undersmoothing is
not always the right approach, and in fact in some cases using a relatively
large bandwidth can be more effective; and third, these perverse results are the
consequence of very unusual properties of error rates, expressed as functions of
smoothing parameters. For example, the orders of magnitude of optimal smoothing
parameter choices depend on the signs and sizes of terms in an expansion of
error rate, and those signs and sizes can vary dramatically from one setting to
another, even for the same classifier.
Centroid method; discrimination; kernel smoothing; quadratic discrimination; smoothing parameter choice; training data
The Lasso shrinkage procedure achieved its popularity, in part, by its tendency to shrink estimated coefficients to zero, and its ability to serve as a variable selection procedure. Using data-adaptive weights, the adaptive Lasso modified the original procedure to increase the penalty terms for those variables estimated to be less important by ordinary least squares. Although this modified procedure attained the oracle properties, the resulting models tend to include a large number of “false positives” in practice. Here, we adapt the concept of local false discovery rates (lFDRs) so that it applies to the sequence, λn, of smoothing parameters for the adaptive Lasso. We define the lFDR for a given λn to be the probability that the variable added to the model by decreasing λn to λn−δ is not associated with the outcome, where δ is a small value. We derive the relationship between the lFDR and λn, show lFDR=1 for traditional smoothing parameters, and show how to select λn so as to achieve a desired lFDR. We compare the smoothing parameters chosen to achieve a specified lFDR and those chosen to achieve the oracle properties, as well as their resulting estimates for model coefficients, with both simulation and an example from a genetic study of prostate specific antigen.
Adaptive Lasso; Local false discovery rate; Smoothing parameter; Variable selection
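To make the role of the data-adaptive weights concrete, here is a minimal sketch of the adaptive Lasso in the special case of an orthonormal design, where the estimate has a closed form (soft-thresholding of each ordinary least squares coefficient with its own threshold). The orthonormal-design assumption, the function name, the `gamma` exponent, and the example values are illustrative choices, not taken from the paper, and the lFDR-based choice of λn is not implemented here.

```python
def adaptive_lasso_orthonormal(ols_coefs, lam, gamma=1.0):
    """Adaptive Lasso under an ASSUMED orthonormal design (X'X = I).

    Minimizes 0.5 * ||y - X b||^2 + lam * sum_j w_j * |b_j| with
    w_j = 1 / |b_ols_j| ** gamma, which reduces to coefficient-wise
    soft-thresholding with threshold lam * w_j.
    """
    estimates = []
    for b in ols_coefs:
        if b == 0.0:
            # An exactly-zero OLS coefficient has an infinite penalty weight.
            estimates.append(0.0)
            continue
        weight = 1.0 / abs(b) ** gamma  # larger penalty for smaller |b|
        shrunk = max(abs(b) - lam * weight, 0.0)
        estimates.append(shrunk if b > 0 else -shrunk)
    return estimates


# Hypothetical OLS coefficients: one strong, one moderate, one weak.
fit = adaptive_lasso_orthonormal([2.0, -0.5, 0.1], lam=0.2)
```

Lowering `lam` slightly lets the weakest excluded variable re-enter the model, which is the step whose false discovery probability the abstract calls the lFDR.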
Male Zucker diabetic fatty (ZDF) rats were used to study effects of oral administration of interferon tau (IFNT) in reducing obesity. Eighteen ZDF rats (28 days of age) were assigned randomly to receive 0, 4 or 8 μg IFNT/kg body weight (BW) per day (n=6/group) for 8 weeks. Water consumption was measured every two days. Food intake and BW were recorded weekly. Energy expenditure in 4-, 6-, 8-, and 10-week-old rats was determined using indirect calorimetry. Starting at 7 weeks of age, urinary glucose and ketone bodies were tested daily. Rates of glucose and oleate oxidation in liver, brown adipose tissue, and abdominal adipose tissue, leucine catabolism in skeletal muscle, and lipolysis in white and brown adipose tissues were greater for rats treated with 8 μg IFNT/kg BW/day in comparison with control rats. Treatment with 8 μg IFNT/kg BW/day increased heat production, reduced BW gain and adiposity, ameliorated fatty liver syndrome, delayed the onset of diabetes, and decreased concentrations of glucose, free fatty acids, triacylglycerol, cholesterol, and branched-chain amino acids in plasma, compared to control rats. Oral administration of 8 μg IFNT/kg BW/day ameliorated oxidative stress in skeletal muscle, liver and adipose tissue, as indicated by decreased ratios of oxidized glutathione to reduced glutathione and increased concentrations of the antioxidant tetrahydrobiopterin. These results indicate that IFNT stimulates oxidation of energy substrates and reduces obesity in ZDF rats and may have broad important implications for preventing and treating obesity-related diseases in mammals.
Obesity; interferon tau; diabetes; energy expenditure; tissue oxidation
When some of the regressors can act on both the response and other explanatory variables, the already challenging problem of selecting variables when the number of covariates exceeds the sample size becomes more difficult. A motivating example is a metabolic study in mice that has diet groups and gut microbial percentages that may affect changes in multiple phenotypes related to body weight regulation. The data have more variables than observations and diet is known to act directly on the phenotypes as well as on some or potentially all of the microbial percentages. Interest lies in determining which gut microflora influence the phenotypes while accounting for the direct relationship between diet and the other variables. A new methodology for variable selection in this context is presented that links the concept of q-values from multiple hypothesis testing to the recently developed weighted Lasso.
False discovery rate; Microbial data; q-Values; Variable selection; Weighted Lasso
Statistical tables are an important component of data analysis and reports in biological sciences. However, the traditional manual processes for computation and presentation of statistically significant results using a letter-based algorithm are tedious and prone to errors.
Based on the R language, we present two web-based programs for individual and summary data, freely available online, at http://shiny.stat.tamu.edu:3838/hassaad/Table_report1/ and http://shiny.stat.tamu.edu:3838/hassaad/SumAOV1/, respectively. The programs are capable of rapidly generating publication-ready tables containing one-way analysis of variance (ANOVA) results. No download is required. Additionally, the software can perform multiple comparisons of means using the Duncan, Student-Newman-Keuls, Tukey-Kramer, and Fisher’s least significant difference (LSD) tests. If the LSD test is selected, multiple methods (e.g., Bonferroni and Holm) are available for adjusting p-values. Using the software, the procedures of ANOVA can be completed within seconds using a web browser, preferably Mozilla Firefox or Google Chrome, and a few mouse clicks. Furthermore, the software can handle one-way ANOVA for summary data (i.e., sample size, mean, and SD or SEM per treatment group) with post-hoc multiple comparisons among treatment means. To our knowledge, none of the currently available commercial (e.g., SPSS and SAS) or open-source software (e.g., R and Python) can perform such a rapid task without advanced knowledge of the corresponding programming language.
Our new and user-friendly software to perform statistical analysis and generate publication-ready MS-Word tables for one-way ANOVA is expected to facilitate research in agriculture, biomedicine, and other fields of life sciences.
Statistical analysis; Multiple comparisons; Online software; Computation; Biology; R; Shiny
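One-way ANOVA from summary data, as described in the abstract above, needs only the sample size, mean, and SD (or SEM) of each treatment group. The sketch below shows the standard calculation of the F statistic from those summaries. It is an illustration of the arithmetic, not the code behind the web programs, and the function names and example values are hypothetical.

```python
import math


def sem_to_sd(sem, n):
    """Convert a standard error of the mean to a standard deviation."""
    return sem * math.sqrt(n)


def anova_from_summary(ns, means, sds):
    """Return (F, df_between, df_within) for a one-way ANOVA from summaries."""
    k = len(ns)
    n_total = sum(ns)
    # Grand mean is the sample-size-weighted average of the group means.
    grand_mean = sum(n * m for n, m in zip(ns, means)) / n_total
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m in zip(ns, means))
    # Each group's within sum of squares is (n - 1) times its sample variance.
    ss_within = sum((n - 1) * s ** 2 for n, s in zip(ns, sds))
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical summaries for three treatment groups of three animals each.
f_stat, df1, df2 = anova_from_summary([3, 3, 3], [1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
```

The p-value would then come from the upper tail of the F distribution with `df1` and `df2` degrees of freedom, which is left out here to keep the sketch free of external libraries.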
With the advent of Internet-based 24-hour recall (24HR) instruments, it is now possible to envision their use in cohort studies investigating the relation between nutrition and disease. Understanding that all dietary assessment instruments are subject to measurement errors and correcting for them under the assumption that the 24HR is unbiased for usual intake, here the authors simultaneously address precision, power, and sample size under the following 3 conditions: 1) 1–12 24HRs; 2) a single calibrated food frequency questionnaire (FFQ); and 3) a combination of 24HR and FFQ data. Using data from the Eating at America’s Table Study (1997–1998), the authors found that 4–6 administrations of the 24HR is optimal for most nutrients and food groups and that combined use of multiple 24HR and FFQ data sometimes provides data superior to use of either method alone, especially for foods that are not regularly consumed. For all food groups but the most rarely consumed, use of 2–4 recalls alone, with or without additional FFQ data, was superior to use of FFQ data alone. Thus, if self-administered automated 24HRs are to be used in cohort studies, 4–6 administrations of the 24HR should be considered along with administration of an FFQ.
combining dietary instruments; data collection; dietary assessment; energy adjustment; epidemiologic methods; measurement error; nutrient density; nutrient intake
We propose a multiple imputation estimator for parameter estimation in a quantile regression model when some covariates are missing at random. The estimation procedure fully utilizes the entire dataset to achieve increased efficiency, and the resulting coefficient estimators are root-n consistent and asymptotically normal. To protect against possible model misspecification, we further propose a shrinkage estimator, which automatically adjusts for possible bias. The finite sample performance of our estimator is investigated in a simulation study. Finally, we apply our methodology to part of the Eating at America’s Table Study data, investigating the association between two measures of dietary intake.
Missing data; Multiple imputation; Quantile regression; Regression quantile; Shrinkage estimation
Development of obesity in animals is affected by energy intake, dietary composition, and metabolism. Useful models for studying this metabolic problem are Sprague-Dawley rats fed low-fat (LF) or high-fat (HF) diets beginning at 28 days of age. Through experimental design, their dietary intakes of energy, protein, vitamins, and minerals per kg body weight (BW) do not differ in order to eliminate confounding factors in data interpretation. The 24-h energy expenditure of rats is measured using indirect calorimetry. A regression model is constructed to accurately predict BW gain based on diet, initial BW gain, and the principal component scores of respiratory quotient and heat production. Time-course data on metabolism (including energy expenditure) are analyzed using a mixed effect model that fits both fixed and random effects. Cluster analysis is employed to classify rats as normal-weight or obese. HF-fed rats are heavier than LF-fed rats, but rates of their heat production per kg non-fat mass do not differ. We conclude that metabolic conversion of dietary lipids into body fat primarily contributes to obesity in HF-fed rats.
Obesity; Energy Expenditure; Indirect Calorimetry; Statistical Analysis; Review
Systematic investigations into the structure of measurement error of physical activity questionnaires are lacking. We propose a measurement error model for a physical activity questionnaire that uses physical activity level (the ratio of total energy expenditure to basal energy expenditure) to relate questionnaire-based reports of physical activity level to true physical activity levels. The 1999–2006 National Health and Nutrition Examination Survey physical activity questionnaire was administered to 433 participants aged 40–69 years in the Observing Protein and Energy Nutrition (OPEN) Study (Maryland, 1999–2000). Valid estimates of participants’ total energy expenditure were also available from doubly labeled water, and basal energy expenditure was estimated from an equation; the ratio of those measures estimated true physical activity level (“truth”). We present a measurement error model that accommodates the mixture of errors that arise from assuming a classical measurement error model for doubly labeled water and a Berkson error model for the equation used to estimate basal energy expenditure. The method was then applied to the OPEN Study. Correlations between the questionnaire-based physical activity level and truth were modest (r = 0.32–0.41); attenuation factors (0.43–0.73) indicate that the use of questionnaire-based physical activity level would lead to attenuated estimates of effect size. Results suggest that sample sizes for estimating relationships between physical activity level and disease should be inflated, and that regression calibration can be used to provide measurement error–adjusted estimates of relationships between physical activity and disease.
Berkson model; bias; energy metabolism; measurement error model; models, statistical; motor activity; self-assessment
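The practical meaning of the attenuation factors and correlations reported above can be illustrated with two textbook approximations, both assumed here rather than taken from the paper's model: in a simple linear setting the observed slope equals the true slope times the attenuation factor, and the sample size needed to detect an effect grows roughly as the inverse of the squared correlation between the instrument and truth. The function names are hypothetical.

```python
def deattenuate(observed_slope, attenuation_factor):
    """Simple regression-calibration correction in the linear case (assumed)."""
    if not 0.0 < attenuation_factor <= 1.0:
        raise ValueError("attenuation factor should lie in (0, 1]")
    return observed_slope / attenuation_factor


def sample_size_inflation(correlation_with_truth):
    """Approximate sample-size multiplier, 1 / rho^2 (textbook approximation)."""
    if correlation_with_truth == 0.0:
        raise ValueError("correlation with truth must be nonzero")
    return 1.0 / correlation_with_truth ** 2


# Ranges quoted in the abstract: r = 0.32-0.41, attenuation 0.43-0.73.
inflation_low_r = sample_size_inflation(0.32)   # roughly a tenfold increase
inflation_high_r = sample_size_inflation(0.41)  # roughly a sixfold increase
corrected = deattenuate(0.10, 0.43)             # hypothetical observed slope
```

Under these approximations, the reported correlations imply that a study relying on the questionnaire alone would need roughly six to ten times as many participants as one with error-free measurements.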
Partial differential equation (PDE) models are commonly used to model complex dynamic systems in applied sciences such as biology and finance. The forms of these PDE models are usually proposed by experts based on their prior knowledge and understanding of the dynamic system. Parameters in PDE models often have interesting scientific interpretations, but their values are often unknown, and need to be estimated from the measurements of the dynamic system in the presence of measurement errors. Most PDEs used in practice have no analytic solutions, and can only be solved with numerical methods. Currently, methods for estimating PDE parameters require repeatedly solving PDEs numerically under thousands of candidate parameter values, and thus the computational load is high. In this article, we propose two methods to estimate parameters in PDE models: a parameter cascading method and a Bayesian approach. In both methods, the underlying dynamic process modeled with the PDE model is represented via basis function expansion. For the parameter cascading method, we develop two nested levels of optimization to estimate the PDE parameters. For the Bayesian method, we develop a joint model for data and the PDE, and develop a novel hierarchical model allowing us to employ Markov chain Monte Carlo (MCMC) techniques to make posterior inference. Simulation studies show that the Bayesian method and parameter cascading method are comparable, and both outperform other available methods in terms of estimation accuracy. The two methods are demonstrated by estimating parameters in a PDE model from LIDAR data.
Asymptotic theory; Basis function expansion; Bayesian method; Differential equations; Measurement error; Parameter cascading
We have demonstrated that diets containing fish oil and pectin (FO/P) reduce colon tumor incidence relative to control (corn oil and cellulose [CO/C]) in part by inducing apoptosis of DNA-damaged colon cells. Relative to FO/P, CO/C promotes colonocyte expression of the antiapoptotic modulator, Bcl-2, and Bcl-2 promoter methylation is altered in colon cancer. To determine if FO/P, compared with CO/C, limits Bcl-2 expression by enhancing promoter methylation in colon tumors, we examined Bcl-2 promoter methylation, mRNA levels, colonocyte apoptosis and colon tumor incidence in azoxymethane (AOM)-injected rats. Rats were provided diets containing FO/P or CO/C, and were terminated 16 and 34 weeks after AOM injection. DNA isolated from paraformaldehyde-fixed colon tumors and uninvolved tissue was bisulfite modified and amplified by quantitative reverse transcriptase-polymerase chain reaction to assess DNA methylation in Bcl-2 cytosine–guanosine islands. FO/P increased Bcl-2 promoter methylation (P = 0.009) in tumor tissues and colonocyte apoptosis (P = 0.020) relative to CO/C. An inverse correlation between Bcl-2 DNA methylation and Bcl-2 mRNA levels was observed in the tumors. We conclude that dietary FO/P promotes apoptosis in part by enhancing Bcl-2 promoter methylation. These Bcl-2 promoter methylation responses, measured in vivo, contribute to our understanding of the mechanisms involved in chemoprevention of colon cancer by diets containing FO/P.
Bcl-2; DNA methylation; epigenetics; fish oil; pectin
DNA methylation and histone acetylation contribute to the transcriptional regulation of genes involved in apoptosis. We have demonstrated that docosahexaenoic acid (DHA, 22:6 n-3) and butyrate enhance colonocyte apoptosis. To determine if DHA and/or butyrate elevate apoptosis through epigenetic mechanisms thereby restoring the transcription of apoptosis-related genes, we examined global methylation; gene-specific promoter methylation of 24 apoptosis-related genes; transcription levels of Cideb, Dapk1, and Tnfrsf25; and global histone acetylation in the HCT-116 colon cancer cell line. Cells were treated with combinations of (50 μM) DHA or linoleic acid (18:2 n-6), (5 mM) butyrate or an inhibitor of DNA methyltransferases, 5-aza-2′-deoxycytidine (5-Aza-dC, 2 μM). Among highly methylated genes, the combination of DHA and butyrate significantly reduced methylation of the proapoptotic Bcl2l11, Cideb, Dapk1, Ltbr, and Tnfrsf25 genes compared to untreated control cells. DHA treatment reduced the methylation of Cideb, Dapk1, and Tnfrsf25. These data suggest that the induction of apoptosis by DHA and butyrate is mediated, in part, through changes in the methylation state of apoptosis-related genes.
docosahexaenoic acid; butyrate; apoptosis; DNA methylation; epigenetics
Primary analysis of case-control studies focuses on the relationship between disease (D) and a set of covariates of interest (Y, X). A secondary application of the case-control study, often invoked in modern genetic epidemiologic association studies, is to investigate the interrelationship between the covariates themselves. The task is complicated due to the case-control sampling, and to avoid the biased sampling that arises from the design, it is typical to use the control data only. In this paper, we develop penalized regression spline methodology that uses all the data, and improves precision of estimation compared to using only the controls. A simulation study and an empirical example are used to illustrate the methodology.
Biased samples; B-splines; Homoscedastic regression; Nonparametric regression; Regression splines; Secondary data; Secondary phenotypes; Two-stage samples
Efficient estimation of parameters is a major objective in analyzing longitudinal data. We propose two generalized empirical likelihood-based methods that take into consideration within-subject correlations. A nonparametric version of the Wilks theorem for the limiting distributions of the empirical likelihood ratios is derived. It is shown that one of the proposed methods is locally efficient among a class of within-subject variance-covariance matrices. A simulation study is conducted to investigate the finite sample properties of the proposed methods and to compare them with the block empirical likelihood method by You et al. (2006) and the normal approximation with a correctly estimated variance-covariance. The results suggest that the proposed methods are generally more efficient than existing methods that ignore the correlation structure, and are better in coverage compared to the normal approximation with correctly specified within-subject correlation. An application illustrating our methods and supporting the simulation study results is presented.
Confidence region; Efficient estimation; Empirical likelihood; Longitudinal data; Maximum empirical likelihood estimator
The 1986 accident at the Chernobyl nuclear power plant remains the most serious nuclear accident in history, and excess thyroid cancers, particularly among those exposed to releases of iodine-131, remain the best-documented sequelae. Failure to take dose-measurement error into account can lead to bias in assessments of dose-response slope. Although risks in the Ukrainian-US thyroid screening study have been previously evaluated, errors in dose assessments have not been addressed hitherto. Dose-response patterns were examined in a thyroid screening prevalence cohort of 13,127 persons aged <18 at the time of the accident who were resident in the most radioactively contaminated regions of Ukraine. We extended earlier analyses in this cohort by adjusting for dose error in the recently developed TD-10 dosimetry. Three methods of statistical correction, via two types of regression calibration, and Monte Carlo maximum-likelihood, were applied to the doses that can be derived from the ratio of thyroid activity to thyroid mass. The two components that make up this ratio have different types of error, Berkson error for thyroid mass and classical error for thyroid activity. The first regression-calibration method yielded estimates of excess odds ratio of 5.78 Gy−1 (95% CI 1.92, 27.04), about 7% higher than estimates unadjusted for dose error. The second regression-calibration method gave an excess odds ratio of 4.78 Gy−1 (95% CI 1.64, 19.69), about 11% lower than unadjusted analysis. The Monte Carlo maximum-likelihood method produced an excess odds ratio of 4.93 Gy−1 (95% CI 1.67, 19.90), about 8% lower than unadjusted analysis. There are borderline-significant (p = 0.101–0.112) indications of downward curvature in the dose response; allowing for this curvature nearly doubled the low-dose linear coefficient.
In conclusion, dose-error adjustment has comparatively modest effects on regression parameters, a consequence of the relatively small errors, of a mixture of Berkson and classical form, associated with thyroid dose assessment.
The authors consider the analysis of hierarchical longitudinal functional data based upon a functional principal components approach. In contrast to standard frequentist approaches to selecting the number of principal components, the authors do model averaging using a Bayesian formulation. A relatively straightforward reversible jump Markov Chain Monte Carlo formulation has poor mixing properties and in simulated data often becomes trapped at the wrong number of principal components. In order to overcome this, the authors show how to apply Stochastic Approximation Monte Carlo (SAMC) to this problem, a method that has the potential to explore the entire space and does not become trapped in local extrema. The combination of reversible jump methods and SAMC in hierarchical longitudinal functional data is simplified by a polar coordinate representation of the principal components. The approach is easy to implement and does well in simulated data in determining the distribution of the number of principal components, and in terms of its frequentist estimation properties. Empirical applications are also presented.
Functional data analysis; Hierarchical models; Longitudinal data; Markov chain Monte Carlo; Principal Components; Stochastic approximation
Local polynomial estimators are popular techniques for nonparametric regression estimation and have received great attention in the literature. Their simplest version, the local constant estimator, can be easily extended to the errors-in-variables context by exploiting its similarity with the deconvolution kernel density estimator. The generalization of the higher order versions of the estimator, however, is not straightforward and has remained an open problem for the last 15 years. We propose an innovative local polynomial estimator of any order in the errors-in-variables context, derive its design-adaptive asymptotic properties and study its finite sample performance on simulated examples. We not only provide a solution to a long-standing open problem, but also make methodological contributions to errors-in-variables regression, including local polynomial estimation of derivative functions.
Bandwidth selector; Deconvolution; Inverse problems; Local polynomial; Measurement errors; Nonparametric regression; Replicated measurements
It is of interest to estimate the distribution of usual nutrient intake for a population from repeat 24-h dietary recall assessments. A mixed effects model and quantile estimation procedure, developed at the National Cancer Institute (NCI), may be used for this purpose. The model incorporates a Box–Cox parameter and covariates to estimate usual daily intake of nutrients; model parameters are estimated via quasi-Newton optimization of a likelihood approximated by the adaptive Gaussian quadrature. The parameter estimates are used in a Monte Carlo approach to generate empirical quantiles; standard errors are estimated by bootstrap. The NCI method is illustrated and compared with current estimation methods, including the individual mean and the semi-parametric method developed at the Iowa State University (ISU), using data from a random sample and computer simulations. Both the NCI and ISU methods for nutrients are superior to the distribution of individual means. For simple (no covariate) models, quantile estimates are similar between the NCI and ISU methods. The bootstrap approach used by the NCI method to estimate standard errors of quantiles appears preferable to Taylor linearization. One major advantage of the NCI method is its ability to provide estimates for subpopulations through the incorporation of covariates into the model. The NCI method may be used for estimating the distribution of usual nutrient intake for populations and subpopulations as part of a unified framework of estimation of usual intake of dietary constituents.
statistical distributions; diet surveys; nutrition assessment; mixed-effects model; nutrients; percentiles
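The Monte Carlo quantile step described in the abstract above can be sketched in a heavily simplified form: draw person-level values on the Box–Cox scale, back-transform them, and read off empirical quantiles. This sketch assumes a model with no covariates and a plain inverse Box–Cox back-transformation, so it omits the adjustment for within-person variation that a full usual-intake method would need. All names and parameter values are hypothetical.

```python
import math
import random


def box_cox(x, lam):
    """Box-Cox transform for x > 0 (log transform when lam == 0)."""
    return math.log(x) if lam == 0 else (x ** lam - 1.0) / lam


def inverse_box_cox(t, lam):
    """Inverse transform for lam >= 0; values below the valid range map to 0."""
    if lam < 0:
        raise ValueError("this sketch only handles lam >= 0")
    if lam == 0:
        return math.exp(t)
    return max(lam * t + 1.0, 0.0) ** (1.0 / lam)


def simulate_intake_quantiles(mean_t, sd_between, lam, probs,
                              n_draws=10000, seed=0):
    """Empirical quantiles of back-transformed draws (simplified sketch)."""
    rng = random.Random(seed)
    # Person-level values on the transformed scale: mean plus a random effect.
    draws = sorted(
        inverse_box_cox(rng.gauss(mean_t, sd_between), lam)
        for _ in range(n_draws)
    )
    return [draws[min(int(p * n_draws), n_draws - 1)] for p in probs]


# Hypothetical parameter values standing in for fitted model estimates.
quartiles = simulate_intake_quantiles(
    mean_t=box_cox(60.0, 0.3), sd_between=0.8, lam=0.3,
    probs=[0.25, 0.50, 0.75],
)
```

Covariates would enter by shifting `mean_t` for each subpopulation, which is how a model of this kind can yield subgroup-specific distributions.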