1.  Estimating the Distribution of Dietary Consumption Patterns 
In the United States the preferred method of obtaining dietary intake data is the 24-hour dietary recall, yet the measure of most interest is usual or long-term average daily intake, which cannot be measured directly. Thus, usual dietary intake is assessed with considerable measurement error. We were interested in estimating the population distribution of the Healthy Eating Index-2005 (HEI-2005), a multi-component dietary quality index involving ratios of interrelated dietary components to energy, among children aged 2-8 in the United States, using a national survey and incorporating survey weights. We developed a highly nonlinear, multivariate zero-inflated data model with measurement error to address this question. Standard nonlinear mixed model software such as SAS NLMIXED cannot handle this problem. We found that taking a Bayesian approach, and using MCMC, resolved the computational issues and enabled us to provide a realistic distribution estimate for the HEI-2005 total score. While our computation and thinking in solving this problem were Bayesian, we relied on the well-known close relationship between Bayesian posterior means and maximum likelihood, the latter being computationally infeasible here, and thus were able to develop standard errors using balanced repeated replication, a survey-sampling approach.
PMCID: PMC4189114  PMID: 25309033
Bayesian methods; Dietary assessment; Latent variables; Measurement error; Mixed models; Nutritional epidemiology; Nutritional surveillance; Zero-Inflated Data
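The balanced repeated replication (BRR) variance mentioned above has a simple generic form: the average squared deviation of half-sample replicate estimates from the full-sample estimate. A minimal sketch with made-up numbers, not the paper's actual survey design:

```python
import numpy as np

def brr_variance(full_estimate, replicate_estimates):
    """Balanced repeated replication (BRR) variance: the mean squared
    deviation of half-sample replicate estimates from the full-sample
    estimate. Generic illustration; real BRR replicates come from a
    balanced half-sample survey design."""
    reps = np.asarray(replicate_estimates, dtype=float)
    return np.mean((reps - full_estimate) ** 2)

# Hypothetical replicate estimates of a mean HEI-2005 total score
full = 55.0
reps = [54.2, 55.9, 55.4, 54.6]
se = np.sqrt(brr_variance(full, reps))
```

Each replicate re-runs the whole estimation on a half-sample, so the spread of the replicates captures the full design-based uncertainty.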
2.  Robust estimation for homoscedastic regression in the secondary analysis of case–control data 
Primary analysis of case–control studies focuses on the relationship between disease D and a set of covariates of interest (Y, X). A secondary application of the case–control study, which is often invoked in modern genetic epidemiologic association studies, is to investigate the interrelationship between the covariates themselves. The task is complicated owing to the case–control sampling, where the regression of Y on X is different from what it is in the population. Previous work has assumed a parametric distribution for Y given X and derived semiparametric efficient estimation and inference without any distributional assumptions about X. We take up the issue of estimation of a regression function when Y given X follows a homoscedastic regression model, but otherwise the distribution of Y is unspecified. The semiparametric efficient approaches can be used to construct semiparametric efficient estimates, but they suffer from a lack of robustness to the assumed model for Y given X. We take an entirely different approach. We show how to estimate the regression parameters consistently even if the assumed model for Y given X is incorrect, and thus the estimates are model robust. For this we make the assumption that the disease rate is known or well estimated. The assumption can be dropped when the disease is rare, which is typically so for most case–control studies, and the estimation algorithm simplifies. Simulations and empirical examples are used to illustrate the approach.
PMCID: PMC3639015  PMID: 23637568
Biased samples; Homoscedastic regression; Secondary data; Secondary phenotypes; Semiparametric inference; Two-stage samples
3.  Taking Advantage of the Strengths of 2 Different Dietary Assessment Instruments to Improve Intake Estimates for Nutritional Epidemiology 
American Journal of Epidemiology  2012;175(4):340-347.
With the advent of Internet-based 24-hour recall (24HR) instruments, it is now possible to envision their use in cohort studies investigating the relation between nutrition and disease. Understanding that all dietary assessment instruments are subject to measurement errors and correcting for them under the assumption that the 24HR is unbiased for usual intake, here the authors simultaneously address precision, power, and sample size under the following 3 conditions: 1) 1–12 24HRs; 2) a single calibrated food frequency questionnaire (FFQ); and 3) a combination of 24HR and FFQ data. Using data from the Eating at America’s Table Study (1997–1998), the authors found that 4–6 administrations of the 24HR is optimal for most nutrients and food groups and that combined use of multiple 24HR and FFQ data sometimes provides data superior to use of either method alone, especially for foods that are not regularly consumed. For all food groups but the most rarely consumed, use of 2–4 recalls alone, with or without additional FFQ data, was superior to use of FFQ data alone. Thus, if self-administered automated 24HRs are to be used in cohort studies, 4–6 administrations of the 24HR should be considered along with administration of an FFQ.
PMCID: PMC3271815  PMID: 22273536
combining dietary instruments; data collection; dietary assessment; energy adjustment; epidemiologic methods; measurement error; nutrient density; nutrient intake
4.  Chemopreventive n-3 Polyunsaturated Fatty Acids Reprogram Genetic Signatures during Colon Cancer Initiation and Progression in the Rat 
Cancer research  2004;64(18):6797-6804.
The mechanisms by which n-3 polyunsaturated fatty acids (PUFAs) decrease colon tumor formation have not been fully elucidated. Examination of genes up- or down-regulated at various stages of tumor development via the monitoring of gene expression relationships will help to determine the biological processes ultimately responsible for the protective effects of n-3 PUFA. Therefore, using a 3 × 2 × 2 factorial design, we used Codelink DNA microarrays containing ∼9000 genes to help decipher the global changes in colonocyte gene expression profiles in carcinogen-injected Sprague Dawley rats. Animals were assigned to three dietary treatments differing only in the type of fat (corn oil/n-6 PUFA, fish oil/n-3 PUFA, or olive oil/n-9 monounsaturated fatty acid), two treatments (injection with the carcinogen azoxymethane or with saline), and two time points (12 hours and 10 weeks after first injection). Only the consumption of n-3 PUFA exerted a protective effect at the initiation (DNA adduct formation) and promotional (aberrant crypt foci) stages. Importantly, microarray analysis of colonocyte gene expression profiles discerned fundamental differences among animals treated with n-3 PUFA at both the 12-hour and 10-week time points. Thus, in addition to demonstrating that dietary fat composition alters the molecular portrait of gene expression profiles in the colonic epithelium at both the initiation and promotional stages of tumor development, these findings indicate that the chemopreventive effect of fish oil is due to the direct action of n-3 PUFA and not to a reduction in the content of n-6 PUFA.
PMCID: PMC4459750  PMID: 15374999
5.  Significance tests for functional data with complex dependence structure 
We propose an L2-norm based global testing procedure for the null hypothesis that multiple group mean functions are equal, for functional data with complex dependence structure. Specifically, we consider the setting of functional data with a multilevel structure of the form groups–clusters or subjects–units, where the unit-level profiles are spatially correlated within the cluster, and the cluster-level data are independent. Orthogonal series expansions are used to approximate the group mean functions and the test statistic is estimated using the basis coefficients. The asymptotic null distribution of the test statistic is developed, under mild regularity conditions. To our knowledge this is the first work that studies hypothesis testing, when data have such complex multilevel functional and spatial structure. Two small-sample alternatives, including a novel block bootstrap for functional data, are proposed, and their performance is examined in simulation studies. The paper concludes with an illustration of a motivating experiment.
PMCID: PMC4443904  PMID: 26023253
Block bootstrap; Functional data; Group mean testing; Hierarchical modeling; Significance tests; Spatially correlated curves
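The core construction — approximating group mean functions by basis coefficients and measuring their L2 distance — can be illustrated with a simplified i.i.d.-curves sketch; the paper's statistic additionally accounts for the multilevel spatial dependence:

```python
import numpy as np

def l2_group_statistic(groups, n_basis=5):
    """Toy L2-norm statistic for equality of group mean functions:
    project each curve on a cosine basis, estimate group mean
    coefficients, and sum weighted squared distances to the pooled
    mean. Ignores the paper's multilevel/spatial correlation."""
    t = np.linspace(0, 1, groups[0].shape[1])
    basis = np.array([np.sqrt(2) * np.cos(np.pi * k * t) for k in range(1, n_basis + 1)])
    coefs = [curves @ basis.T / len(t) for curves in groups]  # per-curve basis coefficients
    means = [c.mean(axis=0) for c in coefs]
    pooled = np.vstack(coefs).mean(axis=0)
    return sum(len(c) * np.sum((m - pooled) ** 2) for c, m in zip(coefs, means))

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 50)
g1 = rng.normal(size=(20, 50)) + np.sin(2 * np.pi * t)                       # group 1
g2 = rng.normal(size=(20, 50)) + np.sin(2 * np.pi * t) + np.cos(np.pi * t)   # shifted shape
stat = l2_group_statistic([g1, g2])
```

Under the null the statistic is small; a shape difference between group means inflates it, and in the paper its null distribution (or a block bootstrap) calibrates the test.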
6.  Identification of important regressor groups, subgroups and individuals via regularization methods: application to gut microbiome data 
Bioinformatics  2013;30(6):831-837.
Motivation: Gut microbiota can be classified at multiple taxonomy levels. Strategies to use changes in microbiota composition to effect health improvements require knowing at which taxonomy level interventions should be aimed. Identifying these important levels is difficult, however, because most statistical methods consider the microbiota as classified at only one taxonomy level, not at multiple levels simultaneously.
Results: Using L1 and L2 regularizations, we developed a new variable selection method that identifies important features at multiple taxonomy levels. The regularization parameters are chosen by a new, data-adaptive, repeated cross-validation approach, which performed well. In simulation studies, our method outperformed competing methods: it more often selected significant variables, and had small false discovery rates and acceptable false-positive rates. Applying our method to gut microbiota data, we found which taxonomic levels were most altered by specific interventions or physiological status.
Availability: The new approach is implemented in an R package, which is freely available from the corresponding author.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3957069  PMID: 24162467
7.  Generalized empirical likelihood methods for analyzing longitudinal data 
Biometrika  2010;97(1):79-93.
Efficient estimation of parameters is a major objective in analyzing longitudinal data. We propose two generalized empirical likelihood-based methods that take into consideration within-subject correlations. A nonparametric version of the Wilks theorem for the limiting distributions of the empirical likelihood ratios is derived. It is shown that one of the proposed methods is locally efficient among a class of within-subject variance-covariance matrices. A simulation study is conducted to investigate the finite sample properties of the proposed methods and to compare them with the block empirical likelihood method by You et al. (2006) and the normal approximation with a correctly estimated variance-covariance matrix. The results suggest that the proposed methods are generally more efficient than existing methods that ignore the correlation structure, and are better in coverage compared to the normal approximation with correctly specified within-subject correlation. An application illustrating our methods and supporting the simulation study results is presented.
PMCID: PMC2841365  PMID: 20305730
Confidence region; Efficient estimation; Empirical likelihood; Longitudinal data; Maximum empirical likelihood estimator
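For intuition, the empirical likelihood ratio in the simplest i.i.d.-mean case reduces to solving for a single Lagrange multiplier; the paper extends this construction to longitudinal data with within-subject correlation. A minimal sketch:

```python
import numpy as np
from scipy.optimize import brentq

def el_log_ratio(x, mu):
    """-2 log empirical likelihood ratio for the mean of an i.i.d.
    sample: solve sum((x-mu)/(1+lam*(x-mu))) = 0 for the Lagrange
    multiplier lam, then return 2*sum(log(1+lam*(x-mu))). Textbook
    i.i.d. case only, not the paper's longitudinal version."""
    z = np.asarray(x, dtype=float) - mu
    # lam must keep every implied weight positive: 1 + lam*z_i > 0
    lo = (-1.0 + 1e-10) / z.max()
    hi = (-1.0 + 1e-10) / z.min()
    g = lambda lam: np.sum(z / (1.0 + lam * z))  # strictly decreasing in lam
    lam = brentq(g, lo + 1e-12, hi - 1e-12)
    return 2.0 * np.sum(np.log1p(lam * z))

x = [1.1, 2.3, 0.7, 1.9, 1.5, 2.0]
r0 = el_log_ratio(x, float(np.mean(x)))  # ratio is ~0 at the sample mean
```

Wilks-type results state that this quantity is asymptotically chi-squared at the true mean, which is what makes likelihood-ratio confidence regions possible without a parametric model.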
8.  Longitudinal functional principal component modeling via Stochastic Approximation Monte Carlo 
The authors consider the analysis of hierarchical longitudinal functional data based upon a functional principal components approach. In contrast to standard frequentist approaches to selecting the number of principal components, the authors do model averaging using a Bayesian formulation. A relatively straightforward reversible jump Markov Chain Monte Carlo formulation has poor mixing properties and in simulated data often becomes trapped at the wrong number of principal components. In order to overcome this, the authors show how to apply Stochastic Approximation Monte Carlo (SAMC) to this problem, a method that has the potential to explore the entire space and does not become trapped in local extrema. The combination of reversible jump methods and SAMC in hierarchical longitudinal functional data is simplified by a polar coordinate representation of the principal components. The approach is easy to implement and does well in simulated data in determining the distribution of the number of principal components, and in terms of its frequentist estimation properties. Empirical applications are also presented.
PMCID: PMC2915799  PMID: 20689648
Functional data analysis; Hierarchical models; Longitudinal data; Markov chain Monte Carlo; Principal components; Stochastic approximation
9.  Rapid publication-ready MS-Word tables for two-way ANOVA 
SpringerPlus  2015;4:33.
Statistical tables are an essential component of scientific papers and reports in biomedical and agricultural sciences. Measurements in these tables are summarized as mean ± SEM for each treatment group. Results from pairwise-comparison tests are often included using letter displays, in which treatment means that are not significantly different are followed by a common letter. However, the traditional manual processes for computation and presentation of statistically significant outcomes in MS-Word tables using a letter-based algorithm are tedious and prone to errors.
Using the R package ‘Shiny’, we present a web-based program that is freely available online; no download is required. The program is capable of rapidly generating publication-ready tables containing two-way analysis of variance (ANOVA) results. Additionally, the software can perform multiple comparisons of means using the Duncan, Student-Newman-Keuls, Tukey-Kramer, Westfall, and Fisher’s least significant difference (LSD) tests. If the LSD test is selected, multiple methods (e.g., Bonferroni and Holm) are available for adjusting p-values. Significance statements resulting from all pairwise comparisons are included in the table using the popular letter-display algorithm. With our software, a two-way ANOVA can be completed within seconds in a web browser, preferably Mozilla Firefox or Google Chrome, with a few mouse clicks. To our knowledge, none of the currently available commercial (e.g., Stata, SPSS, and SAS) or open-source software (e.g., R and Python) can perform such a rapid task without advanced knowledge of the corresponding programming language.
The new and user-friendly program described in this paper should help scientists perform statistical analysis and rapidly generate publication-ready MS-Word tables for two-way ANOVA. Our software is expected to facilitate research in agriculture, biomedicine, and other fields of life sciences.
PMCID: PMC4305362  PMID: 25635246
Two-way ANOVA; Multiple comparisons; Online software; Statistical analysis; Biology; Agriculture; R; Shiny
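The letter display itself is a small algorithm: after sorting treatments by mean, each maximal run of treatments that are pairwise not significantly different shares one letter. A simplified sketch, assuming (as LSD-type tests guarantee) that non-significant sets are contiguous after sorting; `means` and `nonsig` below are hypothetical:

```python
def letter_display(means, nonsig):
    """Compact letter display: treatments sorted by mean, each maximal
    run of pairwise non-significant treatments gets one letter.
    `nonsig[i][j]` is True when treatments i and j are NOT
    significantly different. Simplified sketch of the idea."""
    order = sorted(range(len(means)), key=lambda i: -means[i])
    letters = [""] * len(means)
    prev_end, code = -1, ord("a")
    for start in range(len(order)):
        end = start
        # grow the run while every pair inside it stays non-significant
        while end + 1 < len(order) and all(
            nonsig[order[a]][order[b]]
            for a in range(start, end + 2) for b in range(a + 1, end + 2)
        ):
            end += 1
        if end > prev_end:  # emit only runs not contained in an earlier one
            for k in range(start, end + 1):
                letters[order[k]] += chr(code)
            code += 1
            prev_end = end
    return letters

means = [10.0, 9.0, 5.0, 4.8]
T, F = True, False
nonsig = [[T, T, F, F],   # symmetric: True = pair NOT significantly different
          [T, T, F, F],
          [F, F, T, T],
          [F, F, T, T]]
cld = letter_display(means, nonsig)   # → ['a', 'a', 'b', 'b']
```

Treatments sharing any letter are statistically indistinguishable, which is exactly the annotation the tool places next to each mean ± SEM cell.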
10.  A Design-Adaptive Local Polynomial Estimator for the Errors-in-Variables Problem 
Local polynomial estimators are popular techniques for nonparametric regression estimation and have received great attention in the literature. Their simplest version, the local constant estimator, can be easily extended to the errors-in-variables context by exploiting its similarity with the deconvolution kernel density estimator. The generalization of the higher-order versions of the estimator, however, is not straightforward and has remained an open problem for the last 15 years. We propose an innovative local polynomial estimator of any order in the errors-in-variables context, derive its design-adaptive asymptotic properties and study its finite sample performance on simulated examples. We not only provide a solution to a long-standing open problem, but also make methodological contributions to errors-in-variables regression, including local polynomial estimation of derivative functions.
PMCID: PMC2846380  PMID: 20351800
Bandwidth selector; Deconvolution; Inverse problems; Local polynomial; Measurement errors; Nonparametric regression; Replicated measurements
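The deconvolution kernel density estimator that underlies the local constant case has a convenient closed form when the measurement error is Laplace and the kernel is Gaussian; a numerical sketch under those textbook assumptions (not the paper's general estimator):

```python
import numpy as np

def decon_density(x_grid, w, h, sigma):
    """Deconvolution kernel density estimate for W = X + U with
    Laplace(0, sigma) error and a Gaussian kernel. For this error the
    deconvoluting kernel is closed-form:
        L(u) = phi(u) * (1 + (sigma/h)**2 * (1 - u**2)),
    and f_hat(x) = (1/(n*h)) * sum_i L((x - W_i)/h)."""
    w = np.asarray(w, dtype=float)
    u = (x_grid[:, None] - w[None, :]) / h
    phi = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    L = phi * (1.0 + (sigma / h) ** 2 * (1.0 - u ** 2))
    return L.mean(axis=1) / h

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 500)          # latent variable of interest
noise = rng.laplace(0.0, 0.3, 500)     # measurement error
w_obs = x + noise
grid = np.linspace(-4, 4, 81)
f_hat = decon_density(grid, w_obs, h=0.4, sigma=0.3)
```

The extra `(sigma/h)**2 * (1 - u**2)` term "undoes" the blurring from the error distribution; it is also why deconvolution estimates can dip negative in the tails.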
11.  Testing in semiparametric models with interaction, with applications to gene-environment interactions 
Motivated by the problem of testing for genetic effects on complex traits in the presence of gene-environment interaction, we develop score tests in general semiparametric regression problems that involve a Tukey-style 1 degree-of-freedom form of interaction between parametrically and non-parametrically modelled covariates. We find that the score test in this type of model, as recently developed by Chatterjee and co-workers in the fully parametric setting, is biased and requires undersmoothing to be valid in the presence of non-parametric components. Moreover, in the presence of repeated outcomes, the asymptotic distribution of the score test depends on the estimation of functions which are defined as solutions of integral equations, making implementation difficult and computationally taxing. We develop profiled score statistics which are unbiased and asymptotically efficient and can be performed by using standard bandwidth selection methods. In addition, to overcome the difficulty of solving functional equations, we give easy interpretations of the target functions, which in turn allow us to develop estimation procedures that can be easily implemented by using standard computational methods. We present simulation studies to evaluate the type I error and power of the proposed method compared with a naive test that does not consider interaction. Finally, we illustrate our methodology by analysing data from a case-control study of colorectal adenoma that was designed to investigate the association between colorectal adenoma and the candidate gene NAT2 in relation to smoking history.
PMCID: PMC2762226  PMID: 19838317
Additive models; Diplotypes; Function estimation; Non-parametric regression; Omnibus hypothesis testing; Partially linear model; Repeated measures; Score test; Semiparametric models; Smooth backfitting; Tukey’s 1 degree-of-freedom model
12.  Selecting the Number of Principal Components in Functional Data 
Journal of the American Statistical Association  2013;108(504):10.1080/01621459.2013.788980.
Functional principal component analysis (FPCA) has become the most widely used dimension reduction tool for functional data analysis. We consider functional data measured at random, subject-specific time points, contaminated with measurement error, allowing for both sparse and dense functional data, and propose novel information criteria to select the number of principal components in such data. We propose a Bayesian information criterion based on marginal modeling that can consistently select the number of principal components for both sparse and dense functional data. For dense functional data, we also develop an Akaike information criterion (AIC) based on the expected Kullback-Leibler information under a Gaussian assumption. Connecting with factor analysis in multivariate time series data, we also consider the information criteria by Bai & Ng (2002) and show that they are still consistent for dense functional data, if a prescribed undersmoothing scheme is undertaken in the FPCA algorithm. We perform intensive simulation studies and show that the proposed information criteria vastly outperform existing methods for this type of data. Surprisingly, our empirical evidence shows that our information criteria proposed for dense functional data also perform well for sparse functional data. An empirical example using colon carcinogenesis data is also provided to illustrate the results.
PMCID: PMC3872138  PMID: 24376287
Akaike information criterion; Bayesian information criterion; Functional data analysis; Kernel smoothing; Principal components
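As a baseline for comparison, the simplest selector for the number of components on dense, nearly noiseless curves is the proportion-of-variance-explained rule; the paper's information criteria refine this and also cover sparse, error-contaminated observations. An illustrative sketch:

```python
import numpy as np

def select_k_pve(curves, threshold=0.95):
    """Toy FPC-number selector for densely observed curves:
    eigendecompose the sample covariance matrix and take the smallest K
    whose leading eigenvalues explain `threshold` of total variance.
    A common baseline, not the paper's BIC/AIC criteria."""
    centered = curves - curves.mean(axis=0)
    cov = centered.T @ centered / len(curves)
    evals = np.linalg.eigvalsh(cov)[::-1]          # descending eigenvalues
    pve = np.cumsum(evals) / evals.sum()
    return int(np.searchsorted(pve, threshold) + 1)

rng = np.random.default_rng(2)
t = np.linspace(0, 1, 40)
scores = rng.normal(size=(100, 2)) * [2.0, 1.0]            # two true components
phi = np.vstack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)])
curves = scores @ phi + rng.normal(scale=0.05, size=(100, 40))
k = select_k_pve(curves)   # → 2 for this nearly noiseless example
```

The rule breaks down for sparse or noisy designs, where noise eigenvalues contaminate the spectrum — precisely the regime the paper's marginal-likelihood criteria are built for.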
13.  Shrinkage Estimators for Robust and Efficient Inference in Haplotype-Based Case-Control Studies 
Case-control association studies often aim to investigate the role of genes and gene-environment interactions in terms of the underlying haplotypes (i.e., the combinations of alleles at multiple genetic loci along chromosomal regions). The goal of this article is to develop robust but efficient approaches to the estimation of disease odds-ratio parameters associated with haplotypes and haplotype-environment interactions. We consider “shrinkage” estimation techniques that can adaptively relax the model assumptions of Hardy-Weinberg-Equilibrium and gene-environment independence required by recently proposed efficient “retrospective” methods. Our proposal involves first development of a novel retrospective approach to the analysis of case-control data, one that is robust to the nature of the gene-environment distribution in the underlying population. Next, it involves shrinkage of the robust retrospective estimator toward a more precise, but model-dependent, retrospective estimator using novel empirical Bayes and penalized regression techniques. Methods for variance estimation are proposed based on asymptotic theories. Simulations and two data examples illustrate both the robustness and efficiency of the proposed methods.
PMCID: PMC2679507  PMID: 19430598
Empirical Bayes; Genetic epidemiology; LASSO (Least Absolute Shrinkage and Selection Operator); Model averaging; Model robustness; Model selection
14.  Multiple Indicators, Multiple Causes Measurement Error Models 
Statistics in medicine  2014;33(25):4469-4481.
Multiple Indicators, Multiple Causes Models (MIMIC) are often employed by researchers studying the effects of an unobservable latent variable on a set of outcomes, when causes of the latent variable are observed. There are times however when the causes of the latent variable are not observed because measurements of the causal variable are contaminated by measurement error. The objectives of this paper are: (1) to develop a novel model by extending the classical linear MIMIC model to allow both Berkson and classical measurement errors, defining the MIMIC measurement error (MIMIC ME) model, (2) to develop likelihood based estimation methods for the MIMIC ME model, (3) to apply the newly defined MIMIC ME model to atomic bomb survivor data to study the impact of dyslipidemia and radiation dose on the physical manifestations of dyslipidemia. As a by-product of our work, we also obtain a data-driven estimate of the variance of the classical measurement error associated with an estimate of the amount of radiation dose received by atomic bomb survivors at the time of their exposure.
PMCID: PMC4184955  PMID: 24962535
Atomic bomb survivor data; Berkson error; Dyslipidemia; Instrumental Variables; Latent variables; Measurement error; MIMIC models
15.  Bayesian Semiparametric Density Deconvolution in the Presence of Conditionally Heteroscedastic Measurement Errors 
We consider the problem of estimating the density of a random variable when precise measurements on the variable are not available, but replicated proxies contaminated with measurement error are available for sufficiently many subjects. Under the assumption of additive measurement errors this reduces to a problem of deconvolution of densities. Deconvolution methods often make restrictive and unrealistic assumptions about the density of interest and the distribution of measurement errors, e.g., normality and homoscedasticity and thus independence from the variable of interest. This article relaxes these assumptions and introduces novel Bayesian semiparametric methodology based on Dirichlet process mixture models for robust deconvolution of densities in the presence of conditionally heteroscedastic measurement errors. In particular, the models can adapt to asymmetry, heavy tails and multimodality. In simulation experiments, we show that our methods vastly outperform a recent Bayesian approach based on estimating the densities via mixtures of splines. We apply our methods to data from nutritional epidemiology. Even in the special case when the measurement errors are homoscedastic, our methodology is novel and dominates other methods that have been proposed previously. Additional simulation results, instructions on getting access to the data set and R programs implementing our methods are included as part of online supplemental materials.
PMCID: PMC4219602  PMID: 25378893
B-spline; Conditional heteroscedasticity; Density deconvolution; Dirichlet process mixture models; Measurement errors; Skew-normal distribution; Variance function
The Annals of Applied Statistics  2013;41(6):2739-2767.
The data functions that are studied in the course of functional data analysis are assembled from discrete data, and the level of smoothing that is used is generally that which is appropriate for accurate approximation of the conceptually smooth functions that were not actually observed. Existing literature shows that this approach is effective, and even optimal, when using functional data methods for prediction or hypothesis testing. However, in the present paper we show that this approach is not effective in classification problems. There a useful rule of thumb is that undersmoothing is often desirable, but there are several surprising qualifications to that approach. First, the effect of smoothing the training data can be more significant than that of smoothing the new data set to be classified; second, undersmoothing is not always the right approach, and in fact in some cases using a relatively large bandwidth can be more effective; and third, these perverse results are the consequence of very unusual properties of error rates, expressed as functions of smoothing parameters. For example, the orders of magnitude of optimal smoothing parameter choices depend on the signs and sizes of terms in an expansion of error rate, and those signs and sizes can vary dramatically from one setting to another, even for the same classifier.
PMCID: PMC4191932  PMID: 25309640
Centroid method; discrimination; kernel smoothing; quadratic discrimination; smoothing parameter choice; training data
17.  Controlling the local false discovery rate in the adaptive Lasso 
Biostatistics (Oxford, England)  2013;14(4):653-666.
The Lasso shrinkage procedure achieved its popularity, in part, by its tendency to shrink estimated coefficients to zero, and its ability to serve as a variable selection procedure. Using data-adaptive weights, the adaptive Lasso modified the original procedure to increase the penalty terms for those variables estimated to be less important by ordinary least squares. Although this modified procedure attained the oracle properties, the resulting models tend to include a large number of “false positives” in practice. Here, we adapt the concept of local false discovery rates (lFDRs) so that it applies to the sequence, λn, of smoothing parameters for the adaptive Lasso. We define the lFDR for a given λn to be the probability that the variable added to the model by decreasing λn to λn−δ is not associated with the outcome, where δ is a small value. We derive the relationship between the lFDR and λn, show lFDR=1 for traditional smoothing parameters, and show how to select λn so as to achieve a desired lFDR. We compare the smoothing parameters chosen to achieve a specified lFDR and those chosen to achieve the oracle properties, as well as their resulting estimates for model coefficients, with both simulation and an example from a genetic study of prostate specific antigen.
PMCID: PMC3769997  PMID: 23575212
Adaptive Lasso; Local false discovery rate; Smoothing parameter; Variable selection
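The adaptive Lasso underlying this work can be sketched with a short coordinate-descent loop: each coefficient's L1 penalty is weighted by 1/|β_OLS|, so variables deemed unimportant by least squares are penalized harder. This illustrates the estimator only; the paper's lFDR-based choice of the smoothing parameter λn is not reproduced here.

```python
import numpy as np

def adaptive_lasso(X, y, lam, n_iter=200):
    """Coordinate descent for (1/(2n))||y - Xb||^2 + lam * sum_j w_j|b_j|
    with adaptive weights w_j = 1/|b_j_OLS|. Minimal sketch."""
    n, p = X.shape
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    w = 1.0 / np.maximum(np.abs(beta_ols), 1e-8)   # adaptive penalty weights
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]   # partial residual for j
            rho = X[:, j] @ r
            # soft-threshold update with weighted penalty
            beta[j] = np.sign(rho) * max(abs(rho) - lam * w[j] * n, 0.0) / col_ss[j]
    return beta

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)
beta = adaptive_lasso(X, y, lam=0.1)   # noise coefficients are zeroed out
```

Decreasing `lam` traces out the path along which the paper defines the lFDR of each newly admitted variable.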
18.  Oral Administration of Interferon Tau Enhances Oxidation of Energy Substrates and Reduces Adiposity in Zucker Diabetic Fatty Rats 
BioFactors (Oxford, England)  2013;39(5):10.1002/biof.1113.
Male Zucker diabetic fatty (ZDF) rats were used to study effects of oral administration of interferon tau (IFNT) in reducing obesity. Eighteen ZDF rats (28 days of age) were assigned randomly to receive 0, 4 or 8 μg IFNT/kg body weight (BW) per day (n=6/group) for 8 weeks. Water consumption was measured every two days. Food intake and BW were recorded weekly. Energy expenditure in 4-, 6-, 8-, and 10-week-old rats was determined using indirect calorimetry. Starting at 7 weeks of age, urinary glucose and ketone bodies were tested daily. Rates of glucose and oleate oxidation in liver, brown adipose tissue, and abdominal adipose tissue, leucine catabolism in skeletal muscle, and lipolysis in white and brown adipose tissues were greater for rats treated with 8 μg IFNT/kg BW/day in comparison with control rats. Treatment with 8 μg IFNT/kg BW/day increased heat production, reduced BW gain and adiposity, ameliorated fatty liver syndrome, delayed the onset of diabetes, and decreased concentrations of glucose, free fatty acids, triacylglycerol, cholesterol, and branched-chain amino acids in plasma, compared to control rats. Oral administration of 8 μg IFNT/kg BW/day ameliorated oxidative stress in skeletal muscle, liver and adipose tissue, as indicated by decreased ratios of oxidized glutathione to reduced glutathione and increased concentrations of the antioxidant tetrahydrobiopterin. These results indicate that IFNT stimulates oxidation of energy substrates and reduces obesity in ZDF rats and may have broad important implications for preventing and treating obesity-related diseases in mammals.
PMCID: PMC3786024  PMID: 23804503
Obesity; interferon tau; diabetes; energy expenditure; tissue oxidation
19.  Structured variable selection with q-values 
Biostatistics (Oxford, England)  2013;14(4):695-707.
When some of the regressors can act on both the response and other explanatory variables, the already challenging problem of selecting variables when the number of covariates exceeds the sample size becomes more difficult. A motivating example is a metabolic study in mice that has diet groups and gut microbial percentages that may affect changes in multiple phenotypes related to body weight regulation. The data have more variables than observations, and diet is known to act directly on the phenotypes as well as on some or potentially all of the microbial percentages. Interest lies in determining which gut microflora influence the phenotypes while accounting for the direct relationship between diet and the other variables. A new methodology for variable selection in this context is presented that links the concept of q-values from multiple hypothesis testing to the recently developed weighted Lasso.
PMCID: PMC3841382  PMID: 23580317
False discovery rate; Microbial data; q-Values; Variable selection; Weighted Lasso
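The q-values that drive the weighting can be computed from p-values by the standard Benjamini-Hochberg construction (taking the proportion of true nulls to be 1); a minimal sketch:

```python
import numpy as np

def bh_qvalues(pvals):
    """Benjamini-Hochberg q-values: the smallest FDR level at which each
    test would be rejected. Standard construction with the null
    proportion set to 1; the paper feeds such q-values into a weighted
    Lasso penalty."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)            # p_(i) * m / i
    q_sorted = np.minimum.accumulate(ranked[::-1])[::-1]   # enforce monotonicity
    q = np.empty(m)
    q[order] = np.minimum(q_sorted, 1.0)
    return q

q = bh_qvalues([0.001, 0.01, 0.04, 0.2, 0.5])
```

A small q-value then translates into a light penalty weight, so variables with strong marginal evidence are easier for the Lasso to retain.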
20.  Rapid publication-ready MS-Word tables for one-way ANOVA 
SpringerPlus  2014;3:474.
Statistical tables are an important component of data analysis and reports in biological sciences. However, the traditional manual processes for computation and presentation of statistically significant results using a letter-based algorithm are tedious and prone to errors.
Based on the R language, we present two web-based software programs, for individual and summary data respectively, which are freely available online. No download is required. The programs are capable of rapidly generating publication-ready tables containing one-way analysis of variance (ANOVA) results. Additionally, the software can perform multiple comparisons of means using the Duncan, Student-Newman-Keuls, Tukey-Kramer, and Fisher’s least significant difference (LSD) tests. If the LSD test is selected, multiple methods (e.g., Bonferroni and Holm) are available for adjusting p-values. Using the software, a one-way ANOVA can be completed within seconds in a web browser, preferably Mozilla Firefox or Google Chrome, with a few mouse clicks. Furthermore, the software can handle one-way ANOVA for summary data (i.e., sample size, mean, and SD or SEM per treatment group) with post-hoc multiple comparisons among treatment means. To our knowledge, none of the currently available commercial (e.g., SPSS and SAS) or open-source software (e.g., R and Python) can perform such a rapid task without advanced knowledge of the corresponding programming language.
Our new, user-friendly software for performing statistical analysis and generating publication-ready MS-Word tables for one-way ANOVA is expected to facilitate research in agriculture, biomedicine, and other fields of life sciences.
PMCID: PMC4153873  PMID: 25191639
Statistical analysis; Multiple comparisons; Online software; Computation; Biology; R; Shiny
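The pipeline the software automates is a one-way ANOVA followed by pairwise comparisons of treatment means. A hand-rolled sketch of that pipeline, using Fisher's LSD with Holm-adjusted p-values (the function names and adjustment choice are illustrative; the web tools themselves are built on R and Shiny, not Python):

```python
import numpy as np
from scipy import stats

def one_way_anova(groups):
    """One-way ANOVA from raw data; returns F, p, MSE, and error df."""
    k = len(groups)
    N = sum(len(g) for g in groups)
    grand = np.mean(np.concatenate(groups))
    ss_between = sum(len(g) * (np.mean(g) - grand) ** 2 for g in groups)
    ss_within = sum(((g - np.mean(g)) ** 2).sum() for g in groups)
    df1, df2 = k - 1, N - k
    mse = ss_within / df2
    F = (ss_between / df1) / mse
    return F, stats.f.sf(F, df1, df2), mse, df2

def lsd_holm(groups, mse, df):
    """Fisher's LSD pairwise t-tests with Holm step-down p-value adjustment."""
    pairs, pvals = [], []
    for i in range(len(groups)):
        for j in range(i + 1, len(groups)):
            diff = np.mean(groups[i]) - np.mean(groups[j])
            se = np.sqrt(mse * (1 / len(groups[i]) + 1 / len(groups[j])))
            pairs.append((i, j))
            pvals.append(2 * stats.t.sf(abs(diff / se), df))
    order = np.argsort(pvals)
    m, running, adj = len(pvals), 0.0, np.empty(len(pvals))
    for rank, idx in enumerate(order):
        running = max(running, (m - rank) * pvals[idx])   # Holm step-down
        adj[idx] = min(running, 1.0)
    return dict(zip(pairs, adj))
```

The letter-based displays the abstract mentions are derived from exactly these adjusted pairwise p-values: groups not significantly different share a letter.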
21.  Multiple imputation in quantile regression 
Biometrika  2012;99(2):423-438.
We propose a multiple imputation estimator for parameter estimation in a quantile regression model when some covariates are missing at random. The estimation procedure fully utilizes the entire dataset to achieve increased efficiency, and the resulting coefficient estimators are root-n consistent and asymptotically normal. To protect against possible model misspecification, we further propose a shrinkage estimator, which automatically adjusts for possible bias. The finite sample performance of our estimator is investigated in a simulation study. Finally, we apply our methodology to part of the Eating at America’s Table Study data, investigating the association between two measures of dietary intake.
PMCID: PMC4059083  PMID: 24944347
Missing data; Multiple imputation; Quantile regression; Regression quantile; Shrinkage estimation
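The mechanics of the approach can be sketched in numpy/scipy: quantile regression via its standard linear-programming formulation, plus a deliberately simplified imputation step. This is only a toy version of the idea; a proper imputation model would also condition on the outcome, and the paper's estimator (and its shrinkage refinement) is considerably more careful. All data and model choices below are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def quantile_reg(X, y, tau=0.5):
    """Quantile regression via its linear-programming formulation:
    minimize tau*u+ + (1-tau)*u-  subject to  X b + u+ - u- = y."""
    n, p = X.shape
    c = np.concatenate([np.zeros(p), tau * np.ones(n), (1 - tau) * np.ones(n)])
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])
    bounds = [(None, None)] * p + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    return res.x[:p]

def mi_quantile_reg(X, y, miss, tau=0.5, m=10, seed=0):
    """Toy multiple-imputation estimator: impute the missing covariate from a
    normal linear model fitted to complete cases, fit the quantile regression
    on each completed data set, and average the coefficient vectors."""
    rng = np.random.default_rng(seed)
    obs = ~miss
    A = np.column_stack([np.ones(obs.sum()), X[obs, 0]])
    coef, *_ = np.linalg.lstsq(A, X[obs, 1], rcond=None)
    sigma = (X[obs, 1] - A @ coef).std(ddof=2)
    fits = []
    for _ in range(m):
        Xi = X.copy()
        Xi[miss, 1] = coef[0] + coef[1] * X[miss, 0] + rng.normal(0, sigma, miss.sum())
        fits.append(quantile_reg(np.column_stack([np.ones(len(y)), Xi]), y, tau))
    return np.mean(fits, axis=0)

# toy data: covariate x1 missing at random for roughly 30% of subjects
rng = np.random.default_rng(3)
n = 100
x0 = rng.standard_normal(n)
x1 = 0.5 * x0 + rng.standard_normal(n)
y = 1.0 + 2.0 * x0 + x1 + rng.normal(0, 0.5, n)
miss = rng.random(n) < 0.3
beta_hat = mi_quantile_reg(np.column_stack([x0, x1]), y, miss)
```

Because the toy imputation model ignores the outcome, the coefficient on the imputed covariate is somewhat attenuated; the paper's procedure is designed to avoid exactly this kind of bias.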
22.  Analysis of energy expenditure in diet-induced obese rats 
Development of obesity in animals is affected by energy intake, dietary composition, and metabolism. Useful models for studying this metabolic problem are Sprague-Dawley rats fed low-fat (LF) or high-fat (HF) diets beginning at 28 days of age. Through experimental design, their dietary intakes of energy, protein, vitamins, and minerals per kg body weight (BW) do not differ in order to eliminate confounding factors in data interpretation. The 24-h energy expenditure of rats is measured using indirect calorimetry. A regression model is constructed to accurately predict BW gain based on diet, initial BW gain, and the principal component scores of respiratory quotient and heat production. Time-course data on metabolism (including energy expenditure) are analyzed using a mixed effect model that fits both fixed and random effects. Cluster analysis is employed to classify rats as normal-weight or obese. HF-fed rats are heavier than LF-fed rats, but rates of their heat production per kg non-fat mass do not differ. We conclude that metabolic conversion of dietary lipids into body fat primarily contributes to obesity in HF-fed rats.
PMCID: PMC4048864  PMID: 24896330
Obesity; Energy Expenditure; Indirect Calorimetry; Statistical Analysis; Review
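The cluster-analysis step, classifying rats as normal-weight or obese, can be illustrated with a tiny two-cluster k-means on a single feature such as final body weight. The abstract does not specify the clustering algorithm used, so k-means here is an assumption, and the weights below are invented for illustration.

```python
import numpy as np

def two_means(x, n_iter=50):
    """Two-cluster k-means on one feature (e.g., final body weight, g).
    Deterministic initialization at the data extremes; returns an 'obese'
    flag for membership in the heavier cluster, plus the sorted centers."""
    centers = np.array([x.min(), x.max()], dtype=float)
    for _ in range(n_iter):
        labels = np.abs(x[:, None] - centers[None, :]).argmin(axis=1)
        for k in range(2):
            if np.any(labels == k):
                centers[k] = x[labels == k].mean()
    return labels == centers.argmax(), np.sort(centers)

# synthetic weights: 30 lighter (LF-fed-like) and 30 heavier (HF-fed-like) rats
rng = np.random.default_rng(2)
weights_g = np.concatenate([rng.normal(400, 15, 30), rng.normal(560, 15, 30)])
obese, centers = two_means(weights_g)
```

With well-separated groups the heavier cluster's members are flagged as obese, mirroring the normal-weight/obese classification described in the abstract.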
23.  A Measurement Error Model for Physical Activity Level as Measured by a Questionnaire With Application to the 1999–2006 NHANES Questionnaire 
American Journal of Epidemiology  2013;177(11):1199-1208.
Systematic investigations into the structure of measurement error of physical activity questionnaires are lacking. We propose a measurement error model for a physical activity questionnaire that uses physical activity level (the ratio of total energy expenditure to basal energy expenditure) to relate questionnaire-based reports of physical activity level to true physical activity levels. The 1999–2006 National Health and Nutrition Examination Survey physical activity questionnaire was administered to 433 participants aged 40–69 years in the Observing Protein and Energy Nutrition (OPEN) Study (Maryland, 1999–2000). Valid estimates of participants’ total energy expenditure were also available from doubly labeled water, and basal energy expenditure was estimated from an equation; the ratio of those measures estimated true physical activity level (“truth”). We present a measurement error model that accommodates the mixture of errors that arise from assuming a classical measurement error model for doubly labeled water and a Berkson error model for the equation used to estimate basal energy expenditure. The method was then applied to the OPEN Study. Correlations between the questionnaire-based physical activity level and truth were modest (r = 0.32–0.41); attenuation factors (0.43–0.73) indicate that the use of questionnaire-based physical activity level would lead to attenuated estimates of effect size. Results suggest that sample sizes for estimating relationships between physical activity level and disease should be inflated, and that regression calibration can be used to provide measurement error–adjusted estimates of relationships between physical activity and disease.
PMCID: PMC3664335  PMID: 23595007
Berkson model; bias; energy metabolism; measurement error model; models, statistical; motor activity; self-assessment
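The attenuation and regression-calibration logic in this abstract is easy to reproduce in a small simulation: a classical-error questionnaire measure attenuates the slope on the true physical activity level, and a calibration model fitted in a validation substudy (where the reference measure is observed, as in OPEN) recovers it. All numeric values below are invented for illustration and do not correspond to OPEN estimates.

```python
import numpy as np

def attenuation_demo(n=10000, n_val=2000, seed=1):
    """Simulate classical measurement error in a questionnaire-based PAL and
    correct the naive slope by regression calibration using a validation
    substudy in which the reference measure ('truth') is available."""
    rng = np.random.default_rng(seed)
    truth = rng.normal(1.7, 0.2, n)                      # true PAL
    report = 0.1 + 0.5 * truth + rng.normal(0, 0.3, n)   # error-prone report
    y = 2.0 * truth + rng.normal(0, 1.0, n)              # outcome on truth

    def slope(x, z):
        return np.cov(x, z, ddof=1)[0, 1] / np.var(x, ddof=1)

    naive = slope(report, y)                   # attenuated estimate
    # calibration model E[truth | report], fitted on the validation subset only
    lam = slope(report[:n_val], truth[:n_val])
    a0 = truth[:n_val].mean() - lam * report[:n_val].mean()
    corrected = slope(a0 + lam * report, y)    # regression calibration
    return naive, corrected

naive, corrected = attenuation_demo()
```

The naive slope comes out well below the true value of 2.0, while the calibrated slope recovers it, which is the sense in which questionnaire-based physical activity level leads to attenuated effect sizes unless corrected.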
24.  Regression Calibration Is Valid When Properly Applied 
Epidemiology (Cambridge, Mass.)  2013;24(3):466-467.
PMCID: PMC4035111  PMID: 23549186
25.  Parameter Estimation of Partial Differential Equation Models 
Journal of the American Statistical Association  2013;108(503):10.1080/01621459.2013.794730.
Partial differential equation (PDE) models are commonly used to model complex dynamic systems in applied sciences such as biology and finance. The forms of these PDE models are usually proposed by experts based on their prior knowledge and understanding of the dynamic system. Parameters in PDE models often have interesting scientific interpretations, but their values are often unknown and need to be estimated from measurements of the dynamic system in the presence of measurement error. Most PDEs used in practice have no analytic solutions, and can only be solved with numerical methods. Currently, methods for estimating PDE parameters require repeatedly solving PDEs numerically under thousands of candidate parameter values, and thus the computational load is high. In this article, we propose two methods to estimate parameters in PDE models: a parameter cascading method and a Bayesian approach. In both methods, the underlying dynamic process modeled with the PDE is represented via basis function expansion. For the parameter cascading method, we develop two nested levels of optimization to estimate the PDE parameters. For the Bayesian method, we develop a joint model for the data and the PDE, and a novel hierarchical model allowing us to employ Markov chain Monte Carlo (MCMC) techniques for posterior inference. Simulation studies show that the Bayesian method and parameter cascading method are comparable, and both outperform other available methods in terms of estimation accuracy. The two methods are demonstrated by estimating parameters in a PDE model from LIDAR data.
PMCID: PMC3867159  PMID: 24363476
Asymptotic theory; Basis function expansion; Bayesian method; Differential equations; Measurement error; Parameter cascading
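The computational burden the abstract describes, re-solving the PDE for every candidate parameter value, is easy to see in a toy version: estimating the diffusion coefficient θ in the heat equation u_t = θ·u_xx by grid search over numerical solves. The grid, step sizes, and noise level are illustrative; the paper's parameter-cascading and MCMC methods are designed precisely to avoid this repeated solving.

```python
import numpy as np

def solve_heat(theta, u0, dx, dt, n_steps):
    """Explicit finite differences for u_t = theta*u_xx with zero boundary
    values; stable when theta*dt/dx**2 <= 0.5."""
    u = u0.copy()
    r = theta * dt / dx ** 2
    for _ in range(n_steps):
        u[1:-1] += r * (u[2:] - 2 * u[1:-1] + u[:-2])
    return u

# simulate noisy observations of the solution at time t = n_steps*dt
x = np.linspace(0.0, 1.0, 51)
dx, dt, n_steps = x[1] - x[0], 1e-3, 200
u0 = np.sin(np.pi * x)
rng = np.random.default_rng(4)
data = solve_heat(0.1, u0, dx, dt, n_steps) + rng.normal(0, 0.005, x.size)

# naive estimator: one full numerical solve per candidate theta
grid = np.linspace(0.05, 0.15, 21)
sse = [np.sum((solve_heat(th, u0, dx, dt, n_steps) - data) ** 2) for th in grid]
theta_hat = grid[int(np.argmin(sse))]
```

Even this one-parameter toy needs 21 full solves; with thousands of candidates or MCMC iterations the cost multiplies, which motivates replacing the solver with a basis-function representation of the solution as the paper does.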
