With the advent of Internet-based 24-hour recall (24HR) instruments, it is now possible to envision their use in cohort studies investigating the relation between nutrition and disease. Recognizing that all dietary assessment instruments are subject to measurement error, and correcting for it under the assumption that the 24HR is unbiased for usual intake, the authors simultaneously address precision, power, and sample size under the following 3 conditions: 1) 1–12 24HRs; 2) a single calibrated food frequency questionnaire (FFQ); and 3) a combination of 24HR and FFQ data. Using data from the Eating at America’s Table Study (1997–1998), the authors found that 4–6 administrations of the 24HR are optimal for most nutrients and food groups and that combined use of multiple 24HRs and FFQ data sometimes provides data superior to those from either method alone, especially for foods that are not regularly consumed. For all food groups but the most rarely consumed, use of 2–4 recalls alone, with or without additional FFQ data, was superior to use of FFQ data alone. Thus, if self-administered automated 24HRs are to be used in cohort studies, 4–6 administrations of the 24HR should be considered, along with administration of an FFQ.
combining dietary instruments; data collection; dietary assessment; energy adjustment; epidemiologic methods; measurement error; nutrient density; nutrient intake
To show how the variance of the measurement error (ME) associated with individual ancestry proportion estimates can be estimated, especially when the number of ancestral populations (k) is greater than 2.
We extend existing internal consistency measures to estimate the ME variance, and we compare these estimates with the ME variance estimated by use of the repeated measurement (RM) approach. Both approaches work by dividing the genotyped markers into subsets. We examine the effect of the number of subsets and of the allocation of markers to each subset on the performance of each approach. We used simulated data for all comparisons.
Independently of the value of k, the measures of internal reliability provided less biased and more precise estimates of the ME variance than the RM approach did. Both methods tend to perform better when the markers are divided into a large number of similarly sized subsets.
Our results will facilitate the use of ME correction methods to address the ME problem in individual ancestry proportion estimates. Our method will improve the ability to control for type I error inflation and loss of power in association tests and other genomic research involving ancestry estimates.
Population stratification; admixture; type I error inflation; reliability; Cronbach’s alpha; measurement errors; measurement error variance
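A minimal sketch of the internal-consistency idea described above, assuming ancestry proportion estimates have already been computed on disjoint marker subsets; the matrix A, the subset count, and the classical-test-theory conversion from Cronbach's alpha to an ME variance are illustrative, not the authors' exact estimator.

```r
# Hypothetical input: A is an n x m matrix of ancestry proportion estimates
# for one ancestral population, each column computed from a disjoint subset
# of markers (assumed roughly equal in size).
cronbach_alpha <- function(A) {
  m <- ncol(A)
  # alpha = m/(m-1) * (1 - sum of item variances / variance of total)
  m / (m - 1) * (1 - sum(apply(A, 2, var)) / var(rowSums(A)))
}

# Classical test theory: observed variance of the averaged estimate equals
# true-score variance plus ME variance, and alpha estimates the reliability
# (true / observed), so the ME variance of the averaged estimate is:
me_variance <- function(A) {
  (1 - cronbach_alpha(A)) * var(rowMeans(A))
}

# Toy example: true ancestry plus subset-level noise of SD 0.05.
set.seed(1)
truth <- runif(200, 0.1, 0.9)
A <- sapply(1:10, function(j) pmin(pmax(truth + rnorm(200, 0, 0.05), 0), 1))
me_variance(A)  # should be near 0.05^2 / 10 for the averaged estimate
```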
In recent years, genome-wide association studies (GWAS) and gene-expression profiling have generated a large number of valuable datasets for assessing how genetic variations are related to disease outcomes. With such datasets, it is often of interest to assess the overall effect of a set of genetic markers, assembled based on biological knowledge. Genetic marker-set analyses have been advocated as more reliable and powerful approaches compared with the traditional marginal approaches (Curtis and others, 2005. Pathways to the analysis of microarray data. Trends in Biotechnology 23, 429–435; Efroni and others, 2007. Identification of key processes underlying cancer phenotypes using biologic pathway analysis. PLoS One 2, 425). Procedures for testing the overall effect of a marker-set have been actively studied in recent years. For example, score tests derived under an Empirical Bayes (EB) framework (Liu and others, 2007. Semiparametric regression of multidimensional genetic pathway data: least-squares kernel machines and linear mixed models. Biometrics 63, 1079–1088; Liu and others, 2008. Estimation and testing for the effect of a genetic pathway on a disease outcome using logistic kernel machine regression via logistic mixed models. BMC Bioinformatics 9, 292; Wu and others, 2010. Powerful SNP-set analysis for case-control genome-wide association studies. American Journal of Human Genetics 86, 929) have been proposed as powerful alternatives to the standard Rao score test (Rao, 1948. Large sample tests of statistical hypotheses concerning several parameters with applications to problems of estimation. Mathematical Proceedings of the Cambridge Philosophical Society 44, 50–57). The advantages of these EB-based tests are most apparent when the markers are correlated, due to the reduction in the degrees of freedom. In this paper, we propose an adaptive score test which up- or down-weights the contributions from each member of the marker-set based on the Z-scores of their effects. Such an adaptive procedure gains power over the existing procedures when the signal is sparse and the correlation among the markers is weak. By combining evidence from both the EB-based score test and the adaptive test, we further construct an omnibus test that attains good power in most settings. The null distributions of the proposed test statistics can be approximated well either via simple perturbation procedures or via distributional approximations. Through extensive simulation studies, we demonstrate that the proposed procedures perform well in finite samples. We apply the tests to a breast cancer genetic study to assess the overall effect of the FGFR2 gene on breast cancer risk.
Adaptive procedures; Empirical Bayes; GWAS; Pathway analysis; Score test; SNP sets
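As a concrete anchor for the testing problem described above, here is a minimal sketch of the baseline variance-component score statistic for a marker-set, with a permutation approximation to its null distribution; the adaptive weighting and omnibus combination proposed in the paper are not implemented, and the data are simulated.

```r
# G: n x p genotype matrix; y: binary outcome; no covariates, so the null
# model is intercept-only and permuting y is valid under H0.
set.seed(2)
n <- 500; p <- 20
G <- matrix(rbinom(n * p, 2, 0.3), n, p)
y <- rbinom(n, 1, plogis(-1 + 0.3 * G[, 1]))

score_stat <- function(y, G) {
  r <- y - mean(y)            # residuals under the null
  sum((crossprod(G, r))^2)    # Q = r' G G' r, a variance-component score
}

Q_obs <- score_stat(y, G)
Q_null <- replicate(2000, score_stat(sample(y), G))
mean(Q_null >= Q_obs)         # permutation p-value for the marker-set
```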
We describe a new approach to analyze chirp syllables of free-tailed bats from two regions of Texas in which they are predominant: Austin and College Station. Our goal is to characterize any systematic regional differences in the mating chirps and assess whether individual bats have signature chirps. The data are analyzed by modeling spectrograms of the chirps as responses in a Bayesian functional mixed model. Given the variable chirp lengths, we compute the spectrograms on a relative time scale interpretable as the relative chirp position, using a variable window overlap based on chirp length. We use 2D wavelet transforms to capture correlation within the spectrogram in our modeling and obtain adaptive regularization of the estimates and inference for the region-specific spectrograms. Our model includes random effect spectrograms at the bat level to account for correlation among chirps from the same bat, and to assess relative variability in chirp spectrograms within and between bats. The modeling of spectrograms using functional mixed models is a general approach for the analysis of replicated nonstationary time series, such as our acoustical signals, to relate aspects of the signals to various predictors, while accounting for between-signal structure. This can be done on raw spectrograms when all signals are of the same length, and can be done using spectrograms defined on a relative time scale for signals of variable length in settings where the idea of defining correspondence across signals based on relative position is sensible.
Bat Syllable; Bayesian Analysis; Chirp; Functional Data Analysis; Functional Mixed Model; Isomorphic Transformation; Nonstationary Time Series; Software; Spectrogram; Variable Overlap
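A minimal sketch of the relative-time-scale device described above, assuming the signal package; the window length, number of time bins, and toy chirps are illustrative. The overlap is chosen per signal so that every spectrogram has roughly the same number of columns, each interpretable as a relative position within the chirp.

```r
library(signal)

relative_spectrogram <- function(x, fs, n_window = 256, n_bins = 100) {
  # Pick the hop size so the chirp yields ~n_bins time frames, then set the
  # overlap accordingly; longer chirps get smaller overlap, and vice versa.
  step <- max(1, floor((length(x) - n_window) / (n_bins - 1)))
  sg <- specgram(x, n = n_window, Fs = fs, overlap = n_window - step)
  log(abs(sg$S))   # log-magnitude spectrogram with ~n_bins columns
}

# Two toy "chirps" of different lengths map to comparable grids:
fs <- 25000
s1 <- chirp(seq(0, 0.05, 1/fs), f0 = 2000, t1 = 0.05, f1 = 8000)
s2 <- chirp(seq(0, 0.09, 1/fs), f0 = 2000, t1 = 0.09, f1 = 8000)
dim(relative_spectrogram(s1, fs)); dim(relative_spectrogram(s2, fs))
```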
Motivation: Protein abundance in quantitative proteomics is often based on observed spectral features derived from liquid chromatography mass spectrometry (LC-MS) or LC-MS/MS experiments. Peak intensities are largely non-normal in distribution. Furthermore, LC-MS-based proteomics data frequently have large proportions of missing peak intensities due to censoring mechanisms on low-abundance spectral features. Recognizing that the observed peak intensities detected with the LC-MS method are all positive, skewed and often left-censored, we propose using survival methodology to carry out differential expression analysis of proteins. Various standard statistical techniques, including non-parametric tests such as the Kolmogorov–Smirnov and Wilcoxon–Mann–Whitney rank sum tests, and parametric survival and accelerated failure time (AFT) models with log-normal, log-logistic and Weibull distributions, were used to detect any differentially expressed proteins. The statistical operating characteristics of each method are explored using both real and simulated datasets.
Results: Survival methods generally have greater statistical power than standard differential expression methods when the proportion of missing protein level data is 5% or more. In particular, the AFT models we consider consistently achieve greater statistical power than standard testing procedures, with the discrepancy widening as the proportion of missing data increases.
Availability: The testing procedures discussed in this article can all be performed using readily available software such as R. The R code is provided as supplementary material.
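A minimal sketch of the left-censored AFT analysis described above, using the survival package on hypothetical peak intensities with a made-up detection limit; the paper's full comparison across distributions and tests is not reproduced.

```r
library(survival)

set.seed(3)
n <- 60
group <- rep(c("control", "treated"), each = n / 2)
intensity <- exp(rnorm(n, mean = ifelse(group == "treated", 7.5, 7), sd = 1))
lod <- exp(6)                            # hypothetical detection limit
observed <- intensity >= lod             # FALSE = left-censored at the LOD
intensity[!observed] <- lod

# type = "left": status 1 = observed, 0 = left-censored at the recorded value
fit <- survreg(Surv(intensity, observed, type = "left") ~ group,
               dist = "lognormal")
summary(fit)    # the group coefficient is the differential-expression effect
```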
Functional data analysis has received considerable recent attention and a number of successful applications have been reported. In this paper, asymptotically simultaneous confidence bands are obtained for the mean function of the functional regression model, using piecewise constant spline estimation. Simulation experiments corroborate the asymptotic theory. The confidence band procedure is illustrated by analyzing CD4 cell counts of HIV infected patients.
B spline; confidence band; functional data; Karhunen-Loève L2 representation; knots; longitudinal data; strong approximation
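A minimal sketch of the band construction described above, using subject-level bin means as the piecewise-constant estimator; the critical value here comes from a bootstrap of the sup-t statistic rather than the paper's strong-approximation theory, and the data are simulated.

```r
set.seed(4)
n <- 100; m <- 80; grid <- seq(0, 1, length.out = m)
# n curves observed on a common grid: mean sin curve + subject shift + noise
Y <- t(sapply(1:n, function(i)
  sin(2 * pi * grid) + rnorm(1, 0, 0.5) + rnorm(m, 0, 0.3)))

knots <- 10
bin <- cut(grid, knots, labels = FALSE)              # piecewise-constant pieces
B <- sapply(1:knots, function(k) rowMeans(Y[, bin == k, drop = FALSE]))

mu_hat <- colMeans(B)                                # bin means of the mean curve
se_hat <- apply(B, 2, sd) / sqrt(n)

sup_dev <- replicate(500, {                          # bootstrap sup-t statistic
  Bb <- B[sample(n, replace = TRUE), ]
  max(abs(colMeans(Bb) - mu_hat) / se_hat)
})
crit <- quantile(sup_dev, 0.95)
lower <- (mu_hat - crit * se_hat)[bin]               # simultaneous band, expanded
upper <- (mu_hat + crit * se_hat)[bin]               # back to the full grid
```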
Efficient estimation of parameters is a major objective in analyzing longitudinal data. We propose two generalized empirical likelihood-based methods that take into consideration within-subject correlations. A nonparametric version of the Wilks theorem for the limiting distributions of the empirical likelihood ratios is derived. It is shown that one of the proposed methods is locally efficient among a class of within-subject variance-covariance matrices. A simulation study is conducted to investigate the finite-sample properties of the proposed methods and to compare them with the block empirical likelihood method of You et al. (2006) and with the normal approximation using a correctly estimated variance-covariance matrix. The results suggest that the proposed methods are generally more efficient than existing methods that ignore the correlation structure, and achieve better coverage than the normal approximation with correctly specified within-subject correlation. An application illustrating our methods and supporting the simulation results is presented.
Confidence region; Efficient estimation; Empirical likelihood; Longitudinal data; Maximum empirical likelihood estimator
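A minimal sketch of the building block behind the methods described above: an empirical likelihood test applied to subject-level estimating-function contributions, via the emplik package. It uses working independence, so it omits the within-subject covariance weighting that makes the proposed methods efficient; the data are simulated.

```r
library(emplik)

set.seed(5)
n <- 80; t_j <- 1:4; beta0 <- 1.5
dat <- do.call(rbind, lapply(1:n, function(i) {
  b <- rnorm(1, 0, 0.5)                       # subject random effect
  x <- rnorm(length(t_j))
  data.frame(id = i, x = x, y = beta0 * x + b + rnorm(length(t_j), 0, 0.3))
}))

# Subject-level score contributions g_i(beta) = sum_j x_ij (y_ij - x_ij beta)
g_i <- function(beta) sapply(split(dat, dat$id),
                             function(d) sum(d$x * (d$y - d$x * beta)))

# EL test that E{g_i(beta)} = 0 at a hypothesized beta:
el.test(g_i(1.5), mu = 0)$Pval   # at the truth: large p-value expected
el.test(g_i(1.0), mu = 0)$Pval   # away from the truth: small p-value
```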
The authors consider the analysis of hierarchical longitudinal functional data based upon a functional principal components approach. In contrast to standard frequentist approaches to selecting the number of principal components, the authors perform model averaging using a Bayesian formulation. A relatively straightforward reversible jump Markov chain Monte Carlo formulation has poor mixing properties and in simulated data often becomes trapped at the wrong number of principal components. To overcome this, the authors show how to apply Stochastic Approximation Monte Carlo (SAMC) to this problem, a method that has the potential to explore the entire space and does not become trapped in local extrema. The combination of reversible jump methods and SAMC in hierarchical longitudinal functional data is simplified by a polar coordinate representation of the principal components. The approach is easy to implement, and in simulated data it does well in determining the distribution of the number of principal components and in terms of its frequentist estimation properties. Empirical applications are also presented.
Functional data analysis; Hierarchical models; Longitudinal data; Markov chain Monte Carlo; Principal components; Stochastic approximation
We study generalized additive partial linear models, proposing the use of polynomial spline smoothing for estimation of nonparametric functions, and deriving quasi-likelihood based estimators for the linear parameters. We establish asymptotic normality for the estimators of the parametric components. The procedure avoids solving large systems of equations as in kernel-based procedures and thus results in gains in computational simplicity. We further develop a class of variable selection procedures for the linear parameters by employing a nonconcave penalized quasi-likelihood, which is shown to have an asymptotic oracle property. Monte Carlo simulations and an empirical example are presented for illustration.
Backfitting; generalized additive models; generalized partially linear models; LASSO; nonconcave penalized likelihood; penalty-based variable selection; polynomial spline; quasi-likelihood; SCAD; shrinkage methods
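A minimal sketch of the two ingredients described above, under illustrative simulated data: the nonparametric component enters through a polynomial spline basis (left unpenalized via penalty.factor = 0), while the linear coefficients are selected with the SCAD penalty using ncvreg.

```r
library(splines)
library(ncvreg)

set.seed(6)
n <- 400
t <- runif(n)                          # covariate modeled nonparametrically
Z <- matrix(rnorm(n * 10), n, 10)      # linear covariates, most irrelevant
eta <- sin(2 * pi * t) + Z[, 1] - 0.8 * Z[, 2]
y <- rbinom(n, 1, plogis(eta))

B <- bs(t, df = 6)                     # polynomial spline basis for f(t)
X <- cbind(B, Z)
pf <- c(rep(0, ncol(B)), rep(1, ncol(Z)))  # do not penalize the spline terms

cvfit <- cv.ncvreg(X, y, family = "binomial", penalty = "SCAD",
                   penalty.factor = pf)
coef(cvfit)[-(1:(ncol(B) + 1))]        # selected linear coefficients
```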
The authors describe a statistical method of combining self-reports and biomarkers that, with adequate control for confounding, will provide nearly unbiased estimates of diet-disease associations and a valid test of the null hypothesis of no association. The method is based on regression calibration. In cases in which the diet-disease association is mediated by the biomarker, the association needs to be estimated as the total dietary effect in a mediation model. However, the hypothesis of no association is best tested through a marginal model that includes as the exposure the regression calibration-estimated intake but not the biomarker. The authors illustrate the method with data from the Carotenoids and Age-Related Eye Disease Study (2001–2004) and show that inclusion of the biomarker in the regression calibration-estimated intake increases the statistical power. This development sheds light on previous analyses of diet-disease associations reported in the literature.
bias (epidemiology); carotenoids; cataract; lutein; measurement error; sample size
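A minimal sketch of the regression-calibration step described above, with hypothetical data: a calibration substudy regresses a reference measurement on the self-report and the biomarker, and the predicted intake is then carried into a marginal disease model, matching the null-hypothesis-testing setup discussed in the abstract.

```r
set.seed(7)
n_cal <- 300; n_main <- 2000

# Calibration substudy: `ref` is a reference instrument assumed unbiased
# for true intake; `ffq` and `biomarker` are the routinely collected data.
cal <- data.frame(ffq = rnorm(n_cal), biomarker = rnorm(n_cal))
cal$ref <- 0.5 * cal$ffq + 0.4 * cal$biomarker + rnorm(n_cal, 0, 0.6)
calib <- lm(ref ~ ffq + biomarker, data = cal)

# Main study: plug the calibrated intake into the disease model as the
# exposure (biomarker not entered separately, per the marginal model above).
main <- data.frame(ffq = rnorm(n_main), biomarker = rnorm(n_main))
main$intake_hat <- predict(calib, newdata = main)
main$case <- rbinom(n_main, 1, plogis(-2 + 0.5 * main$intake_hat))
summary(glm(case ~ intake_hat, family = binomial, data = main))
```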
Motivated by the problem of testing for genetic effects on complex traits in the presence of gene-environment interaction, we develop score tests in general semiparametric regression problems that involve a Tukey-style 1-degree-of-freedom form of interaction between parametrically and non-parametrically modelled covariates. We find that the score test in this type of model, as recently developed by Chatterjee and co-workers in the fully parametric setting, is biased and requires undersmoothing to be valid in the presence of non-parametric components. Moreover, in the presence of repeated outcomes, the asymptotic distribution of the score test depends on the estimation of functions which are defined as solutions of integral equations, making implementation difficult and computationally taxing. We develop profiled score statistics which are unbiased and asymptotically efficient and can be performed by using standard bandwidth selection methods. In addition, to overcome the difficulty of solving functional equations, we give easy interpretations of the target functions, which in turn allow us to develop estimation procedures that can be easily implemented by using standard computational methods. We present simulation studies to evaluate the type I error and power of the proposed method compared with a naive test that does not consider interaction. Finally, we illustrate our methodology by analysing data from a case-control study of colorectal adenoma that was designed to investigate the association between colorectal adenoma and the candidate gene NAT2 in relation to smoking history.
Additive models; Diplotypes; Function estimation; Non-parametric regression; Omnibus hypothesis testing; Partially linear model; Repeated measures; Score test; Semiparametric models; Smooth backfitting; Tukey’s 1 degree-of-freedom model
Case-control association studies often aim to investigate the role of genes and gene-environment interactions in terms of the underlying haplotypes (i.e., the combinations of alleles at multiple genetic loci along chromosomal regions). The goal of this article is to develop robust but efficient approaches to the estimation of disease odds-ratio parameters associated with haplotypes and haplotype-environment interactions. We consider “shrinkage” estimation techniques that can adaptively relax the model assumptions of Hardy–Weinberg equilibrium and gene-environment independence required by recently proposed efficient “retrospective” methods. Our proposal involves first development of a novel retrospective approach to the analysis of case-control data, one that is robust to the nature of the gene-environment distribution in the underlying population. Next, it involves shrinkage of the robust retrospective estimator toward a more precise, but model-dependent, retrospective estimator using novel empirical Bayes and penalized regression techniques. Methods for variance estimation are proposed based on asymptotic theories. Simulations and two data examples illustrate both the robustness and efficiency of the proposed methods.
Empirical Bayes; Genetic epidemiology; LASSO (Least Absolute Shrinkage and Selection Operator); Model averaging; Model robustness; Model selection
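A generic sketch of the shrinkage principle described above, reduced to a scalar parameter: the robust estimator is shrunk toward the model-based one by a weight driven by their estimated discrepancy. This is an illustration of the empirical-Bayes idea, not the authors' full haplotype-based procedure.

```r
# theta_robust: estimate valid without HWE / independence assumptions;
# theta_model: more precise estimate that relies on those assumptions;
# var_robust: estimated variance of the robust estimator.
eb_shrink <- function(theta_robust, theta_model, var_robust) {
  d2 <- (theta_robust - theta_model)^2
  w <- d2 / (d2 + var_robust)   # near 0 if the model looks right, near 1 if not
  theta_model + w * (theta_robust - theta_model)
}

# Agreement: shrink hard toward the precise model-based estimate.
eb_shrink(theta_robust = 0.42, theta_model = 0.40, var_robust = 0.02)
# Strong disagreement: essentially keep the robust estimate.
eb_shrink(theta_robust = 0.90, theta_model = 0.10, var_robust = 0.02)
```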
We consider the problem of testing for a constant nonparametric effect in a general semiparametric regression model when there is the potential for interaction between the parametrically and nonparametrically modeled variables. The work was originally motivated by a unique testing problem in genetic epidemiology (Chatterjee et al., 2006) that involved a typical generalized linear model with an additional term reminiscent of the Tukey one-degree-of-freedom formulation; the interest there was in testing for main effects of the genetic variables while gaining statistical power by allowing for a possible interaction between genes and the environment. Later work (Maity et al., 2009) allowed the environmental variable to be modeled nonparametrically, but focused on whether there was a parametric main effect for the genetic variables. In this paper, we consider the complementary problem, where the interest is in testing for the main effect of the nonparametrically modeled environmental variable. We derive a generalized likelihood ratio test for this hypothesis, show how to implement it, and provide evidence that our method can improve statistical power when compared to standard partially linear models with main effects only. We use the method for the primary purpose of analyzing data from a case-control study of colorectal adenoma.
Function estimation; Generalized likelihood ratio; Interactions; Non-parametric regression; Partially linear logistic model; Partially linear model; Semiparametric models
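A minimal sketch of the hypothesis described above, namely whether the nonparametrically modeled environmental effect is constant, using mgcv model comparison as a stand-in for the paper's generalized likelihood ratio test; the Tukey-style interaction term is omitted and the data are simulated.

```r
library(mgcv)

set.seed(8)
n <- 600
g <- rbinom(n, 1, 0.5)                        # "genetic" covariate (parametric)
e <- runif(n, -2, 2)                          # environmental covariate (smooth)
y <- rbinom(n, 1, plogis(-0.5 + 0.4 * g + 0.6 * sin(e)))
dat <- data.frame(y, g, e)

fit0 <- gam(y ~ g, family = binomial, data = dat)          # constant E effect
fit1 <- gam(y ~ g + s(e), family = binomial, data = dat)   # smooth E effect
anova(fit0, fit1, test = "Chisq")             # evidence against constancy
```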
Estimation of covariance function parameters of the error process in the presence of an unknown smooth trend is an important problem because solving it allows one to estimate the trend nonparametrically using a smoother corrected for dependence in the errors. Our work is motivated by spatial statistics but is applicable to other contexts where the dimension of the index set can exceed one. We obtain an estimator of the covariance function parameters by regressing squared differences of the response on their expectations, which equal the variogram plus an offset term induced by the trend. Existing estimators that ignore the trend produce bias in the estimates of the variogram parameters, which our procedure corrects for. Our estimator can be justified asymptotically under the increasing domain framework. Simulation studies suggest that our estimator compares favorably with those in the current literature while making less restrictive assumptions. We use our method to estimate the variogram parameters of the short-range spatial process in a U.S. precipitation data set.
Bias; Covariance function; Nonlinear regression; Nonparametric regression; Spatio-temporal dependence; Time series
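A minimal sketch of the core regression described above: squared differences of the response are regressed, via nonlinear least squares, on their expectation under an exponential variogram. The data are simulated without a trend, so the trend-induced offset term that the paper corrects for does not appear here.

```r
set.seed(9)
n <- 150
s <- cbind(runif(n), runif(n))                  # spatial locations in the unit square
d <- as.matrix(dist(s))
sigma2 <- 1; phi <- 0.2; tau2 <- 0.1            # partial sill, range, nugget
Sigma <- sigma2 * exp(-d / phi) + tau2 * diag(n)
y <- drop(crossprod(chol(Sigma), rnorm(n)))     # mean-zero Gaussian field

ij <- which(upper.tri(d), arr.ind = TRUE)       # all distinct location pairs
df <- data.frame(h = d[ij], sqdiff = (y[ij[, 1]] - y[ij[, 2]])^2)

# E(sqdiff) = 2 * gamma(h) = 2 * (tau2 + sigma2 * (1 - exp(-h / phi)))
fit <- nls(sqdiff ~ 2 * (t2 + s2 * (1 - exp(-h / p))), data = df,
           start = list(t2 = 0.05, s2 = 0.5, p = 0.1))
coef(fit)                                       # compare with (0.1, 1, 0.2)
```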
Scientists in nutrition and other biological sciences often design factorial studies to test hypotheses of interest and importance. In the case of two-factor studies, it is widely recognized that the analysis of factor effects is generally based on treatment means when the interaction of the factors is statistically significant, and involves multiple comparisons of treatment means. However, when the two factors do not interact, a common understanding among biologists is that comparisons among treatment means cannot or should not be made. Here, we bring this misconception to the attention of researchers. Additionally, we indicate what kinds of comparisons among the treatment means can be performed when there is a nonsignificant interaction between the two factors. Such information should be useful in analyzing experimental data and drawing meaningful conclusions.
Factor level means; Interaction; Main effects; Multiple comparison; Treatment means; Two-factor studies
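A minimal sketch of the point made above, with simulated two-factor data: when the interaction is nonsignificant, comparisons of the factor-level (marginal) means are meaningful and are easily obtained, for example with the emmeans package.

```r
library(emmeans)

set.seed(10)
dat <- expand.grid(A = factor(1:3), B = factor(1:2), rep = 1:8)
# Additive effects only, so the A x B interaction is truly absent.
dat$y <- with(dat, as.numeric(A) + 0.5 * (B == "2")) + rnorm(nrow(dat), 0, 1)

fit <- aov(y ~ A * B, data = dat)
summary(fit)                         # interaction expected nonsignificant
emmeans(fit, pairwise ~ A)           # compare the marginal means of factor A
emmeans(fit, pairwise ~ B)           # and of factor B
```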
To develop a method to validate an FFQ for reported intake of episodically consumed foods when the reference instrument measures short-term intake, and to apply the method in a large prospective cohort.
The FFQ was evaluated in a sub-study of cohort participants who, in addition to the questionnaire, were asked to complete two non-consecutive 24 h dietary recalls (24HR). FFQ-reported intakes of twenty-nine food groups were analysed using a two-part measurement error model that allows for nonconsumption on a given day, using 24HR as a reference instrument under the assumption that 24HR is unbiased for true intake at the individual level.
The National Institutes of Health–AARP Diet and Health Study, a cohort of 567 169 participants living in the USA and aged 50–71 years at baseline in 1995.
A sub-study of the cohort consisting of 2055 participants.
Estimated correlations of true and FFQ-reported energy-adjusted intakes were 0·5 or greater for most of the twenty-nine food groups evaluated, and estimated attenuation factors (a measure of bias in estimated diet–disease associations) were 0·4 or greater for most food groups.
The proposed methodology extends the class of foods and nutrients for which an FFQ can be evaluated in studies with short-term reference instruments. Although violations of the assumption that the 24HR is unbiased could be inflating some of the observed correlations and attenuation factors, results suggest that the FFQ is suitable for testing many, but not all, diet–disease hypotheses in a cohort of this size.
Diet; Food; Epidemiological methods; Questionnaires; Validation studies
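A minimal sketch of the two quantities reported above, in the simplest single-nutrient case with simulated data: with the 24HR assumed unbiased for true intake, the attenuation factor is the slope from regressing a recall on the FFQ, and the FFQ-truth correlation follows from using two replicate recalls to isolate the true-intake variance. The two-part consumption-day structure of the full model is ignored.

```r
set.seed(11)
n <- 2000
truth <- rnorm(n, 5, 1)                             # usual intake
ffq <- 1 + 0.6 * truth + rnorm(n, 0, 1.2)           # biased, noisy FFQ report
r1 <- truth + rnorm(n); r2 <- truth + rnorm(n)      # two unbiased 24HRs

lambda <- coef(lm(r1 ~ ffq))[2]                     # attenuation factor
var_truth <- cov(r1, r2)                            # replicates isolate var(truth)
rho <- lambda * sd(ffq) / sqrt(var_truth)           # corr(truth, FFQ)
c(attenuation = unname(lambda), correlation = unname(rho))
```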
Our objective was to identify the distribution of the endangered golden-cheeked warbler (Setophaga chrysoparia) in fragmented oak–juniper woodlands by applying a geoadditive semiparametric occupancy model. The aim was to better assist decision-makers in identifying suitable habitat across the species’ breeding range on which conservation or mitigation activities can be focused, and thus to prioritize management and conservation planning.
We used repeated double-observer detection/non-detection surveys of randomly selected (n = 287) patches of potential habitat to evaluate warbler patch-scale presence across the species’ breeding range. We used a geoadditive semiparametric occupancy model with remotely sensed habitat metrics (patch size and landscape composition) to predict patch-scale occupancy of golden-cheeked warblers in the fragmented oak–juniper woodlands of central Texas, USA.
Our spatially explicit model indicated that golden-cheeked warbler patch occupancy declined from south to north within the breeding range concomitant with reductions in the availability of large habitat patches. We found that 59% of woodland patches, primarily in the northern and central portions of the warbler’s range, were predicted to have occupancy probabilities ≤0.10 with only 3% of patches predicted to have occupancy probabilities >0.90. Our model exhibited high prediction accuracy (area under curve = 0.91) when validated using independently collected warbler occurrence data.
We have identified a distinct spatial occurrence gradient for golden-cheeked warblers as well as a relationship between two measurable landscape characteristics. Because habitat-occupancy relationships were key drivers of our model, our results can be used to identify potential areas where conservation actions supporting habitat mitigation can occur and identify areas where conservation of future potential habitat is possible. Additionally, our results can be used to focus resources on maintenance and creation of patches that are more likely to harbour viable local warbler populations.
Bayesian inference; golden-cheeked warbler; habitat conservation; occupancy; semiparametric regression; Setophaga chrysoparia
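A minimal sketch of a patch-occupancy fit in the spirit of the model above, using the unmarked package on simulated patches; the geoadditive spatial surface is approximated by natural-spline expansions of the patch coordinates rather than the authors' semiparametric smoother, and all covariates and rates are hypothetical.

```r
library(unmarked)
library(splines)

set.seed(12)
n <- 287                                    # patches, matching the design above
site <- data.frame(x = runif(n), y = runif(n),
                   patch_size = rlnorm(n), landcomp = runif(n))
# South-to-north occupancy gradient plus a patch-size effect:
psi <- plogis(-1 + 0.8 * log(site$patch_size) + 2 * (site$y - 0.5))
z <- rbinom(n, 1, psi)                      # latent occupancy state
p_det <- 0.6                                # per-visit detection probability
Y <- matrix(rbinom(n * 2, 1, p_det * z), n, 2)   # two-visit detection history

umf <- unmarkedFrameOccu(y = Y, siteCovs = site)
fit <- occu(~ 1 ~ log(patch_size) + landcomp + ns(x, 3) + ns(y, 3),
            data = umf)
summary(fit)
predict(fit, type = "state")[1:5, ]          # patch occupancy probabilities
```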
We devise methods to estimate probability density functions of several populations using observations with uncertain population membership, meaning that the population from which an observation comes is unknown. The probability of an observation being sampled from any given population can, however, be calculated. We develop general estimation procedures and bandwidth selection methods for our setting. We establish large-sample properties and study finite-sample performance using simulation studies. We illustrate our methods with data from a nutrition study.
Cross-validation; Least squares; Plug-in method; Uncertain sample
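A minimal sketch of the core estimator described above with simulated data: a kernel density estimate for each population in which every observation is weighted by its probability of membership in that population. Bandwidths are left at the stats::density default, unlike the tailored selectors developed in the paper.

```r
set.seed(13)
n <- 500
pop <- rbinom(n, 1, 0.4) + 1                     # latent population label
x <- ifelse(pop == 1, rnorm(n, 0, 1), rnorm(n, 3, 1))

# Membership probabilities, assumed computable (here: posterior under a
# two-component normal mixture with known parameters).
prior1 <- 0.6
w1 <- prior1 * dnorm(x, 0, 1) /
      (prior1 * dnorm(x, 0, 1) + (1 - prior1) * dnorm(x, 3, 1))

f1 <- density(x, weights = w1 / sum(w1))          # population-1 density
f2 <- density(x, weights = (1 - w1) / sum(1 - w1))# population-2 density
plot(f1, main = "Weighted KDEs"); lines(f2$x, f2$y, lty = 2)
```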
When employing model selection methods with oracle properties such as the smoothly clipped absolute deviation (SCAD) and the Adaptive Lasso, it is typical to estimate the smoothing parameter by m-fold cross-validation, for example, m = 10. In problems where the true regression function is sparse and the signals large, such cross-validation typically works well. However, in regression modeling of genomic studies involving single nucleotide polymorphisms (SNPs), the true regression functions, while thought to be sparse, do not have large signals. We demonstrate empirically that in such problems, the number of selected variables using SCAD and the Adaptive Lasso, with 10-fold cross-validation, is a random variable that has considerable and surprising variation. Similar remarks apply to non-oracle methods such as the Lasso. Our study strongly questions the suitability of performing only a single run of m-fold cross-validation with any oracle method, and not just the SCAD and Adaptive Lasso.
Adaptive Lasso; Lasso; Model selection; Oracle estimation
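A minimal sketch of the instability experiment described above: rerun 10-fold cross-validated SCAD on the same simulated data with small signals, and record how many variables are selected each time.

```r
library(ncvreg)

set.seed(14)
n <- 200; p <- 100
X <- matrix(rnorm(n * p), n, p)
beta <- c(rep(0.25, 5), rep(0, p - 5))           # sparse, weak signals
y <- drop(X %*% beta) + rnorm(n)

# Each replicate uses a fresh random fold assignment, as a single run would.
n_selected <- replicate(25, {
  cvfit <- cv.ncvreg(X, y, penalty = "SCAD", nfolds = 10)
  sum(coef(cvfit)[-1] != 0)                      # drop the intercept
})
table(n_selected)                                # wide spread across runs
```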
This paper addresses the problem of detecting the presence and location of a small low-emission source inside an object when the background noise dominates. This problem arises, for instance, in some homeland security applications. The goal is to reach signal-to-noise ratio levels on the order of 10^-3. A Bayesian approach to this problem is implemented in 2D. The method allows inference not only about the existence of the source, but also about its location. We derive Bayes factors for model selection and estimation of location based on Markov chain Monte Carlo simulation. A simulation study shows that, with a sufficiently high total emission level, our method can effectively locate the source.
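A minimal sketch of the model-selection calculation described above, in a simplified form: Poisson counts on a detector grid under background-only versus background-plus-source models, with the Bayes factor computed by brute-force quadrature over a location grid instead of the paper's MCMC; all rates and the source location are hypothetical.

```r
set.seed(15)
g <- 20; b <- 50                                  # grid size, background rate
xs <- seq(0, 1, length.out = g)
cells <- expand.grid(x = xs, y = xs)
src <- c(0.3, 0.7); strength <- 12                # source weak relative to background
rate <- function(loc) b + strength *
  exp(-50 * ((cells$x - loc[1])^2 + (cells$y - loc[2])^2))
counts <- rpois(g^2, rate(src))                   # observed detector counts

loglik <- function(loc) sum(dpois(counts, rate(loc), log = TRUE))
ll0 <- sum(dpois(counts, b, log = TRUE))          # H0: background only
ll1 <- apply(cells, 1, loglik)                    # H1 likelihood over locations
# Uniform location prior: marginal likelihood = mean over the grid
log_BF <- log(mean(exp(ll1 - max(ll1)))) + max(ll1) - ll0
log_BF                                            # > 0 favors a source present
which.max(ll1)                                    # most likely source cell
```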
With a binary response Y, the dose-response model under consideration is logistic in flavor, with pr(Y=1 | D) = R/(1+R), R = λ0 + EAR·D, where λ0 is the baseline incidence rate and EAR is the excess absolute risk per gray. The calculated thyroid dose of a person i is expressed as Di^mes = fi · Qi^mes / Mi^mes, where Qi^mes is the measured content of radioiodine in the thyroid gland of person i at time tmes, Mi^mes is the estimate of the thyroid mass, and fi is the normalizing multiplier. The Qi and Mi are measured with multiplicative errors Vi^Q and Vi^M, so that Qi^mes = Qi^tr · Vi^Q (a classical measurement error model) and Mi^tr = Mi^mes · Vi^M (a Berkson measurement error model). Here, Qi^tr is the true content of radioactivity in the thyroid gland, and Mi^tr is the true value of the thyroid mass. The error in fi is much smaller than the errors in Qi^mes and Mi^mes and is ignored in the analysis.
Using parametric full maximum likelihood and regression calibration (under the assumption that the true doses are lognormally distributed), nonparametric full maximum likelihood, nonparametric regression calibration, and a properly tuned SIMEX method, we study the influence of measurement errors in thyroid dose on the estimates of λ0 and EAR. A simulation study is presented, based on a real sample from the epidemiological studies. The doses were reconstructed in the framework of the Ukrainian-American project on the investigation of post-Chernobyl thyroid cancers in Ukraine, and the underlying subpopulation was artificially enlarged in order to increase the statistical power. The true risk parameters were set to values from earlier epidemiological studies, and the binary response was then simulated according to the dose-response model.
Berkson measurement error; Chornobyl accident; classical measurement error; estimation of radiation risk; full maximum likelihood estimating procedure; regression calibration; SIMEX estimator; uncertainties in thyroid dose
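A minimal sketch of the SIMEX idea named above for a classical multiplicative dose error: on the log scale the error is additive, so the simex package can be applied to a naive logistic fit. The error size, risk model, and data are hypothetical stand-ins for the paper's dose-response setting.

```r
library(simex)

set.seed(16)
n <- 2000
log_dose_true <- rnorm(n, 0, 1)
sd_u <- 0.4                                      # assumed known log-scale error SD
log_dose_mes <- log_dose_true + rnorm(n, 0, sd_u)
y <- rbinom(n, 1, plogis(-3 + 0.8 * log_dose_true))

# simex() requires the naive model fitted with x = TRUE, y = TRUE
naive <- glm(y ~ log_dose_mes, family = binomial, x = TRUE, y = TRUE)
fit <- simex(naive, SIMEXvariable = "log_dose_mes",
             measurement.error = sd_u, B = 50)
cbind(naive = coef(naive), simex = coef(fit))    # SIMEX reduces attenuation
```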
There has been great public health interest in estimating usual, i.e., long-term average, intake of episodically consumed dietary components that are not consumed daily by everyone, e.g., fish, red meat and whole grains. Short-term measurements of episodically consumed dietary components have zero-inflated skewed distributions. So-called two-part models have been developed for such data in order to correct for measurement error due to within-person variation and to estimate the distribution of usual intake of the dietary component in the univariate case. However, there is arguably much greater public health interest in the usual intake of an episodically consumed dietary component adjusted for energy (caloric) intake, e.g., ounces of whole grains per 1000 kilo-calories, which reflects usual dietary composition and adjusts for different total amounts of caloric intake. Because of this public health interest, it is important to have models to fit such data, and it is important that the model-fitting methods can be applied to all episodically consumed dietary components.
We have recently developed a nonlinear mixed effects model (Kipnis et al., 2010) and have fit it by maximum likelihood using nonlinear mixed effects programs and methodology (the SAS NLMIXED procedure). Maximum likelihood fitting of such a nonlinear mixed model is generally slow because of the 3-dimensional adaptive Gaussian quadrature, and there are times when the programs either fail to converge or converge to models with a singular covariance matrix. For these reasons, we develop a Markov chain Monte Carlo (MCMC) approach to fitting this model, which allows for both frequentist and Bayesian inference. There are technical challenges to developing this solution because one of the covariance matrices in the model is patterned. Our main application is to the National Institutes of Health (NIH)-AARP Diet and Health Study, where we illustrate our methods for modeling the energy-adjusted usual intake of fish and whole grains. We demonstrate numerically that our methods lead to increased speed of computation, converge to reasonable solutions, and have the flexibility to be used in either a frequentist or a Bayesian manner.
Bayesian approach; latent variables; measurement error; mixed effects models; nutritional epidemiology; zero-inflated data
Case-control studies are widely used to detect gene-environment interactions in the etiology of complex diseases. Many variables that are of interest to biomedical researchers are difficult to measure on an individual level, e.g. nutrient intake, cigarette smoking exposure, or long-term toxic exposure. Measurement error causes bias in parameter estimates, thus masking key features of the data and leading to loss of power and spurious or masked associations. We develop a Bayesian methodology for the analysis of case-control studies in the case where measurement error is present in an environmental covariate and the genetic variable has missing data. This approach offers several advantages. It allows prior information to enter the model to make estimation and inference more precise. The environmental covariates measured exactly are modeled completely nonparametrically. Further, information about the probability of disease can be incorporated in the estimation procedure to improve the quality of parameter estimates, which cannot be done in conventional case-control studies. A unique feature of the procedure under investigation is that the analysis is based on a pseudo-likelihood function; therefore, conventional Bayesian techniques may not be technically correct. We propose an approach using Markov chain Monte Carlo sampling as well as a computationally simple method based on an asymptotic posterior distribution. Simulation experiments demonstrated that our method produces parameter estimates that are nearly unbiased even for small sample sizes. An application of our method is illustrated using a population-based case-control study of the association between calcium intake and the risk of colorectal adenoma development.
Bayesian inference; Errors in variables; Gene-environment interactions; Markov Chain Monte Carlo sampling; Missing data; Pseudo-likelihood; Semiparametric methods
We propose statistical methods for comparing phenomics data generated by the Biolog Phenotype Microarray (PM) platform for high-throughput phenotyping. Instead of the routinely used visual inspection of data with no sound inferential basis, we develop two approaches. The first approach is based on quantifying the distance between mean or median curves from two treatments and then applying a permutation test; we also consider a permutation test applied to areas under mean curves. The second approach employs functional principal component analysis. Properties of the proposed methods are investigated on both simulated data and data sets from the PM platform.
functional data analysis; principal components; permutation tests; phenotype microarrays; high-throughput phenotyping; phenomics; Biolog
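A minimal sketch of the first approach described above: an L2 distance between the two treatments' mean curves, with a permutation null obtained by reshuffling labels. The curves are simulated stand-ins for Biolog PM respiration readings.

```r
set.seed(17)
m <- 50                                           # time points per curve
# 12 replicate wells per treatment, logistic-shaped growth plus noise:
ctrl  <- t(replicate(12, 20 / (1 + exp(-((1:m) - 25) / 4)) + rnorm(m, 0, 1)))
treat <- t(replicate(12, 24 / (1 + exp(-((1:m) - 22) / 4)) + rnorm(m, 0, 1)))

l2_stat <- function(A, B) sum((colMeans(A) - colMeans(B))^2)
obs <- l2_stat(ctrl, treat)

all_curves <- rbind(ctrl, treat)
perm <- replicate(2000, {
  idx <- sample(nrow(all_curves))                 # reshuffle treatment labels
  l2_stat(all_curves[idx[1:12], ], all_curves[idx[-(1:12)], ])
})
mean(perm >= obs)                                 # permutation p-value
```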
Over the past two decades, there have been revolutionary developments in life science technologies characterized by high throughput, high efficiency, and rapid computation. Nutritionists now have advanced methodologies for the analysis of DNA, RNA, proteins, and low-molecular-weight metabolites, as well as access to bioinformatics databases. Statistics, which can be defined as the process of making scientific inferences from data that contain variability, has historically played an integral role in advancing nutritional sciences. Currently, in the era of systems biology, statistics has become an increasingly important tool to quantitatively analyze information about biological macromolecules. This article describes general terms used in the statistical analysis of large, complex experimental data. These terms include experimental design, power analysis, sample size calculation, and experimental errors (type I and II errors) for nutritional studies at the population, tissue, cellular, and molecular levels. In addition, we highlight various sources of experimental variation in studies involving microarray gene expression, real-time polymerase chain reaction, proteomics, and other bioinformatics technologies. Moreover, we provide guidelines for nutritionists and other biomedical scientists to plan and conduct studies and to analyze the complex data. Appropriate statistical analyses are expected to make an important contribution to solving major nutrition-associated problems in humans and animals (including obesity, diabetes, cardiovascular disease, cancer, ageing, and intrauterine growth retardation).
Bioinformatics; Nutrition research; Statistical analysis; Systems biology
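Since the article discusses power analysis and sample size calculation, here is a minimal example of the standard calculation for a two-sample comparison; the effect size and standard deviation are illustrative.

```r
# Sample size per group to detect a mean difference of 0.5 (SD = 1) with a
# two-sample t-test at the 5% level and 80% power:
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.8,
             type = "two.sample")     # roughly 64 subjects per group
```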