We propose Bayesian methods for Gaussian graphical models that lead to sparse and adaptively shrunk estimators of the precision (inverse covariance) matrix. Our methods are based on lasso-type regularization priors leading to parsimonious parameterization of the precision matrix, which is essential in several applications involving learning relationships among the variables. In this context, we introduce a novel type of selection prior that develops a sparse structure on the precision matrix by making most of the elements exactly zero, in addition to ensuring positive definiteness, thus conducting model selection and estimation simultaneously. More importantly, we extend these methods to analyze clustered data using finite mixtures of Gaussian graphical models and infinite mixtures of Gaussian graphical models. We discuss appropriate posterior simulation schemes to implement posterior inference in the proposed models, including the evaluation of normalizing constants that are functions of parameters of interest, which result from the restriction of positive definiteness on the correlation matrix. We evaluate the operating characteristics of our method via several simulations and demonstrate the application to real data examples in genomics.
Bayesian; covariance selection; finite mixtures; Gaussian graphical models; infinite mixtures; sparse modeling
We consider the problem of estimating the density of a random variable when precise measurements on the variable are not available, but replicated proxies contaminated with measurement error are available for sufficiently many subjects. Under the assumption of additive measurement errors this reduces to a problem of deconvolution of densities. Deconvolution methods often make restrictive and unrealistic assumptions about the density of interest and the distribution of measurement errors, e.g., normality and homoscedasticity and thus independence from the variable of interest. This article relaxes these assumptions and introduces novel Bayesian semiparametric methodology based on Dirichlet process mixture models for robust deconvolution of densities in the presence of conditionally heteroscedastic measurement errors. In particular, the models can adapt to asymmetry, heavy tails and multimodality. In simulation experiments, we show that our methods vastly outperform a recent Bayesian approach based on estimating the densities via mixtures of splines. We apply our methods to data from nutritional epidemiology. Even in the special case when the measurement errors are homoscedastic, our methodology is novel and dominates other methods that have been proposed previously. Additional simulation results, instructions on getting access to the data set and R programs implementing our methods are included as part of online supplemental materials.
B-spline; Conditional heteroscedasticity; Density deconvolution; Dirichlet process mixture models; Measurement errors; Skew-normal distribution; Variance function
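As a sketch of why additive measurement error turns density estimation into deconvolution, the relationship can be written as follows. The symbols $W_{ij}$ (the $j$-th replicated proxy for subject $i$), $X_i$ (the unobserved variable of interest), $U_{ij}$ (the measurement error), and $m_i$ (the number of replicates) are notation assumed here for illustration and are not taken from the article.

```latex
\begin{align*}
W_{ij} &= X_i + U_{ij}, \qquad j = 1, \dots, m_i, \\
f_W(w) &= \int f_X(x)\, f_U(w - x)\, dx
  && \text{if } U_{ij} \text{ is independent of } X_i, \\
f_W(w) &= \int f_X(x)\, f_{U \mid X}(w - x \mid x)\, dx
  && \text{if the error distribution may depend on } X_i .
\end{align*}
```

The second line is the classical convolution that deconvolution methods invert. The third line covers the conditionally heteroscedastic case, where the error density is allowed to change with the value of $X_i$.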
Partial differential equation (PDE) models are commonly used to model complex dynamic systems in applied sciences such as biology and finance. The forms of these PDE models are usually proposed by experts based on their prior knowledge and understanding of the dynamic system. Parameters in PDE models often have interesting scientific interpretations, but their values are often unknown and need to be estimated from measurements of the dynamic system in the presence of measurement errors. Most PDEs used in practice have no analytic solutions and can only be solved with numerical methods. Currently, methods for estimating PDE parameters require repeatedly solving PDEs numerically under thousands of candidate parameter values, and thus the computational load is high. In this article, we propose two methods to estimate parameters in PDE models: a parameter cascading method and a Bayesian approach. In both methods, the underlying dynamic process modeled with the PDE model is represented via basis function expansion. For the parameter cascading method, we develop two nested levels of optimization to estimate the PDE parameters. For the Bayesian method, we develop a joint model for data and the PDE, and develop a novel hierarchical model allowing us to employ Markov chain Monte Carlo (MCMC) techniques to make posterior inference. Simulation studies show that the Bayesian method and parameter cascading method are comparable, and both outperform other available methods in terms of estimation accuracy. The two methods are demonstrated by estimating parameters in a PDE model from LIDAR data.
Asymptotic theory; Basis function expansion; Bayesian method; Differential equations; Measurement error; Parameter cascading
This paper addresses the problem of detecting the presence and location of a small, low-emission source inside an object when the background noise dominates. This problem arises, for instance, in some homeland security applications. The goal is to reach signal-to-noise ratio levels on the order of 10^-3. A Bayesian approach to this problem is implemented in 2D. The method allows inference not only about the existence of the source, but also about its location. We derive Bayes factors for model selection and estimation of location based on Markov chain Monte Carlo simulation. A simulation study shows that with sufficiently high total emission level, our method can effectively locate the source.
Case-control studies are widely used to detect gene-environment interactions in the etiology of complex diseases. Many variables that are of interest to biomedical researchers are difficult to measure on an individual level, e.g., nutrient intake, cigarette smoking exposure, long-term toxic exposure. Measurement error causes bias in parameter estimates, thus masking key features of data and leading to loss of power and spurious/masked associations. We develop a Bayesian methodology for analysis of case-control studies for the case when measurement error is present in an environmental covariate and the genetic variable has missing data. This approach offers several advantages. It allows prior information to enter the model to make estimation and inference more precise. The environmental covariates measured exactly are modeled completely nonparametrically. Further, information about the probability of disease can be incorporated in the estimation procedure to improve quality of parameter estimates, which cannot be done in conventional case-control studies. A unique feature of the procedure under investigation is that the analysis is based on a pseudo-likelihood function; therefore, conventional Bayesian techniques may not be technically correct. We propose an approach using Markov Chain Monte Carlo sampling as well as a computationally simple method based on an asymptotic posterior distribution. Simulation experiments demonstrated that our method produced parameter estimates that are nearly unbiased even for small sample sizes. An application of our method is illustrated using a population-based case-control study of the association of calcium intake with the risk of colorectal adenoma development.
Bayesian inference; Errors in variables; Gene-environment interactions; Markov Chain Monte Carlo sampling; Missing data; Pseudo-likelihood; Semiparametric methods
We propose statistical methods for comparing phenomics data generated by the Biolog Phenotype Microarray (PM) platform for high-throughput phenotyping. Instead of the routinely used visual inspection of data with no sound inferential basis, we develop two approaches. The first approach is based on quantifying the distance between mean or median curves from two treatments and then applying a permutation test; we also consider a permutation test applied to areas under mean curves. The second approach employs functional principal component analysis. Properties of the proposed methods are investigated on both simulated data and data sets from the PM platform.
functional data analysis; principal components; permutation tests; phenotype microarrays; high-throughput phenotyping; phenomics; Biolog
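The first approach described above (a distance between mean curves followed by a permutation test) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the root-mean-square distance, the function names `mean_curve_distance` and `permutation_test`, and the assumption that every curve is sampled on a common time grid are all choices made here for the example.

```python
import numpy as np


def mean_curve_distance(a, b):
    """Root-mean-square distance between the pointwise mean curves.

    a, b: arrays of shape (n_curves, n_timepoints) on a shared time grid
    (an assumption of this sketch).
    """
    diff = a.mean(axis=0) - b.mean(axis=0)
    return float(np.sqrt(np.mean(diff ** 2)))


def permutation_test(a, b, n_perm=2000, seed=0):
    """Permutation p-value for the distance between two groups' mean curves."""
    rng = np.random.default_rng(seed)
    observed = mean_curve_distance(a, b)
    pooled = np.vstack([a, b])
    n_a = a.shape[0]
    count = 0
    for _ in range(n_perm):
        # Reassign whole curves to the two treatments at random.
        idx = rng.permutation(pooled.shape[0])
        d = mean_curve_distance(pooled[idx[:n_a]], pooled[idx[n_a:]])
        if d >= observed:
            count += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (count + 1) / (n_perm + 1)
    return observed, p_value
```

Whole curves are permuted rather than individual time points, so the within-curve dependence is preserved under the null hypothesis of no treatment difference. Replacing `mean` with `np.median` along the curve axis would give the median-curve variant.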
We propose a semiparametric Bayesian method for handling measurement error in nutritional epidemiological data. Our goal is to estimate nonparametrically the form of association between a disease and exposure variable while the true values of the exposure are never observed. Motivated by nutritional epidemiological data we consider the setting where a surrogate covariate is recorded in the primary data, and a calibration data set contains information on the surrogate variable and repeated measurements of an unbiased instrumental variable of the true exposure. We develop a flexible Bayesian method where not only is the relationship between the disease and exposure variable treated semiparametrically, but also the relationship between the surrogate and the true exposure is modeled semiparametrically. The two nonparametric functions are modeled simultaneously via B-splines. In addition, we model the distribution of the exposure variable as a Dirichlet process mixture of normal distributions, thus making its modeling essentially nonparametric and placing this work into the context of functional measurement error modeling. We apply our method to the NIH-AARP Diet and Health Study and examine its performance in a simulation study.
B-splines; Dirichlet process prior; Gibbs sampling; Measurement error; Metropolis-Hastings algorithm; Partly linear model
Longitudinal studies of a binary outcome are common in the health, social, and behavioral sciences. In general, a feature of random effects logistic regression models for longitudinal binary data is that the marginal functional form, when integrated over the distribution of the random effects, is no longer of logistic form. Recently, Wang and Louis (2003) proposed a random intercept model in the clustered binary data setting where the marginal model has a logistic form. An acknowledged limitation of their model is that it allows only a single random effect that varies from cluster to cluster. In this paper, we propose a modification of their model to handle longitudinal data, allowing separate, but correlated, random intercepts at each measurement occasion. The proposed model allows for a flexible correlation structure among the random intercepts, where the correlations can be interpreted in terms of Kendall’s τ. For example, the marginal correlations among the repeated binary outcomes can decline with increasing time separation, while the model retains the property of having matching conditional and marginal logit link functions. Finally, the proposed method is used to analyze data from a longitudinal study designed to monitor cardiac abnormalities in children born to HIV-infected women.
Correlated binary data; multivariate normal distribution; probability integral transformation
Massively Parallel Signature Sequencing (MPSS) is a high-throughput counting-based technology available for gene expression profiling. It produces output that is similar to Serial Analysis of Gene Expression (SAGE) and is ideal for building complex relational databases for gene expression. Our goal is to compare the in vivo global gene expression profiles of tissues infected with different strains of Salmonella obtained using the MPSS technology. In this article, we develop an exact ANOVA type model for this count data using a zero-inflated Poisson (ZIP) distribution, different from existing methods that assume continuous densities. We adopt two Bayesian hierarchical models, one parametric and the other semiparametric with a Dirichlet process prior that has the ability to “borrow strength” across related signatures, where a signature is a specific arrangement of the nucleotides, usually 16-21 base-pairs long. We utilize the discreteness of the Dirichlet process prior to cluster signatures that exhibit similar differential expression profiles. Tests for differential expression are carried out using non-parametric approaches, while controlling the false discovery rate. We identify several differentially expressed genes that have important biological significance and conclude with a summary of the biological discoveries.
Bayesian semiparametric modeling; Bovine Salmonella infection; Dirichlet process mixture; Markov chain Monte Carlo (MCMC); Massively Parallel Signature Sequencing (MPSS); zero-inflated Poisson
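For readers unfamiliar with the zero-inflated Poisson distribution mentioned above, the following is a small sketch of its probability mass function in the standard textbook parameterization. The names `zip_pmf`, `zip_mean`, `lam` (Poisson rate), and `pi` (extra probability of a structural zero) are assumptions made for this illustration; the ANOVA-type structure and the hierarchical priors of the article are not reproduced here.

```python
import math


def zip_pmf(y, lam, pi):
    """P(Y = y) for a zero-inflated Poisson with rate lam and zero-inflation pi."""
    if y < 0:
        return 0.0
    poisson = math.exp(-lam) * lam ** y / math.factorial(y)
    if y == 0:
        # A zero arises either as a structural zero or as a Poisson zero.
        return pi + (1.0 - pi) * poisson
    return (1.0 - pi) * poisson


def zip_mean(lam, pi):
    """Mean of the zero-inflated Poisson distribution."""
    return (1.0 - pi) * lam
```

Setting `pi = 0` recovers the ordinary Poisson distribution, which makes the extra mass at zero easy to see when the two are compared on count data with many zeros.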
In this article, we present new methods to analyze data from an experiment using rodent models to investigate the role of p27, an important cell-cycle mediator, in early colon carcinogenesis. The responses modeled here are essentially functions nested within a two-stage hierarchy. Standard functional data analysis literature focuses on a single stage of hierarchy and conditionally independent functions with near white noise. However, in our experiment, there is substantial biological motivation for the existence of spatial correlation among the functions, which arises from the locations of biological structures called colonic crypts: this possible functional correlation is a phenomenon we term crypt signaling. Thus, as a point of general methodology, we require an analysis that allows for functions to be correlated at the deepest level of the hierarchy. Our approach is fully Bayesian and uses Markov chain Monte Carlo methods for inference and estimation. Analysis of this data set gives new insights into the structure of p27 expression in early colon carcinogenesis and suggests the existence of significant crypt signaling. Our methodology uses regression splines, and because of the hierarchical nature of the data, dimension reduction of the covariance matrix of the spline coefficients is important: we suggest simple methods for overcoming this problem.
Bayesian methods; Carcinogenesis; Functional data analysis; Hierarchical model; Markov chain Monte Carlo; Mixed models; Regression splines; Semiparametric methods; Spatial correlation
We propose Bayesian parametric and semiparametric partially linear regression methods to analyze outcome-dependent follow-up data, in which the random time of an individual's follow-up measurement depends on the history of both observed longitudinal outcomes and previous measurement times. We begin with the investigation of the simplifying assumptions of Lipsitz, Fitzmaurice, Ibrahim, Gelber, and Lipshultz, and present a new model for analyzing such data by allowing subject-specific correlations for the longitudinal response and by introducing a subject-specific latent variable to accommodate the association between the longitudinal measurements and the follow-up times. An extensive simulation study shows that our Bayesian partially linear regression method facilitates accurate estimation of the true regression line and the regression parameters. We illustrate our new methodology using data from a longitudinal observational study.
Bayesian cubic smoothing spline; Latent variable; Partially linear model